clamsproject / clams-python

CLAMS SDK for python
http://sdk.clams.ai/
Apache License 2.0

at_type property value specification for collections in app metadata #194

Closed keighrim closed 3 months ago

keighrim commented 4 months ago

Because

Currently in v1.1.1 (3a9bfc862027d82770d8660d32d62babea383498), the input and output annotation specs in app metadata support only simple atomic types for specifying an additional property and its fixed value. One can add multiple type specifications with different property values to represent a range of possible values for the property (a "one-of" spec, see https://github.com/clamsproject/clams-python/issues/77), since the output section of the app metadata always implies "any-of" the specified types.

https://github.com/clamsproject/clams-python/blob/3a9bfc862027d82770d8660d32d62babea383498/clams/appmetadata/__init__.py#L15

https://github.com/clamsproject/clams-python/blob/3a9bfc862027d82770d8660d32d62babea383498/clams/appmetadata/__init__.py#L72-L75

However, aside from the fact that this hacky representation of "one-of" types is hardly readable (as in https://apps.clams.ai/swt-detection/v3.0/), the current metadata schema doesn't allow assigning a collection of values to an output type property, whether to express an "all-of" specification or a more complex data type (see https://github.com/clamsproject/mmif-python/issues/252).

Done when

Based on the conclusion from https://github.com/clamsproject/mmif-python/issues/252, the app metadata schema is updated and relevant helper methods are also updated.

Additional context

No response

keighrim commented 4 months ago

Until now, the most frequent use case of the "one-of" output spec was to record possible classification results. That is, we used "one-of" representation to represent the labelset information in the app metadata. However, as discussed in https://github.com/clamsproject/mmif/issues/218 , we are likely to add a dedicated labelset prop to all Annotation subtypes, and I'd assume that would practically solve the problems with "one-of" representation, at least.

keighrim commented 4 months ago

tl;dr

In app metadata,

  1. a dev can't say that this app wants an X type with a y property without specifying the value of y.
    • proposal: "*" string to have a special wild card meaning (any value)
  2. a dev can say that this app outputs an X type with a y property whose value is one of [a, b, c], but it's utterly ugly.
    • proposal: none at the time.

Open to any other suggestions!


to specify a property with "a" value

This is the only scenario that the current app metadata I/O spec is designed for.

on input side

In the simplest case, suppose an app wants a TimeFrame with a label="speech" property (and an AudioDocument without any further property specification). The app metadata can be written as

"input": [
    { "@type": "AudioDocument", "required": true },
    { "@type": "TimeFrame", "required": true, "properties":  { "label": "speech" }}
]

on output side

As the output, the same app returns an MMIF with TextDocument and Alignment for each speech-labeled time frame.

"output": [
    { "@type": "Alignment"},  # no `required`: all items in this output list always are "optional" (app can always fail to generate any annotations)
    { "@type": "TextDocument", "properties": {"@lang": "en"}}  
]

(Yes, this app is an English ASR app, but to simplify the example, it doesn't generate any "linguistic" unit annotations)
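
To make the fixed-value semantics above concrete, here is a minimal sketch in plain Python of how a consumer could check one annotation against one spec entry. The function name `matches_fixed_spec` and the sample annotations are hypothetical illustrations and are not part of clams-python or mmif-python.

```python
def matches_fixed_spec(annotation: dict, spec: dict) -> bool:
    """Return True if the annotation has the spec's @type and carries every
    property named in the spec with exactly the value the spec fixes."""
    if annotation.get("@type") != spec.get("@type"):
        return False
    actual = annotation.get("properties", {})
    return all(
        key in actual and actual[key] == value
        for key, value in spec.get("properties", {}).items()
    )


# Spec entry taken from the input example above.
speech_frame_spec = {
    "@type": "TimeFrame",
    "required": True,
    "properties": {"label": "speech"},
}

# Hypothetical annotations, for illustration only.
speech = {"@type": "TimeFrame", "properties": {"id": "tf_1", "label": "speech"}}
music = {"@type": "TimeFrame", "properties": {"id": "tf_2", "label": "music"}}

print(matches_fixed_spec(speech, speech_frame_spec))
print(matches_fixed_spec(music, speech_frame_spec))
```

Properties that the spec does not mention (such as `id` here) are ignored, which mirrors the idea that the spec only pins down the properties it names.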

to specify a property with "any" value from a set of possible values

This scenario is the reason this issue was originally opened. For instance, the current SWT app (v4) outputs

...
"annotations": [
  { "@type": "TimeFrame",
    "properties": {
      "label": "chyron",
      "classification": { "chyron": 0.8210625767707824 },
      "targets": [ "tp_1", "tp_2", "tp_3", "tp_4", "tp_5" ],
      "representatives": [ "tp_2" ],
      "id": "tf_1"
    }
  },
  { "@type": "TimePoint",
    "properties": {
      "timePoint": 2000,
      "label": "I",
      "classification": {
        "bars": 3.4400006825308083e-06,
        "slate": 1.3836956497925712e-05,
        "chyron": 0.966004173271358,
        "credits": 0.0013315067626535892,
        "NEG": 0.03264704300880794
      },
      "id": "tp_1"
    }
  },
  ...  # and more
]

on output side

So currently the output is specified as follows (https://apps.clams.ai/swt-detection/v4.0/metadata.json)

  "output": [{
    "@type": "TimeFrame",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"]
    }
  }, {
    "@type": "TimePoint",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"]
    }
  }],

which doesn't say anything about label, classification, or representatives properties in the actual output annotations. But that should be just fine, as those properties are "variables" and the values are different every time.

on input side

Now, let's take a simplified OCR app that's developed to target the SWT app. Namely, this OCR app wants MMIF input with TimeFrames with label="slate" or label="chyron" or label="credits". Based on the "one-of"/disjunctive interpretation of nested lists in the input list (#77), we can use something like this (@snewman-aa please confirm whether this is currently the case for the OCR apps you recently worked on)

  "input": [
    [  # nested list has "disjunctive" interpretation
      { "@type": "TimeFrame",
        "properties": { "label": "slate"}
      },
      { "@type": "TimeFrame",
        "properties": { "label": "chyron"}
      },
      { "@type": "TimeFrame",
        "properties": { "label": "credits"}
      }
    ]
  ],
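
The disjunctive reading of a nested list can be sketched as follows. This is only an illustration under assumptions: `input_satisfied` and `_matches` are hypothetical helpers (not clams-python API), a missing `required` key is treated as `True`, and a nested list is considered satisfied as soon as any one alternative matches any annotation.

```python
def _matches(annotation: dict, spec: dict) -> bool:
    """Fixed-value match: same @type and all spec properties equal."""
    if annotation.get("@type") != spec.get("@type"):
        return False
    props = annotation.get("properties", {})
    return all(k in props and props[k] == v
               for k, v in spec.get("properties", {}).items())


def input_satisfied(input_specs: list, annotations: list) -> bool:
    """Check a list of annotations against an `input` spec list.

    A dict entry must be matched by some annotation when it is required.
    A nested list is "one-of": at least one alternative must be matched.
    """
    for entry in input_specs:
        if isinstance(entry, list):
            if not any(_matches(a, alt) for alt in entry for a in annotations):
                return False
        elif entry.get("required", True):  # assumption: default is required
            if not any(_matches(a, entry) for a in annotations):
                return False
    return True


ocr_input = [
    [
        {"@type": "TimeFrame", "properties": {"label": "slate"}},
        {"@type": "TimeFrame", "properties": {"label": "chyron"}},
        {"@type": "TimeFrame", "properties": {"label": "credits"}},
    ]
]

# Hypothetical upstream annotations, for illustration only.
with_chyron = [{"@type": "TimeFrame", "properties": {"id": "tf_1", "label": "chyron"}}]
only_bars = [{"@type": "TimeFrame", "properties": {"id": "tf_1", "label": "bars"}}]

print(input_satisfied(ocr_input, with_chyron))
print(input_satisfied(ocr_input, only_bars))
```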

While we can infer just from the TimeFrame::labelset spec in the SWT app metadata that it generates MMIF suitable for this OCR app, this might not be so obvious to a type-theoretical workflow engine that lacks some intelligent type coercion system. To ensure "easier" type matching, we can have the SWT app specify all possible values of label in its output section, as we used to do in v3 of the SWT app.

  "output": [{
    "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
      "label": "bars"
    }
  }, {
    "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
      "label": "slate"
    }
  }, {
    "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
      "label": "chyron"
    }
  }, {
    "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
    "properties": {
      "timeUnit": "milliseconds",
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
      "label": "credits"
    }
  }, {
    "@type": "http://mmif.clams.ai/vocabulary/TimePoint/v1",
    "properties": {
      "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
      "timeUnit": "milliseconds"
    }
  }],

Of course it's ugly, and it is actually quite redundant now, given the labelset property of the TimeFrame annotation. Alternatively, we can specify that the OCR app looks for TFs whose labelset is a superset of ["slate", "chyron", "credits"] (although we currently have no way to express the "superset-of" part). Since such type matching in workflows is currently done entirely by humans, it's not a critical concern at the moment. So we can come back to this issue later, when we have a better understanding of the actual type-theoretical workflow engine.
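
For illustration, the labelset-based alternative boils down to a subset test that a workflow engine (or a human with a script) could run on the upstream app's output spec. The helper name `labelset_covers` is hypothetical, and nothing here is an existing clams-python feature; it is a sketch of the check, not of a metadata syntax for expressing it.

```python
def labelset_covers(output_spec: dict, wanted_labels) -> bool:
    """Return True if the spec's declared labelset contains every wanted label."""
    declared = set(output_spec.get("properties", {}).get("labelset", []))
    return set(wanted_labels) <= declared


# TimeFrame entry from the SWT v4 output spec quoted earlier in this thread.
swt_v4_timeframe = {
    "@type": "TimeFrame",
    "properties": {
        "timeUnit": "milliseconds",
        "labelset": ["bars", "slate", "chyron", "credits", "NEG"],
    },
}

# Labels the simplified OCR app is interested in.
ocr_labels = ["slate", "chyron", "credits"]

print(labelset_covers(swt_v4_timeframe, ocr_labels))
```

Note that this only says the upstream app can produce those labels, not that any given MMIF file actually contains them.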

to specify a property with "any" value

Now, let's say the OCR app uses the existence of the representatives property in the TF annotation to find target annotations, instead of label. As pointed out above, representatives is a "variable", and since we can't specify a fixed value (or a fixed list of values) for it in the app metadata, we can't specify that this app requires TFs with a representatives property. Namely, we can't do this.

  "input": [
    { "@type": "TimeFrame",
      "properties": { "representatives": ??? }  # we still want to make sure that this app wants TFs with the `representatives` prop!
    }
  ],

To solve this problem, we can introduce a special string "*" with a wildcard meaning (any value). One obvious potential problem is that a property can actually have an asterisk as its value (e.g., when specifying a delimiter in an NLP app that handles some structured text).
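
A minimal sketch of how the proposed wildcard could be interpreted during matching is shown below. It assumes the "*" string from the proposal; `matches_with_wildcard`, the `Token` type, and its `delimiter` property are hypothetical and only serve to show the ambiguity mentioned above. It does not claim to reflect how the fix was eventually implemented.

```python
WILDCARD = "*"  # assumption: the special string proposed in this thread


def matches_with_wildcard(annotation: dict, spec: dict) -> bool:
    """Match @type and properties; a spec value of "*" only requires that
    the property exists, whatever its value is."""
    if annotation.get("@type") != spec.get("@type"):
        return False
    actual = annotation.get("properties", {})
    for key, expected in spec.get("properties", {}).items():
        if key not in actual:
            return False
        if expected != WILDCARD and actual[key] != expected:
            return False
    return True


ocr_input_spec = {"@type": "TimeFrame", "properties": {"representatives": "*"}}

with_reps = {"@type": "TimeFrame",
             "properties": {"id": "tf_1", "representatives": ["tp_2"]}}
without_reps = {"@type": "TimeFrame", "properties": {"id": "tf_2"}}

print(matches_with_wildcard(with_reps, ocr_input_spec))
print(matches_with_wildcard(without_reps, ocr_input_spec))

# The ambiguity: a spec that means a literal asterisk cannot be told apart
# from the wildcard, so a comma delimiter matches as well.
delimiter_spec = {"@type": "Token", "properties": {"delimiter": "*"}}
comma_token = {"@type": "Token", "properties": {"delimiter": ","}}
print(matches_with_wildcard(comma_token, delimiter_spec))
```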

keighrim commented 3 months ago

fixed via https://github.com/clamsproject/apps/commit/b62072901f6d15cfb20cab38d7a8c21c696e2e4c and https://github.com/clamsproject/clams-python/commit/d77cf6467e2553f6575b5250482d08d184416999