clamsproject / clams-python

CLAMS SDK for python
http://sdk.clams.ai/
Apache License 2.0

output specification in app metadata #51

Closed keighrim closed 3 years ago

keighrim commented 3 years ago

This thread is for discussing the design of app output as specified in the app metadata, particularly in the produces field.

marcverhagen commented 3 years ago

Currently all applications have a list of types in the produces field, all represented by URIs. Some examples (somewhat edited to make it all more compact):

Application   Produces
segmenter     [TimeFrame]
tesseract     [BoundingBox, Alignment, TextDocument]
east          [BoundingBox]
kaldi         [TextDocument, TimeFrame, Alignment, Uri.TOKEN]
spacy         [Uri.TOKEN, Uri.POS, Uri.LEMMA, Uri.NCHUNK, Uri.SENTENCE, Uri.NE]

Two things to consider and/or change.

First is to introduce the equivalent of discriminators for representing what set of property values an app adds. For example, if an app fills in the frameType property we may want to say that the possible values are "speech" and "non-speech". This could look like "TimeFrame#frameType=speech,non-speech", which is clearly not a URI so we need to think about whether we allow that or how else to do this. If we have an audio segmentation app that generates time frames of these types we would have:

"produces": [
    "http://mmif.clams.ai/0.2.2/vocabulary/TimeFrame",
    "TimeFrame#frameType=speech,non-speech"]

The first element is a URI for a type, the second one refers back to that type and introduces specific values of properties. Instead of the enumeration we could have a URL that lists all values.
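
To make the proposal concrete, here is a minimal sketch of how a consumer could split such an entry into a type name and its property values. The function name parse_discriminator is hypothetical and not part of clams-python, and the sketch assumes the single-property "Type#property=value1,value2" shape shown above.

```python
from typing import Dict, List, Tuple


def parse_discriminator(spec: str) -> Tuple[str, Dict[str, List[str]]]:
    """Split 'Type#prop=v1,v2' into the type name and a property-to-values map.

    Hypothetical helper: entries without a '#' are returned with no properties.
    """
    type_name, _, fragment = spec.partition("#")
    properties: Dict[str, List[str]] = {}
    if fragment:
        prop, _, values = fragment.partition("=")
        properties[prop] = values.split(",") if values else []
    return type_name, properties


produces = [
    "http://mmif.clams.ai/0.2.2/vocabulary/TimeFrame",
    "TimeFrame#frameType=speech,non-speech",
]
parsed = [parse_discriminator(entry) for entry in produces]
```

The plain URI comes back with an empty property map, while the second entry yields the frameType values as a list.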

The other thing to consider is that the kind of input or the kind of parameters handed to the app may have impact on the output. For example, if Tesseract is given input with bounding boxes it may not create any bounding boxes itself. Or if spaCy is given a parameter "add-named-entities=false" then it may not produce anything of type Uri.NE.

We can consider the produces property as specifying what kind of output the app can produce, though the app may choose not to produce everything the property lists. When an app runs, it determines what its output is going to be, which could be whatever the app metadata says or a subset thereof, and then uses that to determine what it puts in the view metadata.

As a side remark, this means that if we use the app metadata to validate app output, we should not require identity of the produces metadata of the app and the view metadata created by the app.
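
A validation along these lines could be a subset check rather than an equality check. The sketch below is only an illustration: undeclared_types is a hypothetical helper, and the type names are abbreviated the same way as in the list of applications above.

```python
from typing import Iterable, Set


def undeclared_types(app_produces: Iterable[str], view_types: Iterable[str]) -> Set[str]:
    """Return the types found in a view that the app metadata never declared."""
    return set(view_types) - set(app_produces)


# Hypothetical example: Tesseract was given bounding boxes as input,
# so the view it creates contains no BoundingBox annotations of its own.
app_produces = ["BoundingBox", "Alignment", "TextDocument"]
view_types = ["Alignment", "TextDocument"]

# Valid as long as the view adds nothing beyond what the app declared.
is_valid = not undeclared_types(app_produces, view_types)
```

Under this check a view that omits a declared type still passes, while a view that introduces an undeclared type fails.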

keighrim commented 3 years ago

On the first point, using a query string instead of an anchor to specify properties might be a bit closer to standard practice. A query string also lets us specify multiple properties in a single line. For example:

"produces": [
    "http://mmif.clams.ai/x.y.z/vocabulary/TimeFrame?frameType=speech,non-speech&unit=millisecond"]
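
One practical upside is that such an entry can be taken apart with the standard library alone. The following is a rough sketch: parse_produces_entry is a hypothetical name, and treating commas as separators between alternative values is an assumption of this thread's notation, not something urllib does on its own.

```python
from typing import Dict, List, Tuple
from urllib.parse import parse_qs, urlsplit, urlunsplit


def parse_produces_entry(entry: str) -> Tuple[str, Dict[str, List[str]]]:
    """Split a produces entry into the bare type URI and its property values."""
    parts = urlsplit(entry)
    # Rebuild the URI without the query string or fragment.
    type_uri = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    properties = {
        # Assumption: commas separate alternative values of one property.
        name: [value for raw in raw_values for value in raw.split(",")]
        for name, raw_values in parse_qs(parts.query).items()
    }
    return type_uri, properties


entry = (
    "http://mmif.clams.ai/x.y.z/vocabulary/TimeFrame"
    "?frameType=speech,non-speech&unit=millisecond"
)
type_uri, properties = parse_produces_entry(entry)
```

Here type_uri is the plain TimeFrame URI, and properties maps frameType to its two values and unit to a single value.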

I wonder, though, whether there is a way to embed a long list of possible values. As an extreme example, imagine specifying the named entity types for an NER tool that uses something like 112 different entity types. In LAPPS we once tried to have a secondary namespace that assigned fixed names to sets of tags (e.g. ptb for the Penn Treebank tagset), but that didn't go very far, as far as I remember.

marcverhagen commented 3 years ago

I think that both anchors and query strings are standard/universal; the difference is that with the former we suggest that there is a URL with that anchor, and with the latter we suggest that one can send that query. With the query it is easier to bundle a lot of stuff into one URL. With both we have the issue that it is only implicit whether we mean an AND or an OR (for example, if we use more than one property).

For LAPPS we did have tag sets in the list of discriminators where we referred to third party URLs (for example the short name tags-pos-penntb was associated with http://vocab.lappsgrid.org/1.3.0-SNAPSHOT/ns/tagset/pos#penntb). This never got out of the snapshot stage though.

keighrim commented 3 years ago

The two comments above were unnecessary, as we already decided the basic structure of outputs in https://github.com/clamsproject/clams-apps/issues/17#issuecomment-777043221. One thing that I think is missing from the discussion so far is how we want to encode which types an app will output versus which types it can output (like required in the input specification). I think this is important for the pipeline builder's type checking system to decide whether one app can be connected to its downstream app.
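
To illustrate why the distinction matters for type checking, here is a rough sketch. Everything in it is hypothetical: the OutputSpec class, the guaranteed flag, and the three-way result are illustrations only, not the structure decided in the linked comment, and the check only looks at one adjacent pair of apps rather than everything accumulated upstream.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OutputSpec:
    at_type: str
    guaranteed: bool = True  # hypothetical flag: False means "can", True means "will"


def compatibility(upstream: Iterable[OutputSpec], downstream_required: Iterable[str]) -> str:
    """Classify whether a downstream app's required inputs are covered."""
    specs = list(upstream)
    will_output = {spec.at_type for spec in specs if spec.guaranteed}
    can_output = {spec.at_type for spec in specs}
    required = set(downstream_required)
    if required <= will_output:
        return "compatible"
    if required <= can_output:
        return "conditionally compatible"
    return "incompatible"


# Hypothetical metadata: bounding boxes are only produced for some inputs.
tesseract_outputs = [
    OutputSpec("BoundingBox", guaranteed=False),
    OutputSpec("Alignment"),
    OutputSpec("TextDocument"),
]
result = compatibility(tesseract_outputs, ["BoundingBox", "TextDocument"])
```

With a flat list of types the pipeline builder could not tell the middle case apart from the first one.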

keighrim commented 3 years ago

Closed via #53.