keighrim opened this issue 4 months ago
I'm in favor of this idea. Do you think it would make sense to also start enforcing/validating the directionality of alignments? Basically so that a subtype annotation can't be the source of a parent type annotation. E.g. an `Alignment` may never have a `Translation` document as the source and a `TextDocument` as the target. Only the other way around.
Another useful type may be `Summary`.
@marcverhagen Under the proposal, text-to-text summaries are `Translation`s and video-to-text or image-to-text summaries are `Caption`s. The idea is that these subtypes are not purely based on the "semantics" of the type, but also have formal proxies (in terms of associated `Alignment`s), so that downstream consumers are less coupled with semantic knowledge and can be more reactive to the syntax of the MMIF, which should be much more robust.
@wricketts Yeah, I also like the idea of directed alignment. It's pretty clear for the `Translation` case. But for the other two cases it becomes a bit murky: most of the time the source is going to be the image or audio, but there are rare cases where the text is the source (e.g., a "read" corpus where the script is actually the source of the audio, or text-to-image generation models). However, since we only allow `TextDocument`s to be annotation instances (as second-order objects compared to the objects in the top-level `mmif.documents` list), I don't think we need to consider those rare cases for now.
In terms of implementing the directionality, we could add an attribute to the alignment type, or we could super-/subclass an "undirected alignment".
I could see a subclass `DirectedAlignment` that inherits from the (less constrained) `Alignment`, giving nice flexibility for validating `source` and `target` annotation at_types. But probably you or someone who's done extensive work on the SDK would have the best judgment on this.
I have been thinking of an alternative for the word `Translation` because of its connotations. If I had read the original proposal above better I would have noticed that summary was intended to be a kind of `Translation`, but it is somewhat unintuitive to me that a summary is a kind of translation. I was thinking `Transformation`, but I don't like that either.
An alternative would be to not introduce new annotation types but a `textType` property whose value can be "transcript", "summary", or "caption", similar to how we have bounding boxes of type "text". There is something to be said for having specialized types with which we can express relationships that pop up in multimodal processing; there is also something to be said for keeping the number of types low.
Also, instead of alignments we can think of some of the relations between text documents and other types as proper relations, and think (again) about how to structure the vocabulary part under `Relation`.
In a related issue we discussed introducing types like CSV and CoNLL in the hope that they would help with quickly rolling out a wrapped RFB app. We were thinking of those as MIME types though (incidentally, the vocabulary would have to change a bit, since it only allows MIME types if you use the "location" property instead of the "text" property).
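For illustration, simplified document fragments showing that contrast; these are hypothetical examples and the type URI version suffix may differ in the actual vocabulary.

```python
# Simplified, hypothetical document fragments to illustrate the point above;
# the type URI version suffix may differ in the actual vocabulary.
csv_by_location = {
    "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
    "properties": {"id": "d1", "mime": "text/csv", "location": "file:///data/table.csv"},
}
# With inline "text" there is currently no slot to record a MIME type:
csv_inline = {
    "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
    "properties": {"id": "d2", "text": {"@value": "a,b,c\n1,2,3"}},
}
```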
Yeah, I also thought about having it as a property. One thing we can (possibly greatly) benefit from with a separate (sub-)type is that we can syntactically distinguish a `TextDocument` in the source `documents` list from a `TextDocument` in views' `annotations` lists. But we can take the best of both worlds by adding just one type (`DerivedTextDocument` or something like that) as a new vocab member and putting the subtype information in a property value.
However, another point to consider is that vocab types are actually defined behind URIs, while for property values we don't have any effective way to regulate them (beyond developers' diligence), so we are likely to end up dealing with lots of quirky naming and typos (we have already experienced enough problems from minor quirks like `timeUnit=millisecond` vs. `timeUnit=milliseconds`).
Notes from yesterday's in-person meeting:

- @marcverhagen wants to keep the number of vocab items low, while seeking a way to "control" the property values. He will work on drafting a new vocabulary yaml file format that provides an `enumeration` sub-field for each property, giving a finite set of pre-defined possible values. Without an `enumeration` subfield, a property should be considered able to bear any "free" value.
- This new vocab file will be used in the SDK to automatically generate some mechanism for validation of property values.
- @keighrim suggested revisiting https://github.com/clamsproject/mmif/issues/23 and https://github.com/clamsproject/mmif-python/issues/23.
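A rough sketch of how such an `enumeration` sub-field could drive property-value validation; the yaml layout below is only a guess at the format being drafted, not the actual one.

```python
# Sketch of how an `enumeration` sub-field in a hypothetical vocab yaml could
# drive property-value validation in the SDK; the yaml layout below is a guess
# at the format being drafted, not the actual one.
import yaml  # pyyaml

VOCAB_YAML = """
TimeFrame:
  properties:
    timeUnit:
      type: string
      enumeration: [milliseconds, seconds, frames]
    label:
      type: string          # no enumeration -> any "free" value is accepted
"""

def validate_property(vocab: dict, at_type: str, key: str, value):
    """Return True if `value` is allowed for `key` on `at_type` under the vocab."""
    prop_spec = vocab.get(at_type, {}).get("properties", {}).get(key, {})
    allowed = prop_spec.get("enumeration")
    return allowed is None or value in allowed

vocab = yaml.safe_load(VOCAB_YAML)
print(validate_property(vocab, "TimeFrame", "timeUnit", "milliseconds"))  # True
print(validate_property(vocab, "TimeFrame", "timeUnit", "millisecond"))   # False
print(validate_property(vocab, "TimeFrame", "label", "anything"))         # True
```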
Some additional notes from my thoughts:

- The property key `textType` doesn't really sit well with me; in fact, any `xxxType` keys [^1] just seem very wrong to me, given that all vocab items already have a "type" (`@type`) property and we call them vocabulary "types". To that end, I believe any value associated with the "typing" of a type should just be an actual "type", instead of an `xxxType` property within a "type". My alternative suggestion for `textType` is `origin` or `origination`.
- Values in the `enumeration` list must match the data type of the property, so I think we might need to revisit how we define the data types of each property and add more type-theoretic formalism to our practice. This will somewhat address https://github.com/clamsproject/mmif/issues/215.

[^1]: We had `frameType` and `boxType` in the past, but no longer (https://github.com/clamsproject/mmif/issues/218). So there's actually no `xxxType` property at the moment, to be clear.
> The property key `textType` doesn't really sit well with me; in fact, any `xxxType` keys just seem very wrong to me, given that all vocab items already have a "type" (`@type`) property and we call them vocabulary "types". To that end, I believe any value associated with the "typing" of a type should just be an actual "type", instead of an `xxxType` property within a "type". My alternative suggestion for `textType` is `origin` or `origination`.
Yes, it is weird to have `@type` and `textType`. Origin is not a bad name, as long as we document exactly what we mean by it (everything has an origin, so when do we use it?). My main worry with introducing additional annotation types is potential bloating of the vocabulary; if the bloat is very minimal then it should be okay. The notion of subtype or label does not bother me as an additional way of bringing in some kind of typing.

Also, an additional concern is the status of the LAPPS vocabulary and the potential need to expand the CLAMS vocabulary (see https://github.com/clamsproject/mmif/issues/202).
> We had `frameType` and `boxType` in the past, but no longer (https://github.com/clamsproject/mmif/issues/218). So there's actually no `xxxType` property at the moment, to be clear.
They are still in the vocabulary at https://mmif.clams.ai/1.0.5/vocabulary/, so that's technically not quite correct. Current code in the timeframe evaluation makes heavy use of `frameType`. We did introduce `label` and `classification` to streamline this.
You are right that `frameType` and `boxType` are technically still on the vocab page. But in practice they are there only for historical reasons, and the latest versions clearly indicate that using those keys is no longer recommended. The SDK is aware of this "deprecation" and lets you query by either name: https://github.com/clamsproject/mmif-python/pull/262
One more thing I wanted to add to that PR 262 was automatic "canonicalization" of the key name `xxxType` into `label` at `.add_property()` time, but that seemed too "magic", so I stopped there. I'm still more than eager to do it, though, so that I don't see any `xxxType` in future MMIFs.
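For what it's worth, the canonicalization itself could be as small as the sketch below; the legacy key set and the idea of hooking it into `.add_property()` are assumptions, not current SDK behavior.

```python
# Minimal sketch of the "canonicalization" idea: rewrite legacy xxxType keys to
# `label` before a property is stored. The set of legacy keys and the hook point
# at `.add_property()` are assumptions, not current SDK behavior.

LEGACY_TYPE_KEYS = {"frameType", "boxType"}  # assumed legacy keys

def canonicalize_property(key: str, value):
    """Map a legacy xxxType key to the `label` key, leaving other keys untouched."""
    if key in LEGACY_TYPE_KEYS:
        return "label", value
    return key, value

print(canonicalize_property("frameType", "slate"))        # ('label', 'slate')
print(canonicalize_property("timeUnit", "milliseconds"))  # ('timeUnit', 'milliseconds')
```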
Just to be clear, these `xxxType` properties were never used for manual "typing" of an annotation object, to my knowledge and according to the current archive of app metadata:
```
$ grep Type clamsproject/apps/docs/_apps/**/**/metadata.json
aapb-pua-kaldi-wrapper/v1/metadata.json: "description": "When true, the app looks for existing TimeFrame { \"frameType\": \"speech\" } annotations, and runs ASR only on those frames, instead of entire audio files.",
aapb-pua-kaldi-wrapper/v2/metadata.json: "description": "When true, the app looks for existing TimeFrame { \"frameType\": \"speech\" } annotations, and runs ASR only on those frames, instead of entire audio files.",
barsdetection/v1.0/metadata.json: "frameType": "bars"
barsdetection/v1.1/metadata.json: "frameType": "bars"
chyron-detection/v1.0/metadata.json: "frameType": "chyron"
east-textdetection/v1.0/metadata.json: "name": "frameType",
east-textdetection/v1.1/metadata.json: "name": "frameType",
east-textdetection/v1.2/metadata.json: "name": "frameType",
fewshotclassifier/v1.0/metadata.json: "frameType": "string"
fewshotclassifier/v1.0/metadata.json: "name": "finetunedFrameType",
gentle-forced-aligner-wrapper/v1.0/metadata.json: "frameType": "speech"
gentle-forced-aligner-wrapper/v1.0/metadata.json: "frameType": "speech",
parseqocr-wrapper/v1.0/metadata.json: "boxType": "text"
pyscenedetect-wrapper/v1/metadata.json: "frameType": "shot",
pyscenedetect-wrapper/v2/metadata.json: "frameType": "shot",
slatedetection/v1.0/metadata.json: "frameType": "string"
slatedetection/v1.1/metadata.json: "frameType": "string"
slatedetection/v1.2/metadata.json: "frameType": "string"
slatedetection/v2.0/metadata.json: "frameType": "slate"
slatedetection/v2.1/metadata.json: "frameType": "slate"
swt-detection/v3.0/metadata.json: "frameType": "bars"
swt-detection/v3.0/metadata.json: "frameType": "slate"
swt-detection/v3.0/metadata.json: "frameType": "chyron"
swt-detection/v3.0/metadata.json: "frameType": "credits"
tesseractocr-wrapper/v1.0/metadata.json: "boxType": "text"
tesseractocr-wrapper/v1.0/metadata.json: "name": "frameType",
tonedetection/v1.0/metadata.json: "frameType": "tone"
```
Instead, they have always been used to capture algorithmic classification results (they are all now replaced with the `label` key). So I'm tempted to say that we define "typing" as a manual task of building a hierarchical categorization of the annotations, while "labeling" is an algorithmic classification task. (To that end, should we allow "multi-labels" in the labeling task?) That all being said, the property proposed in this thread is closer to "typing" in that the value is manually set by a human, and I'm still open to adding them as separate vocab items; I don't think that is bloat, but a necessary "upgrade".
More thoughts on the implementation of "full validation" of annotation objects based on the vocab yaml (or an equivalent piece of information from the spec).
New Feature Summary
With a number of recent developments, I'd like to propose more vocab types that are subcategories of `TextDocument` (all names in the proposal are tentative):

- `Transcript`: a subtype of text document, always aligned to annotations in non-linguistic modalities (audio, vision), representing a linguistic, "literal" transcript of the source modality (e.g. ASR, TR/OCR).
- `Translation`/`Transformation`/`Extraction`: a subtype of text document, always aligned to another `TextDocument`-type annotation. The content of this annotation must be a kind of "re-writing" of the source text document (e.g. the identity function in the text-slicer, structural parsing in RFB, summaries in text-summarizer apps).
- `Caption`: similar to `Transcript`, but the content is not a "literal" transcript of the source modality (e.g. image-based captioning apps, or an audio-based summarizer app that handles non-linguistic sounds like dog barking).

Related
The addition of subtypes of text document will ease the identification of "app patterns" without relying on a specific app name, and hence help generalize I/O specs for any downstream/consumer applications.
The issue of view pattern identification has been raised many times, including
Alternatives
No response
Additional context
Also see https://github.com/clamsproject/app-role-filler-binder-new/issues/4 for discussion on development of a prototype "app pattern".
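To illustrate the kind of "app pattern" detection this proposal would enable, here is a sketch of a downstream consumer finding the "caption pattern" purely from MMIF syntax; the type URIs and the simplified view dict are assumptions for illustration, not the actual spec.

```python
# Sketch of detecting the proposed "caption pattern" purely from MMIF syntax;
# the type URIs and the simplified view structure are assumed for illustration.

CAPTION = "http://mmif.clams.ai/vocabulary/Caption/v1"      # hypothetical type
ALIGNMENT = "http://mmif.clams.ai/vocabulary/Alignment/v1"  # version suffix assumed

def find_caption_pairs(view: dict):
    """Yield (source, target) id pairs of Alignments whose target is a Caption."""
    captions = {a["properties"]["id"]
                for a in view["annotations"] if a["@type"] == CAPTION}
    for a in view["annotations"]:
        if a["@type"] == ALIGNMENT and a["properties"]["target"] in captions:
            yield a["properties"]["source"], a["properties"]["target"]

# "tf1" stands for e.g. a TimeFrame annotation produced in another view
view = {"annotations": [
    {"@type": CAPTION, "properties": {"id": "td1", "text": {"@value": "a dog barking"}}},
    {"@type": ALIGNMENT, "properties": {"id": "al1", "source": "tf1", "target": "td1"}},
]}
print(list(find_caption_pairs(view)))  # [('tf1', 'td1')]
```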