clamsproject / mmif

MultiMedia Interchange Format
Apache License 2.0

Redesigning the vocabulary and MMIF #86

Closed: marcverhagen closed this issue 3 years ago

marcverhagen commented 4 years ago

This was raised in https://github.com/clamsproject/mmif/issues/59, but the issue goes beyond alignment, so I am creating a new issue.

marcverhagen commented 4 years ago

@keighrim @angus-lherrou

There is a write-up of our discussion in commit c6eacf4. It sort of morphed into a proposal.

marcverhagen commented 4 years ago

Better to just go straight to the branch: https://github.com/clamsproject/mmif/blob/86-vocabulary-redesign/specifications/notes/vocabulary-redesign.md.

marcverhagen commented 4 years ago

The link above proposed using TextDocuments in views for creating new text content, but it left open the possibility of using media and sub-media. Both would work and both seem to be of a similar level of complication, but there was a general feeling, which we could not quite pin down, that using TextDocuments in views is a bit cleaner.

So a MMIF document starts with a list of media. We currently have no restrictions on how many media or how many instances of a certain type are allowed (for example, we could even have more than one VideoDocument). The one restriction we do have is that the list and its contents may not change.

Some things that require more thought:

  1. I proposed that TextDocument has a textSource property that represents what image or speech segment a text was created from. Keigh brought up the point that this amounts to having two ways to do alignment. We should probably use the Alignment annotation type for both. This, by the way, fits more nicely with representing new texts as TextDocuments in views.

  2. Not solved yet is how to do alignment. What are we aligning: any annotation type, only types that merely segment the input without adding a label, or something else? Do we split the types as proposed in issue #59? And if so, how do we make that conceptually clear and what names do we use? What do we need for anchoring?

  3. We probably already have a good handle on this, but we need to be clear on how to use identifiers to refer to objects. Can we force users to have a naming scheme (media identifiers start with 'm', view identifiers start with 'v'), and if we do, what good will that do us? This is a bit of an unwritten rule, but we assume that (1) identifiers of views are unique across all views, (2) identifiers of media are unique across all media, and (3) identifiers of annotation objects are unique across all annotation objects in the same view. So a view can have the same identifier as a medium or as some annotation object. When we refer to an annotation we use the annotation's identifier if we refer from another annotation in the same view; otherwise we use "view_id:annotation_id", and it should be legal to use the latter within the same view as well (see the sketch after this list).

  4. The medium metadata property is now on the view, but should perhaps go back into the contains section. That property is really the same as the proposed document property, so we should streamline those names. Keigh suggested using document and that sounds good to me, but it would result in two properties with the same name, and JSON-LD contexts will not like that.

  5. Speaking of JSON-LD contexts: we tend to have applications produce one view, and we now have an application (Kaldi) that generates types from both the CLAMS and the LAPPS universes. The way we currently do contexts cannot deal with that, and restricting annotation types in one view to either CLAMS or LAPPS seems very silly.
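
As a minimal illustration of the identifier scheme in point 3, here is a sketch of reference resolution over a parsed MMIF file. This is plain Python over dicts, not the actual MMIF SDK; the function name resolve_reference and the fallback to the current view for bare identifiers are illustrative assumptions.

# Hypothetical helper, not part of any existing SDK. `mmif` is assumed to be
# an MMIF file parsed with json.load().
def resolve_reference(mmif, ref, current_view_id=None):
    """Resolve 'view_id:annotation_id' references, or bare ids within the current view."""
    if ":" in ref:
        view_id, ann_id = ref.split(":", 1)
    else:
        view_id, ann_id = current_view_id, ref
    for view in mmif.get("views", []):
        if view["id"] != view_id:
            continue
        for annotation in view.get("annotations", []):
            # ids may appear at the top level of the annotation or under "properties"
            props = annotation.get("properties", {})
            if ann_id in (annotation.get("id"), props.get("id")):
                return annotation
    raise KeyError("cannot resolve reference %r" % ref)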

marcverhagen commented 4 years ago

Speaking of JSON-LD: why and how do we use it? It is not hard to motivate JSON as an interchange format. But JSON-LD was created mostly to link web-accessible content, to let you define node types, and to define what certain features like frameType mean by mapping the feature to an IRI.

We are using JSON-LD's @type property so we can name and identify annotation types like TimeFrame, and we have URLs for those that also contain definitions for common features. The value of the @type property needs to be an IRI, and we use a context to define a vocabulary so that TimeFrame is expanded to http://mmif.clams.ai/0.1.0/TimeFrame. However, at that point everything is defined relative to that vocabulary, so any time we use a property it is expanded relative to the vocabulary. That also means all features expand that way, so frameType expands to http://mmif.clams.ai/0.1.0/frameType, which we won't have a URL for. The current context actually specifically maps "frameType" to "TimeFrame#frameType", which is then expanded to http://mmif.clams.ai/0.1.0/TimeFrame#frameType. One limitation here is that we cannot define a feature on different types unless we define a context for each annotation object, which sounds wrong. This becomes even more annoying if we decide to merge the CLAMS and LAPPS vocabularies.
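
To make the expansion behavior concrete, here is a rough sketch that mimics @vocab-relative expansion in plain Python; it is not a real JSON-LD processor, and the explicit frameType mapping mirrors the one described above.

# Rough approximation of @vocab expansion, for illustration only.
VOCAB = "http://mmif.clams.ai/0.1.0/"
CONTEXT = {
    # explicit term mapping needed to land on the TimeFrame fragment
    "frameType": "TimeFrame#frameType",
}

def expand(term):
    """Expand a term relative to the vocabulary, honoring explicit mappings first."""
    if term.startswith(("http://", "https://")):
        return term
    return VOCAB + CONTEXT.get(term, term)

print(expand("TimeFrame"))  # http://mmif.clams.ai/0.1.0/TimeFrame
print(expand("frameType"))  # http://mmif.clams.ai/0.1.0/TimeFrame#frameType
print(expand("start"))      # also expands against the vocabulary, URL or not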

We now have three context files: one is imported at the top level of an MMIF file and basically defines a namespace for MMIF-specific properties like media and views; the other two define CLAMS types and LAPPS types, and one of them is imported at the top level of each view. With mixed views that last approach is no longer possible.

Three alternatives:

  1. no context
  2. expand types individually in the context
  3. no JSON-LD

Alternative 1. We do not use a context and always use full IRIs as the value of the @type attribute, which has been our habit anyway. We can still use all the features in an annotation type that we want, but they would just not be linked data. This takes namespaces out of the whole picture and may impact the MMIF SDK.

Alternative 2. We do not use a vocabulary in the context file, but expand annotation types individually. We then generate a context file from the vocabulary for just the annotation types:

{
    "Annotation": "http://mmif.clams.ai/0.1.0/Annotation",
    "TimePoint": "http://mmif.clams.ai/0.1.0/TimePoint",
    "TimeFrame": "http://mmif.clams.ai/0.1.0/TimeFrame",
}

The remark above on features holds here as well.

Alternative 3. We ditch JSON-LD and just stipulate that for us the @type feature has a URL as its value and that the URL defines the type. JSON-LD is called lightweight, but its syntax specification is unwieldy and contains passages that I need to read several times to understand. It is likely of an appropriate complexity level for what it needs to do for linking data, but we do not need that.

With our current approach I feel we are somewhat muddling along, and it imposes some requirements on how we design the vocabulary. The last alternative is somewhat extreme, and I do like using @type and @value. Alternative 1 is the least amount of work for us. With alternative 2 we would still need to generate a new context file whenever we add new types to the vocabulary, but that is easily automated (see the sketch below).
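
A minimal sketch of that automation, assuming the vocabulary is available as a flat list of type names; the type list, version string, and output path are placeholders for illustration, not part of any existing build script.

# Hypothetical context-file generator for alternative 2.
import json

VOCAB_BASE = "http://mmif.clams.ai/0.1.0/"
TYPE_NAMES = ["Annotation", "Region", "TimePoint", "TimeFrame", "Alignment"]

def build_context(base, type_names):
    """Map each annotation type name to its full IRI (no @vocab, no feature terms)."""
    return {name: base + name for name in type_names}

with open("vocab-types-context.json", "w") as out:
    json.dump(build_context(VOCAB_BASE, TYPE_NAMES), out, indent=2)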

keighrim commented 4 years ago

Not solved yet is how to do alignment. What are we aligning: any annotation type, only types that merely segment the input without adding a label, or something else? Do we split the types as proposed in issue #59? And if so, how do we make that conceptually clear and what names do we use? What do we need for anchoring?

I've been thinking about these use cases for alignment:

  1. An ML researcher wants to get all annotations at each time frame to generate a concatenated vector representation.
  2. A web developer wants to develop a MMIF visualizer with a video player and a transcript reader side by side, with bidirectional synchronization between the two panels, i.e. 1) as the video plays, auto-scroll the text panel, and 2) as the reader clicks on a word in the text, seek to that point in the video player.

With those two scenarios in mind, let's think about a MMIF file that has:

  1. a manual transcript (given as primary)
  2. a token-level forced-alignment view (spanning the two primary media)
  3. an NE view only on the text medium

To get to that final MMIF file, we would probably start with this:

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4"
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      }
    }
  ]
}

Using the schema proposed in be17dba3724ea39581964f36aef9e859a5980146, the first view, with the forced alignment, would be generated as follows:

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4"
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      }
    }
  ],
  "views": [
    {
      "id": "v1",
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:23:45",
        "app": "http://tools.clams.ai/gentle-forced-aligner",
        "contains": {
          "http://mmif.clams.ai/0.1.0/TimeFrame": {
            "unit": "milliseconds"
          },
          "http://mmif.clams.ai/0.1.0/Alignment": {
            "sourceType": "http://mmif.clams.ai/0.1.0/TimeFrame",
            "targetType": "http://vocab.lappsgrid.org/Token"
          },
          "http://vocab.lappsgrid.org/Token": {}
        }
      },
      "document": "m1,m2",  # should this be a list of media? 
      "annotations": [
        {
          "@type": "http://vocab.lappsgrid.org/Token",
          "id": "t1",
          "properties": {
            "document": "m2",
            "start": 0,
            "end": 5,
            "text": "Karen"
          }
        },
        {
          "@type": "http://mmif.clams.ai/0.1.0/TimeFrame",
          "id": "tf1",
          "properties": {
            "document": "m1",
            "start": 17,
            "end": 1064
          }
        },
        {
          "@type": "http://mmif.clams.ai/0.1.0/Alignment",
          "id": "a1",
          "properties": {
            "source": "t1",
            "target": "tf1"
          }
        },
      ...
      ]
    }
  ]
}

Now the NE tagger follows, and because the tool it wraps uses its own tokenization rules (it has no API to force existing tokenizations), it will generate a view like this:

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4"
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      }
    }
  ],
  "views": [
    {
      "id": "v1", 
      ...
    },
    {
      "id": "v2",
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:25:45",
        "app": "http://tools.clams.ai/some-named-entity-tagger",
        "contains": {
          "http://vocab.lappsgrid.org/NamedEntity": {
            "namedEntityCategorySet": "CONLL-2004"
          }
        }
      },
      "document": "m2",
      "annotations": [
        {
          "@type": "http://vocab.lappsgrid.org/NamedEntity",
          "id": "n1",
          "properties": {
            "document": "m2",
            "start": 0,
            "end": 5,
            "category": "person"
          }
        },
        {
          "@type": "http://vocab.lappsgrid.org/NamedEntity",
          "id": "n2",
          "properties": {
            "document": "m2",
            "start": 14,
            "end": 27,
            "category": "geolocation"
          }
        }
      ]
    }
  ]
}

Suppose the JSON right above is our final MMIF output from this pipeline, and now go back to use case 2. Specifically, I want to implement the text-panel-to-video-player alignment in HTML. Crudely speaking, that would be done by using the alignment annotations from v1: add an <a> tag that encodes the timestamp to each token and attach JavaScript code to each <a> tag that manipulates the <video> element (using the timestamp).

However, what if the website designers decide that not every token is meaningful to readers and want to implement such linkage only for the named entities in the text? Because the view where the NEs are (v2) doesn't have any alignment annotations, the MMIF alignment magic now has to 1) iterate over all views to find a view that has an alignment between the text medium and the video medium (v1), 2) align v1:Token annotations with v2:NamedEntity annotations, and then finally 3) iterate through v1:Alignment annotations to find the aligned timestamps for each v2:NamedEntity.
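
For concreteness, a rough sketch of steps 2 and 3 of that traversal over the final MMIF above (step 1 is elided by passing the view ids directly). This uses plain dicts rather than any real SDK; the function name, the parameter defaults, and the span-containment test for matching entities to tokens are assumptions about how it might be implemented, and the source/target pairing follows the Alignment instances in the v1 example rather than the view metadata.

# Illustrative traversal only; `mmif` is the final MMIF file parsed with json.load().
def timeframes_for_entities(mmif, fa_view_id="v1", ne_view_id="v2"):
    """Map each NamedEntity id in ne_view to the TimeFrames of the Tokens it covers."""
    views = {v["id"]: v for v in mmif["views"]}
    fa_anns = views[fa_view_id]["annotations"]
    ne_anns = views[ne_view_id]["annotations"]
    tokens = [a for a in fa_anns if a["@type"].endswith("Token")]
    frames = {a["id"]: a for a in fa_anns if a["@type"].endswith("TimeFrame")}
    # step 3 data: token id -> aligned time frame, following the Alignment annotations
    tok2frame = {a["properties"]["source"]: frames[a["properties"]["target"]]
                 for a in fa_anns if a["@type"].endswith("Alignment")}
    result = {}
    for ne in (a for a in ne_anns if a["@type"].endswith("NamedEntity")):
        ne_start, ne_end = ne["properties"]["start"], ne["properties"]["end"]
        # step 2: tokens whose character span falls inside the entity span
        covered = [t for t in tokens
                   if ne_start <= t["properties"]["start"]
                   and t["properties"]["end"] <= ne_end]
        result[ne["id"]] = [tok2frame[t["id"]] for t in covered if t["id"] in tok2frame]
    return result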

It is also not clear to me what problem this solves that cannot be solved without this split.

The reason I proposed segment objects and a segmentations list in #59 was to eliminate the need for step 2 in the above example. By keeping segmentations directly under each medium (not under individual views), the MMIF SDK can normalize segmentations from different apps at runtime (merging identical spans, sorting spans by start point, and so on).

And then, by making alignments a first-class object in MMIF, we also have one place to put all alignment information from different aligning apps (ASR, forced alignment, OCR), which would eliminate step 1.
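
A small sketch of the normalization mentioned above (merging identical spans and sorting by start point), operating on a medium's segments list as plain dicts; the exact merging policy, keeping the first segment seen per span, is an assumption rather than part of the proposal.

# Illustrative normalization of a "segments" list as in the proposal above.
def normalize_segments(segments):
    """Merge segments with identical (type, start, end) and sort by start point."""
    merged = {}
    for seg in segments:
        key = (seg["@type"],
               int(seg["properties"]["start"]),
               int(seg["properties"]["end"]))
        # keep the first segment seen for each identical span
        merged.setdefault(key, seg)
    return sorted(merged.values(),
                  key=lambda s: (int(s["properties"]["start"]),
                                 int(s["properties"]["end"])))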

So with the schema proposed in #59, the MMIF would look like this after being processed by the forced aligner (the example JSON files below are based on the draft design and do not consider many LD aspects):

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4",
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "17", "end": "1064" }, # Karen 
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s2",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "1241", "end": "1945" }, # flew
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s3",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "2113", "end": "2410" }, # to
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s4",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "2801", "end": "3876" }, # New
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s5",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "4017", "end": "4953" }, # York
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s6",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "5252", "end": "6123" }, # City
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        }
      ]
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      },
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "0", "end": "5" }, # Karen
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s2",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "6", "end": "10" }, # flew
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s3",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "11", "end": "13" }, # to
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s4",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "14", "end": "17" }, # New
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s5",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "18", "end": "22" }, # York
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s6",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "23", "end": "27" }, # City
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s7",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "27", "end": "28" }, # .
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        }
      ]
    }
  ],
  "alignments": [
    {
      "id": "a1", # not sure this many metadata is useful for alignment lists
      "source": "m1",
      "target": "m2",
      "sourceType": "https://.../VideoDocument",
      "targetType": "https://.../TextDocument",
      "aligned": [
        [ "m1:s1", "m2:s1" ],
        [ "m1:s2", "m2:s2" ],
        [ "m1:s3", "m2:s3" ],
        [ "m1:s4", "m2:s4" ],
        [ "m1:s5", "m2:s5" ],
        [ "m1:s6", "m2:s6" ]
      ]
    }
  ],
  "views": [
    {
      "id": "v1",
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:23:45",
        "app": "http://tools.clams.ai/gentle-forced-aligner"
      },
      "annotations": []
      # this app only creates segments, no (labeled) annotations.
      # and notably, it writes something outside of its own view, which bugs us.
    }
  ]
}

And finally the next app, a named entity tagger that doesn't care about previous tokenization, comes in and outputs this final MMIF:

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4",
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "17", "end": "1064" }, # Karen 
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s2",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "1241", "end": "1945" }, # flew
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s3",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "2113", "end": "2410" }, # to
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s4",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "2801", "end": "3876" }, # New
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s5",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "4017", "end": "4953" }, # York
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s6",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "5252", "end": "6123" }, # City
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        }
      ]
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      },
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "0", "end": "5" }, # Karen
          # Karen [0,5] is also segmented by the NE tagger, but merged with existing segment
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s2",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "6", "end": "10" }, # flew
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s3",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "11", "end": "13" }, # to
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s4",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "14", "end": "17" }, # New
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s8",  # a new segment is added, in the middle of an existing list
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "14", "end": "27" }, # New York City
          "from": "v2",
          "by": "http://tools.clams.ai/some-named-entity-tagger"
        },
        {
          "id": "s5",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "18", "end": "22" }, # York
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s6",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "23", "end": "27" }, # City
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        },
        {
          "id": "s7",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "27", "end": "28" }, # .
          "from": "v1",
          "by": "http://tools.clams.ai/gentle-forced-aligner"
        }
      ]
    }
  ],
  "alignments": [
    {
      "id": "a1",
      "source": "m1",
      "target": "m2",
      "sourceType": "https://.../VideoDocument",
      "targetType": "https://.../TextDocument",
      "aligned": [
        [ "m1:s1", "m2:s1" ],
        [ "m1:s2", "m2:s2" ],
        [ "m1:s3", "m2:s3" ],
        [ "m1:s4", "m2:s4" ],
        [ "m1:s5", "m2:s5" ],
        [ "m1:s6", "m2:s6" ]
      ]
    }
  ],
  "views": [
    {
      "id": "v1",
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:23:45",
        "app": "http://tools.clams.ai/gentle-forced-aligner"
      },
      "annotations": []
    }, 
    {
      "id": "v2", 
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:25:45",
        "app": "http://tools.clams.ai/some-named-entity-tagger",
        "contains": {
          "http://vocab.lappsgrid.org/NamedEntity": {
            "namedEntityCategorySet": "CONLL-2004"
          }
        }
      },
      # no target document specified as the anchors already have that information
      "annotations": [
        {
          "@type": "http://vocab.lappsgrid.org/NamedEntity",
          "properties": {
            "id": "n1",
            "on": "m2:s1",
            "category": "person"
          }
        },
        {
          "@type": "http://vocab.lappsgrid.org/NamedEntity",
          "properties": {
            "id": "n2",
            "on": "m2:s8",
            "category": "geolocation"
          }
        }
      ]
    }
  ]
}

The primary reason segments (anchors) and alignments reside outside of their originating views is that they can be re-used by different apps. Also, by keeping the same kinds of segments in a unified list, we can make some operations much easier. However, this approach still has many problems, such as:

  1. this allows writable objects (segments), and it doesn't feel right to many of us.
  2. this would make it really complicated to have a TextDocument (or other document kinds) inside the annotations list, and there are many other issues that I can't think of right now.

marcverhagen commented 4 years ago

A few remarks on this:

To elaborate on the last point, here are the two use cases again:

  1. An ML researcher wants to get all annotations at each time frame to generate a concatenated vector representation.

  2. A web developer wants to develop a MMIF visualizer with a video player and a transcript reader side by side, with bidirectional synchronization between the two panels, i.e. 1) as the video plays, auto-scroll the text panel, and 2) as the reader clicks on a word in the text, seek to that point in the video player.

What I imagine is that the SDK could take the MMIF file and build some kind of graph with annotations rooted in their anchors (the start and end offsets of the annotation) or in other annotations. All elements in the graph would be indexed and there would be some extra functionality to do things like finding all annotations at a particular location (a single offset or a span); some auxiliary data structure may be needed for this.
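
As a rough illustration of that kind of lookup (and not the actual SDK design), here is a brute-force version that collects the annotations anchored on a document whose span covers a given offset; the linear scan stands in for whatever auxiliary data structure would be used in practice, and the function name is hypothetical.

# Illustrative only: brute-force "annotations at a location" lookup over parsed MMIF.
def annotations_at(mmif, document_id, point):
    """Return (view id, annotation) pairs anchored on `document_id` covering `point`."""
    hits = []
    for view in mmif.get("views", []):
        for ann in view.get("annotations", []):
            props = ann.get("properties", {})
            if props.get("document") != document_id:
                continue
            start, end = props.get("start"), props.get("end")
            if start is not None and end is not None and int(start) <= point < int(end):
                hits.append((view["id"], ann))
    return hits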

Use case 1. I am not sure what "each time frame" means, but an SDK that can find annotations at certain time points and time frames would be helpful. Since each annotation has an anchor or target, the information we are looking for is in the MMIF file even under the current specifications.

Use case 2. This requires the ability to follow the entity to the token, to the alignment, and then to the time frame. Again, that information is in the MMIF file and the SDK could deliver functionality that gets it efficiently. With the above proposal you indeed do not need to first trace to the Token and then to the alignment and then to the segment, but the developer does not need to do that anyway when the SDK does it for them.

This is of course not to say that the current proposal is better than having a list of alignments at the top level linking to segments at the top level. But I think the use cases are not decisive in picking what we like best. What would make sense to think about is what kind of information we would like to easily extract using the SDK, but that is more of an issue for SDK design than for MMIF design.

What we project to the outside world are MMIF and the SDK API; those need to be as simple as possible, and the more complexity we hide in the SDK the better.

marcverhagen commented 4 years ago

@keighrim @angus-lherrou

I am running into a little glitch while working out how to use alignments within views, and I think this also extends to the alternative approach of having an alignment list at the top level. The issue is that when we have different kinds of alignments in a view or in an alignment list, we cannot just rely on the metadata to tell us what the types of the alignments are.

With Kaldi output we have a text document aligned with a speech segment and a bunch of tokens aligned with a bunch of speech segments. We could have sourceType and targetType as regular properties as well (which we also need to do for the document property) and then override the type for the text document, and something similar could be done with the list of alignments (but that won't be pretty).

An alternative is to entirely ditch the metadata properties, which would simplify things. An alignment is between two annotations and we really do not specify what type they are. Is there any SDK-related reason (or other reason) why we should not abolish the sourceType and targetType properties?

keighrim commented 4 years ago

So Kaldi essentially generates three types of annotations: 1) the text itself, 2) tokens, and 3) time frames (at the token level). How about we decide to align only the smallest units when there are annotations of the same kind at different granularities? In this case, we only align 2) and 3) (both are at the token level, and both are the shortest of their kinds).

marcverhagen commented 4 years ago

If we do that we will lose the information that the entire text document is aligned with the entire speech/video stream, which I guess is fine, but it is potentially somewhat inconsistent with other text documents, most of which will align with some speech segment or image region. I would still like us to have a good reason for having those types.

keighrim commented 4 years ago

I don't think we lose the information about the alignment between the TextDocument and the TimeFrame::speech segment, simply because it is trivial to reconstruct that linkage by sorting the Token-to-TimeFrame::token alignment list and traversing it to find matching start and end points.
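
A quick sketch of that reconstruction over a Kaldi-style view with Token-to-TimeFrame Alignment annotations like the v1 example earlier in this thread (plain dicts, not the SDK); the assumption that the Alignment targets are the TimeFrames follows that example, and the function name is hypothetical.

# Illustrative: recover the document-level time span from token-level alignments.
def document_timespan(view):
    """Return (earliest start, latest end) over all TimeFrames aligned to Tokens."""
    frames = {a["id"]: a["properties"]
              for a in view["annotations"] if a["@type"].endswith("TimeFrame")}
    aligned = [frames[a["properties"]["target"]]
               for a in view["annotations"]
               if a["@type"].endswith("Alignment")
               and a["properties"]["target"] in frames]
    if not aligned:
        return None
    return (min(f["start"] for f in aligned), max(f["end"] for f in aligned))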

Anyway, if we want to keep both alignments, I think we can also make Kaldi generate two sets of alignments, each of which keeps its own sourceType/targetType metadata.

keighrim commented 4 years ago

The core ideas of the proposal in the #59 branch and my long comment above are:

  1. separation between unitization and labeling annotations
  2. a single place to keep segmentations and alignments for easy sorting and merging

So, keeping that in mind, here are some of my thoughts on the above remarks:

I don't see why the above proposal makes it really complicated to have a TextDocument inside the annotations list. But if that is the case we need to introduce sub media again and make the media list editable.

Because I want a single place to put segmentation lists, I also want all document objects in a single list, not scattered over many views.

For the segments list we have the by property on every single element, and that does not feel right.

I now think that we can purge all origin-related metadata from segment objects. Instead we can keep, in the app's own view, a list of the segment IDs that the app generated. We can mark the origin of an alignment in the same way. For example, the earlier example after forced alignment could look something like this (I picked verbs as field names to avoid name conflicts, and renamed annotations to a verb as well to match the style, but it's just exemplary):

{
  "documents": [
    {
      "id": "m1",
      "type": "https://.../VideoDocument",
      "mime": "video/mp4",
      "location": "/var/archive/video-0012.mp4",
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "17", "end": "1064" }, # Karen 
        },
        ...
        {
          "id": "s6",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TimeSegment",
          "properties": { "start": "5252", "end": "6123" }, # City
        }
      ]
    },
    {
      "id": "m2",
      "type": "https://.../TextDocument",
      "mime": "text/plain",
      "text": {
        "@value": "Karen flew to New York City.",
        "@language": "en"
      },
      "segments": [
        {
          "id": "s1",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "0", "end": "5" }, # Karen
        },
        ...
        {
          "id": "s7",
          "@type": "http://mmif.clams.ai/0.1.0/vocabulary/TextSegment",
          "properties": { "start": "27", "end": "28" }, # .
        }
      ]
    }
  ],
  "alignments": [
    {
      "id": "a1", # not sure this many metadata is useful for alignment lists
      "source": "m1",
      "target": "m2",
      "sourceType": "https://.../VideoDocument",
      "targetType": "https://.../TextDocument",
      "aligned": [
        [ "m1:s1", "m2:s1" ],
        [ "m1:s2", "m2:s2" ],
        [ "m1:s3", "m2:s3" ],
        [ "m1:s4", "m2:s4" ],
        [ "m1:s5", "m2:s5" ],
        [ "m1:s6", "m2:s6" ]
      ]
    }
  ],
  "views": [
    {
      "id": "v1",
      "@context": "http://mmif.clams.ai/0.1.0/context/vocab-clams.json",
      "metadata": {
        "timestamp": "2020-05-27T12:23:45",
        "app": "http://tools.clams.ai/gentle-forced-aligner"
      },
      "segmented": ["m1:s1", "m1:s2", ... "m2:s7"],
      "annotated": [], # no *labeling* done in this view
      "aligned": ["a1"]
    }
  ]
}

marcverhagen commented 4 years ago

Because I want a single place to put segmentation lists, I also want all document objects in a single list, not scattered over many views.

Ah, I see. That does mean we would need to add structure to the documents list, unless we are willing to have potentially very long lists, with all the text snippets that Tesseract found in all the text boxes as top-level documents.

I now think that we can purge all origin-related metadata from segment objects. Instead we can keep, in the app's own view, a list of the segment IDs that the app generated. ...

I think you will have to run me through that example in our CLAMS meeting today. Is it actually implying that alignments are not annotation types anymore? Or is "a1" in the "aligned" list an alignment that is spelled out in the annotations of the view?

keighrim commented 4 years ago

With the separation of unitization from labeling, I think proposal #59 introduces four types of annotations:

  1. unitization (represented in segmentations)
  2. labeling (represented in view>annotations)
  3. linking between units (represented in alignments)
  4. linking between labels (represented in view>annotations as relation annotations - subtypes of @type:GenericRelation)

I know that, in proposal #59 as currently drafted, the places where these different types of annotations live and their representations are awfully inconsistent (and probably quite redundant). I'm hoping to find improvements, but at the same time I'm also a bit doubtful whether this approach is headed in the right direction. Maybe I'm just making the problem a lot more complicated than it is.

marcverhagen commented 4 years ago

Maybe I'm just making the problem a lot more complicated than it is.

Maybe, maybe not. I do think that what you propose makes the representation more complex; what we need to decide is whether it is worth it and whether other approaches may be simpler overall.

What we already agree on is that this would not be for version 0.2.0. I will try to write down the current shape of 0.2.0 before the meeting this afternoon.

keighrim commented 3 years ago

As all changes discussed in this thread are specified in spec-0.2.0, closing this issue.