clamsproject / mmif

MultiMedia Interchange Format
Apache License 2.0

recording non-annotation binary files generated during app `annotate()` #205

Open keighrim opened 1 year ago

keighrim commented 1 year ago

(We had this discussion many times in the past, and each time we ended up concluding internally that the idea should be dropped. So I thought we must have a GH issue on this, but I couldn't find one here or in the mmif-python and clams-python repos. If anyone knows that such an issue exists and this is a duplicate, please close this as a duplicate.)


When an app generates binary files as a part of its processing (in the middle of `annotate()`), those files could be re-used either by other CLAMS apps pipelined directly downstream or in post-processing (including by consumers), if we had machinery to store the intermediate binary files in a consistent way.

Here's an excerpt from a recent discussion on Slack:


Owen King What I'm wondering about is where CLAMS output goes when it doesn't fit into MMIF. At least as I understand the CLAMS framework, each app is a function from an item and an MMIF file to a new (or edited) MMIF file. So, what do we do with a still image (or any other supplementary binary data) created by a CLAMS app? I see three options:

  1. Embed the binary data in MMIF. That seems fine for a few kilobytes but potentially awkward for larger blobs.
  2. Create an additional file and store it elsewhere, as a side effect of the CLAMS processing. My reservation about this is that it violates the principle that a CLAMS app is simply a function from an item and an MMIF file to an MMIF file.
  3. Give information about how to create that extra output -- e.g., give the timestamp of the coordinates of a still image to be extracted from a video -- and then have some other application come in later to perform the extraction. The problem with this approach is that the supplementary application will have to redo processing that the CLAMS app already did; so it's not very efficient.
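To put a rough number on the concern in option 1, here is a minimal sketch of what embedding costs in size. It assumes the binary data would be base64-encoded into a JSON string, which the discussion does not specify; `embedded_size` is a hypothetical helper, not part of any CLAMS or MMIF library.

```python
import base64


def embedded_size(blob: bytes) -> int:
    """Return the size in bytes of `blob` after base64 encoding.

    Assumption: binary data embedded in a JSON-based file would be
    base64-encoded; the thread does not name an encoding.
    """
    return len(base64.b64encode(blob))


# A 3 KiB blob grows to 4 KiB: base64 emits 4 output bytes per 3 input bytes.
small_blob = bytes(3 * 1024)
print(embedded_size(small_blob))  # 4096

# The same 4/3 ratio applies to larger blobs, on top of the annotations
# already present in the file.
larger_blob = bytes(3 * 1024 * 1024)
print(embedded_size(larger_blob) / len(larger_blob))
```

The overhead is a constant factor of about one third, so the cost is negligible for small stills but adds up quickly for audio or video payloads.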

Keigh Rim Throwing my few cents regarding those options (from our Brandeis-internal discussion on this issue);

  1. We didn’t want to embed binary data inside MMIF files, as MMIF files themselves easily bloat to dozens or hundreds of MB with long videos and multiple apps’ annotations.
  2. One example we used was “demux” results. When a pipeline with multiple audio apps is fed a video file, should the first audio app store the demuxed audio files somewhere else and pass them down to the other downstream audio apps? Technically, if the storage “address” (whether it’s a local FS or a remote location over HTTP) stays consistent throughout the rest of the pipeline operation, we can insert a new document object pointing to the intermediate media (binary) files into the intermediate MMIF output.
  3. Currently, this is the way it works. And I agree that it is awfully inefficient. Going back to the demuxing example, we found that demuxing the audio stream from a 1hr video takes 2-3 mins at most, so that’s probably why we just decided to swallow the cost.
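The idea in option 2 can be sketched as follows: write the binary file to a location that stays stable for the rest of the pipeline, then record a document entry pointing at it. This is only an illustration using plain dictionaries. The `record_binary` helper and the `id`, `mime`, and `location` keys as laid out here are assumptions for the sketch, not the MMIF specification or the mmif-python API.

```python
import tempfile
from pathlib import Path


def record_binary(mmif: dict, blob: bytes, out_dir: Path,
                  doc_id: str, mime: str) -> dict:
    """Store `blob` under `out_dir` and add a document entry pointing to it.

    Hypothetical helper: the entry layout below is simplified and is not
    guaranteed to match the real MMIF document schema.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / doc_id
    path.write_bytes(blob)
    mmif.setdefault("documents", []).append({
        "id": doc_id,
        "mime": mime,
        # A file:// URI; a remote store could use an http(s):// URI instead.
        "location": path.resolve().as_uri(),
    })
    return mmif


# Usage: a first audio app records the demuxed audio once, and downstream
# apps read the location back instead of demuxing again.
with tempfile.TemporaryDirectory() as tmp:
    mmif = {"documents": []}
    record_binary(mmif, b"fake-audio-bytes", Path(tmp), "demuxed.wav", "audio/wav")
    print(mmif["documents"][0]["location"])
```

This only works if every downstream app can resolve the recorded location, which is exactly the consistency condition raised above.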

Owen King Thanks for those insights krim. I do find Option 3 attractive; it seems the most elegant. But I assume the inefficiency will be too much to bear for larger jobs. I'm also thinking about Option 2 from the standpoint of not only intermediate binary files but also output binary files that are the final result of a CLAMS process. As I said in a recent meeting, one thing that would be potentially useful to GBH would be the extraction of still images of slates and chyrons. These stills could be useful for several purposes. In particular, some collections might warrant careful manual cataloging. In that case, having the slates and chyrons extracted in advance could be a big speed-up for the human process. Plus, although it's not something we are doing right now, it might be nice to be able to display the still image of the slate to the user on the AAPB website. The extracted stills wouldn't be a replacement for full slate and chyron tools that included OCR, NER, key-value pairing, and Wikidata linking. But they could be potentially very useful nevertheless. For those reasons, it might actually be a good idea to have a way to store binary output from a CLAMS app. I just don't know how that fits into the overall CLAMS framework.
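For comparison, option 3 (record only how to produce the still, and extract it later) might look like the sketch below. It assumes that timestamps are stored in milliseconds and that a later tool uses the `ffmpeg` command line to grab one frame. Neither detail comes from the discussion, and `still_extraction_command` is a hypothetical helper. The sketch only builds the command and does not run it.

```python
from typing import List


def still_extraction_command(video_path: str, timestamp_ms: int,
                             out_path: str) -> List[str]:
    """Build an ffmpeg command that extracts a single frame at a timestamp.

    Assumption: a later post-processing step runs ffmpeg; the thread does
    not say which tool would perform the extraction.
    """
    seconds = timestamp_ms / 1000
    return [
        "ffmpeg",
        "-ss", f"{seconds:.3f}",  # seek to the recorded timestamp
        "-i", video_path,
        "-frames:v", "1",         # write exactly one video frame
        out_path,
    ]


# Hypothetical records an app could emit instead of the image itself.
recorded = [
    {"label": "slate", "timestamp_ms": 12500},
    {"label": "chyron", "timestamp_ms": 93000},
]
for item in recorded:
    cmd = still_extraction_command("video.mp4", item["timestamp_ms"],
                                   f"{item['label']}.png")
    print(" ".join(cmd))
```

The output file stays small and the app stays a pure function, but the video has to be opened and decoded a second time, which is the inefficiency noted for this option.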