clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

MMIF storage API #50

Open keighrim opened 2 weeks ago

keighrim commented 2 weeks ago

New Feature Summary


Right now, all the prediction/hypothesis/experimental MMIF output from CLAMS pipelines is pushed to this GitHub repo for evaluation, along with back-pointers in the report.md files. That works fine until we hit the storage limit on the GH repo. To future-proof, we'd like to develop a systematic way of storing and retrieving MMIF output files in a more spacious (and maybe more private) storage solution.

storage side

indexing

To hold a large collection of MMIF data, I'm proposing we implement a kind of trie-based indexing system. The actual files can be stored in, say, S3 buckets or on lab servers (we have plenty of HDD space anyway). The envisioned trie is keyed simply on the apps used in the MMIF file, each split into shortname and version. This way, all the necessary "configuration" for the store API is saved inside the data payload itself, and we don't have to come up with an additional configuration scheme for the store API. In other words, users can call the store API with just the MMIF file itself.

storage API result example

For example, suppose we use a directory structure for indexing, and want to store a MMIF file cpb-aacip-xxxx.mmif generated from a pipeline consisting of

  1. swt/v5.0
  2. doctr/v2.0
  3. whisper/v1.6

When the user sends the file

curl -d@cpb-aacip-xxxx.mmif mmif-storage.clams.ai/store

the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif

Then later, if we have a second MMIF file to store from the same pipeline, except now with whisper v2.0, the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v2.0/cpb-aacip-xxxx.mmif

This will result in a file-system-based store that looks like this at this point:

some_data_root/
└── swt
    └── v5.0
        └── doctr
            └── v2.0
                └── whisper
                    ├── v1.6
                    │   └── cpb-aacip-xxxx.mmif
                    └── v2.0
                        └── cpb-aacip-xxxx.mmif
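To make the indexing concrete, here's a minimal sketch of how a store endpoint could derive that path from the payload alone (`storage_path` is a hypothetical name, and the raw JSON parsing stands in for whatever mmif-python offers; the only assumption is that each view's `metadata.app` is an app identifier URL ending in `<shortname>/<version>`):

```python
import json
from pathlib import PurePosixPath

def storage_path(mmif_json: str, guid: str, data_root: str = "/some_data_root") -> str:
    # Walk the views in order; each view's metadata.app ends in
    # .../<shortname>/<version>, which gives the trie path for this MMIF.
    mmif = json.loads(mmif_json)
    parts = []
    for view in mmif["views"]:
        shortname, version = view["metadata"]["app"].rstrip("/").split("/")[-2:]
        parts.extend([shortname, version])
    return str(PurePosixPath(data_root, *parts, guid + ".mmif"))

example = json.dumps({"views": [
    {"metadata": {"app": "http://apps.clams.ai/swt/v5.0"}},
    {"metadata": {"app": "http://apps.clams.ai/doctr/v2.0"}},
    {"metadata": {"app": "http://apps.clams.ai/whisper/v1.6"}},
]})
print(storage_path(example, "cpb-aacip-xxxx"))
# /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
```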

retrieval side

retriever API argument structure

Now on the retrieval side, the retrieval API should expect two string arguments:

  1. the pipeline configuration, concatenated into a single string
  2. the AAPB media GUID

simple retrieval

Then the retriever can convert the first argument into a directory path, and look for the second argument in that directory.

curl 'mmif-storage.clams.ai/retrieve?pipeline=swt/v5.0:doctr/v2.0:whisper/v1.6&guid=cpb-aacip-xxxx'
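The pipeline-string-to-path conversion is trivial since the serialization scheme mirrors the directory layout; a sketch (`retrieval_path` is a hypothetical name):

```python
from pathlib import PurePosixPath

def retrieval_path(pipeline: str, guid: str, data_root: str = "/some_data_root") -> str:
    # "swt/v5.0:doctr/v2.0:whisper/v1.6" -> the same directory path used at store time
    return str(PurePosixPath(data_root, *pipeline.split(":"), guid + ".mmif"))

print(retrieval_path("swt/v5.0:doctr/v2.0:whisper/v1.6", "cpb-aacip-xxxx"))
# /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
```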

retrieval with rewind

However, in addition to simple file retrieval, we can dynamically "rewind" MMIFs if any descendant MMIF exists. For example, if the user asks for pipeline=swt/v5.0:doctr/v2.0 and no file is stored at that path, the retriever can "walk down" the subdirectories until it finds the first MMIF, then use https://github.com/clamsproject/clams-python/issues/190 to return a partial MMIF that meets the user's request.
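A sketch of that walk-down for a file-system backend (`find_descendant_mmif` is a hypothetical name; the actual rewinding of the found file would be done by the rewinder from the clams-python issue above):

```python
import os

def find_descendant_mmif(prefix_dir: str, guid: str):
    # Starting at the directory for the requested (missing) pipeline prefix,
    # walk down the subdirectories and return the first stored MMIF for this
    # GUID; the caller can then rewind it to the requested pipeline depth.
    filename = guid + ".mmif"
    for dirpath, dirnames, filenames in os.walk(prefix_dir):
        dirnames.sort()  # make the walk order deterministic
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None  # no descendant MMIF exists for this GUID
```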

automatic garbage collection

Given the power of rewind, we can delete any intermediate MMIF and keep only the files in the terminal subdirectories. This can be a cron job (if using a file system) or more sophisticated DB management.
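As a sketch, a naive file-system cron job could look like the following. It deletes every MMIF sitting in a non-leaf directory, matching the simplification above; a real implementation should first verify that a descendant MMIF for the same GUID actually exists before deleting:

```python
import os

def collect_garbage(data_root: str) -> list:
    # Any directory that still has subdirectories is a non-terminal node of
    # the trie; MMIFs stored there are (assumed) recoverable via rewind from
    # a deeper MMIF, so they can be deleted.
    removed = []
    for dirpath, dirnames, filenames in os.walk(data_root):
        if dirnames:  # not a leaf directory
            for f in filenames:
                if f.endswith(".mmif"):
                    path = os.path.join(dirpath, f)
                    os.remove(path)
                    removed.append(path)
    return removed
```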


keighrim commented 2 weeks ago

One big piece I missed in the above description is the runtime configuration of the apps. I think we can treat configurations as part of the app identification. Namely, instead of [shortname]/[version] we can have [shortname]/[version]/[param1-val1]/[param2-val2] in the pipeline serialization scheme, where the "param" parts are alphabetically sorted for easy retrieval.

A few problems with this implementation using a directory structure:

  1. for a multivalued parameter, we can either flatten the values into many (a series of) subdirectories, or concatenate them into one. For the latter, we need to introduce another arbitrary "syntax" into the serialization scheme.
  2. parameter values can be pretty much anything, including whitespace, newlines, etc., which can't be safely used in a directory name
  3. parameter values can be of any length, while file paths have a length limit.
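One possible mitigation for all three problems, sketched with hypothetical names and an arbitrarily chosen length cap: join sorted multi-values with a separator, percent-encode unsafe characters, and fall back to a fixed-length digest when the result is too long for a path segment:

```python
import hashlib
from urllib.parse import quote

MAX_SEGMENT = 64  # hypothetical cap to stay under file-path length limits

def param_segment(name: str, values) -> str:
    # Serialize one runtime parameter as a single directory name.
    if not isinstance(values, (list, tuple)):
        values = [values]
    # problem 1: sort and join multi-values into one segment
    joined = ",".join(sorted(str(v) for v in values))
    # problem 2: percent-encode whitespace, newlines, slashes, etc.
    seg = "%s-%s" % (name, quote(joined, safe=""))
    # problem 3: overlong values are replaced by a short, stable digest
    if len(seg) > MAX_SEGMENT:
        seg = "%s-%s" % (name, hashlib.sha256(joined.encode()).hexdigest()[:16])
    return seg

def app_segments(shortname: str, version: str, params: dict) -> list:
    # [shortname]/[version]/[param1-val1]/[param2-val2], params sorted by name
    return [shortname, version] + [param_segment(k, params[k]) for k in sorted(params)]

print(app_segments("swt", "v5.0", {"threshold": 0.5, "model": "base"}))
# ['swt', 'v5.0', 'model-base', 'threshold-0.5']
```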
keighrim commented 2 weeks ago

Another aspect of the problem is that not all pipelines are "serial" (some components in the pipeline can be in parallel), although with some careful consideration and design, we should be able to serialize them into a single string identifier.

For example, say we want to 1) force-align a transcript, 2) run NER on the transcript, and 3) find NE temporal locations in the video. Component 2 (NER) does not rely on the output of component 1 (FA), and this is the case I mean by "in parallel" (they don't need to literally run side by side as software).
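One way to get a single deterministic identifier out of such a branching pipeline is a topological sort with alphabetical tie-breaking. The app names below are made up, and how the dependency graph would be recovered from MMIF views is left open:

```python
def serialize_pipeline(deps: dict) -> str:
    # deps maps each "shortname/version" step to the set of steps whose
    # output it consumes (hypothetical representation; assumes no cycles).
    ordered, done = [], set()
    while len(done) < len(deps):
        # among steps whose dependencies are all satisfied, pick the
        # alphabetically first, so parallel branches serialize deterministically
        ready = sorted(s for s in deps if s not in done and deps[s] <= done)
        ordered.append(ready[0])
        done.add(ready[0])
    return ":".join(ordered)

deps = {
    "fa/v1.0": set(),                        # force aligner, independent
    "ner/v1.2": set(),                       # NER, also independent (parallel to FA)
    "linker/v0.3": {"fa/v1.0", "ner/v1.2"},  # needs both outputs
}
print(serialize_pipeline(deps))  # fa/v1.0:ner/v1.2:linker/v0.3
```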

MrSqually commented 2 weeks ago

here are some observations / thoughts / questions at the moment:

Indexing

Retrieval

No notes here; the proposed implementation of retrieval (using mmif-rewinder) seems like a perfectly valid approach.