clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

MMIF storage API #50

Open keighrim opened 2 weeks ago

keighrim commented 2 weeks ago

New Feature Summary


Right now, all the prediction/hypothesis/experimental MMIF output from CLAMS pipelines is pushed to this GitHub repo for evaluation, along with back-pointers in the report.md files. That works fine until we hit the storage limit on the GH repo. To future-proof, we'd like to develop a systematic way of storing and retrieving MMIF output files in a more spacious (and maybe more private) storage solution.

storage side

indexing

To hold a large collection of MMIF data, I'm proposing we implement a kind of trie-based indexing system. The actual files can be stored in, say, S3 buckets or on lab servers (we have plenty of HDD space anyway). The envisioned trie is keyed simply on the apps used in the MMIF file, each split into shortname and version. This way, all the necessary "configuration" for the store API is saved inside the data payload itself, and we don't have to come up with an additional configuration scheme for the store API. In other words, users can call the store API with just the MMIF file itself.

storage API result example

For example, suppose we use a directory structure for indexing, and want to store a MMIF file cpb-aacip-xxxx.mmif generated from a pipeline consisting of

  1. swt/v5.0
  2. doctr/v2.0
  3. whisper/v1.6

When the user sends the file

curl -d@cpb-aacip-xxxx.mmif mmif-storage.clams.ai/store

the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif

Then later, if we have a second MMIF file to store from the same pipeline, except now with whisper v2.0, the file is saved as /some_data_root/swt/v5.0/doctr/v2.0/whisper/v2.0/cpb-aacip-xxxx.mmif

This will result in a file-system-based store that looks like this at this point:

some_data_root/
└── swt
    └── v5.0
        └── doctr
            └── v2.0
                └── whisper
                    ├── v1.6
                    │   └── cpb-aacip-xxxx.mmif
                    └── v2.0
                        └── cpb-aacip-xxxx.mmif
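To make the indexing concrete, here's a minimal sketch of how a store endpoint could derive that path from the payload alone (`storage_path` is a hypothetical name, and the raw JSON parsing stands in for whatever mmif-python offers; the only assumption is that each view's `metadata.app` is an app identifier URL ending in `<shortname>/<version>`):

```python
import json
from pathlib import PurePosixPath

def storage_path(mmif_json: str, guid: str, data_root: str = "/some_data_root") -> str:
    # Walk the views in order; each view's metadata.app ends in
    # .../<shortname>/<version>, which gives the trie path for this MMIF.
    mmif = json.loads(mmif_json)
    parts = []
    for view in mmif["views"]:
        shortname, version = view["metadata"]["app"].rstrip("/").split("/")[-2:]
        parts.extend([shortname, version])
    return str(PurePosixPath(data_root, *parts, guid + ".mmif"))

example = json.dumps({"views": [
    {"metadata": {"app": "http://apps.clams.ai/swt/v5.0"}},
    {"metadata": {"app": "http://apps.clams.ai/doctr/v2.0"}},
    {"metadata": {"app": "http://apps.clams.ai/whisper/v1.6"}},
]})
print(storage_path(example, "cpb-aacip-xxxx"))
# /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
```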

retrieval side

retriever API argument structure

Now on the retrieval side, the retrieval API should expect two string arguments:

  1. the pipeline configuration, concatenated into a single string
  2. the AAPB media GUID

simple retrieval

Then the retriever can convert the first argument into a directory path, and look for the second argument in that directory.

curl 'mmif-storage.clams.ai/retrieve?pipeline=swt/v5.0:doctr/v2.0:whisper/v1.6&guid=cpb-aacip-xxxx'
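The pipeline-string-to-path conversion is trivial since the serialization scheme mirrors the directory layout; a sketch (`retrieval_path` is a hypothetical name):

```python
from pathlib import PurePosixPath

def retrieval_path(pipeline: str, guid: str, data_root: str = "/some_data_root") -> str:
    # "swt/v5.0:doctr/v2.0:whisper/v1.6" -> the same directory path used at store time
    return str(PurePosixPath(data_root, *pipeline.split(":"), guid + ".mmif"))

print(retrieval_path("swt/v5.0:doctr/v2.0:whisper/v1.6", "cpb-aacip-xxxx"))
# /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
```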

retrieval with rewind

However, in addition to simple file retrieval, we can dynamically "rewind" MMIFs if any descendant MMIF exists. For example, if the user asks for pipeline=swt/v5.0:doctr/v2.0 and no file is stored at that path, the retriever can "walk down" the subdirectories until it finds the first MMIF, then use https://github.com/clamsproject/clams-python/issues/190 to return a partial MMIF that meets the user's request.
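A sketch of that walk-down for a file-system backend (`find_descendant_mmif` is a hypothetical name; the actual rewinding of the found file would be done by the rewinder from the clams-python issue above):

```python
import os

def find_descendant_mmif(prefix_dir: str, guid: str):
    # Starting at the directory for the requested (missing) pipeline prefix,
    # walk down the subdirectories and return the first stored MMIF for this
    # GUID; the caller can then rewind it to the requested pipeline depth.
    filename = guid + ".mmif"
    for dirpath, dirnames, filenames in os.walk(prefix_dir):
        dirnames.sort()  # make the walk order deterministic
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None  # no descendant MMIF exists for this GUID
```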

automatic garbage collection

Given the power of rewind, we can delete any intermediate MMIF and keep only the files in the terminal subdirectories. This can be a cron job (if using a file system) or more sophisticated DB management.
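As a sketch, a naive file-system cron job could look like the following. It deletes every MMIF sitting in a non-leaf directory, matching the simplification above; a real implementation should first verify that a descendant MMIF for the same GUID actually exists before deleting:

```python
import os

def collect_garbage(data_root: str) -> list:
    # Any directory that still has subdirectories is a non-terminal node of
    # the trie; MMIFs stored there are (assumed) recoverable via rewind from
    # a deeper MMIF, so they can be deleted.
    removed = []
    for dirpath, dirnames, filenames in os.walk(data_root):
        if dirnames:  # not a leaf directory
            for f in filenames:
                if f.endswith(".mmif"):
                    path = os.path.join(dirpath, f)
                    os.remove(path)
                    removed.append(path)
    return removed
```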


keighrim commented 2 weeks ago

One big piece I missed in the above description is the runtime configuration of the apps. I think we can treat configurations as part of the app identification. Namely, instead of [shortname]/[version] we can have [shortname]/[version]/[param1-val1]/[param2-val2] in the pipeline serialization scheme, where the "param" parts are alphabetically sorted for easy retrieval.

A few problems with this implementation using a directory structure:

  1. for a multivalued parameter, we can either flatten the values into many (a series of) subdirectories, or concatenate them into one. For the latter, we need to introduce another arbitrary "syntax" into the serialization scheme.
  2. parameter values can be pretty much anything, including whitespace, newlines, etc., which can't be safely used in a directory name
  3. parameter values can be of any length, while file paths have a length limit.
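One possible mitigation for all three problems, sketched with hypothetical names and an arbitrarily chosen length cap: join sorted multi-values with a separator, percent-encode unsafe characters, and fall back to a fixed-length digest when the result is too long for a path segment:

```python
import hashlib
from urllib.parse import quote

MAX_SEGMENT = 64  # hypothetical cap to stay under file-path length limits

def param_segment(name: str, values) -> str:
    # Serialize one runtime parameter as a single directory name.
    if not isinstance(values, (list, tuple)):
        values = [values]
    # problem 1: sort and join multi-values into one segment
    joined = ",".join(sorted(str(v) for v in values))
    # problem 2: percent-encode whitespace, newlines, slashes, etc.
    seg = "%s-%s" % (name, quote(joined, safe=""))
    # problem 3: overlong values are replaced by a short, stable digest
    if len(seg) > MAX_SEGMENT:
        seg = "%s-%s" % (name, hashlib.sha256(joined.encode()).hexdigest()[:16])
    return seg

def app_segments(shortname: str, version: str, params: dict) -> list:
    # [shortname]/[version]/[param1-val1]/[param2-val2], params sorted by name
    return [shortname, version] + [param_segment(k, params[k]) for k in sorted(params)]

print(app_segments("swt", "v5.0", {"threshold": 0.5, "model": "base"}))
# ['swt', 'v5.0', 'model-base', 'threshold-0.5']
```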
keighrim commented 2 weeks ago

Another aspect of the problem is that not all pipelines are "serial" (some components in the pipeline can be in parallel), although with some careful consideration and design, we should be able to serialize them into a single string identifier.

For example, say we want to 1) force-align a transcript, 2) run NER on the transcript, and 3) find NE temporal locations in the video. Component 2 (NER) does not rely on the output of component 1 (FA), and this is the case I mean by "in parallel" (they don't need to literally run side by side as software).
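One way to get a single deterministic identifier out of such a branching pipeline is a topological sort with alphabetical tie-breaking. The app names below are made up, and how the dependency graph would be recovered from MMIF views is left open:

```python
def serialize_pipeline(deps: dict) -> str:
    # deps maps each "shortname/version" step to the set of steps whose
    # output it consumes (hypothetical representation; assumes no cycles).
    ordered, done = [], set()
    while len(done) < len(deps):
        # among steps whose dependencies are all satisfied, pick the
        # alphabetically first, so parallel branches serialize deterministically
        ready = sorted(s for s in deps if s not in done and deps[s] <= done)
        ordered.append(ready[0])
        done.add(ready[0])
    return ":".join(ordered)

deps = {
    "fa/v1.0": set(),                        # force aligner, independent
    "ner/v1.2": set(),                       # NER, also independent (parallel to FA)
    "linker/v0.3": {"fa/v1.0", "ner/v1.2"},  # needs both outputs
}
print(serialize_pipeline(deps))  # fa/v1.0:ner/v1.2:linker/v0.3
```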

MrSqually commented 2 weeks ago

here are some observations / thoughts / questions at the moment:

Indexing

Retrieval

No notes here; the proposed implementation of retrieval (using mmif-rewinder) seems like a perfectly valid approach.