keighrim opened 2 weeks ago
One big piece I missed in the above description was runtime configurations of the apps. I think we can treat them just like they are part of app identification. Namely, instead of `[shortname]/[version]`, we can have `[shortname]/[version]/[param1-val1]/[param2-val2]` in the pipeline serialization scheme, where the "param" parts are alphabetically sorted for easy retrieval.
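A minimal sketch of that serialization (the helper name and the `name-value` joining convention are assumptions for illustration, not an existing API):

```python
def serialize_app(shortname: str, version: str, params: dict) -> str:
    """Build a path fragment like '[shortname]/[version]/[param1-val1]/...'.

    Hypothetical helper: parameter segments are sorted alphabetically so the
    same configuration always serializes to the same path.
    """
    segments = [shortname, version]
    for name in sorted(params):
        segments.append(f"{name}-{params[name]}")
    return "/".join(segments)

print(serialize_app("swt", "v5.0", {"unit": "ms", "threshold": 0.5}))
# swt/v5.0/threshold-0.5/unit-ms
```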
A few problems with this implementation using a directory structure:

- For a multivalued parameter, we can either flatten the values into many (a series of) subdirectories, or concatenate them into one. For the latter, we need to introduce another arbitrary "syntax" to the serialization scheme.
- Not all pipelines are "serial" (some components in a pipeline can be in parallel), although with some careful consideration and design, we should be able to serialize them into a single string identifier. For example, say that we want to 1) force-align a transcript, 2) run NER on the transcript, and 3) find NE temporal locations in the video. Component 2 (NER) does not rely on the output of component 1 (FA), and this is the case I mean by "in parallel" (they don't need to "run side-by-side" as software).
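To make the two multivalued-parameter options concrete (both helpers are hypothetical sketches, and the `+` separator in the second is exactly the kind of arbitrary syntax mentioned above):

```python
def flatten(name: str, values: list) -> str:
    # option 1: one subdirectory per value -> a series of subdirectories
    return "/".join(f"{name}-{v}" for v in values)

def concatenate(name: str, values: list, sep: str = "+") -> str:
    # option 2: a single subdirectory, at the cost of inventing a separator "syntax"
    return f"{name}-{sep.join(str(v) for v in values)}"

print(flatten("label", ["bars", "slate"]))      # label-bars/label-slate
print(concatenate("label", ["bars", "slate"]))  # label-bars+slate
```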
Here are some observations / thoughts / questions at the moment:

- One option is to keep `app/version:app/version` as the fundamental layout, and encode each app's runtime configuration as a serialized dict (`{parameters:representations}`) attached to the app segment. The con of this is that it's really bad for readability, arguably worse than introducing another arbitrary syntax, and requires some internal work to actually expand/compress the parameters (i.e., maybe it's overkill).
- For "parallel" components, there's no obvious place for appB within appA's path if they're completely disconnected: there are issues with replication across different pipelines, potential duplicate storage of `appA/v1:appB/v1` and `appB/v1:appA/v1`, etc. But since the directory structure is approximating a "timeline" of sorts, maybe Occam's razor here is to just treat the apps in the order they appear in the MMIF, and then provide that ordering as a standard within the documentation. This one is messy though, and I don't have any solutions I'm completely confident proposing.
- No notes here; the proposed implementation of retrieval (using mmif-rewinder) seems like a perfectly valid approach.
New Feature Summary
Right now, all the prediction/hypothesis/experimental MMIF output from CLAMS pipelines is pushed to this GitHub repo for evaluation, along with back-pointers in the report.md files. That works fine until we hit the storage limit on the GH repo. To future-proof, we'd like to develop a systematic way of storing and retrieving MMIF output files in a more spacious (and maybe more private) storage solution.
storage side
indexing
To hold a large collection of MMIF data, I'm proposing we implement a kind of trie-based indexing system. The actual files can be stored in, say, S3 buckets or on lab servers (we have plenty of HDD space anyway). The envisioned trie implementation is simply based on the apps used in the MMIF file, split into shortname and version. This way, all the necessary "configuration" for the store API is saved inside the data payload itself, and we don't have to come up with an additional configuration scheme for the store API. In other words, a user can call the store API simply with the MMIF file itself.
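A minimal sketch of deriving the trie path straight from the MMIF payload; the JSON shape shown and the assumption that each view's `metadata.app` URL ends in `<shortname>/<version>` are mine, not a documented contract:

```python
import json
from pathlib import PurePosixPath

def storage_path(mmif_json: str, filename: str, root: str = "/some_data_root") -> str:
    """Derive the trie path for a MMIF file from the apps recorded in its views.

    Assumes each view's metadata.app is a URL ending in .../<shortname>/<version>,
    and that view order reflects pipeline order.
    """
    doc = json.loads(mmif_json)
    parts = [root]
    for view in doc["views"]:
        shortname, version = view["metadata"]["app"].rstrip("/").split("/")[-2:]
        parts.extend([shortname, version])
    parts.append(filename)
    return str(PurePosixPath(*parts))

example = json.dumps({"views": [
    {"metadata": {"app": "http://apps.clams.ai/swt/v5.0"}},
    {"metadata": {"app": "http://apps.clams.ai/doctr/v2.0"}},
    {"metadata": {"app": "http://apps.clams.ai/whisper/v1.6"}},
]})
print(storage_path(example, "cpb-aacip-xxxx.mmif"))
# /some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
```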
storage API result example
For example, say we use a directory structure for indexing, and want to store a MMIF file `cpb-aacip-xxxx.mmif` generated from a pipeline consisting of swt v5.0, doctr v2.0, and whisper v1.6. When the user sends the file, it is saved as
/some_data_root/swt/v5.0/doctr/v2.0/whisper/v1.6/cpb-aacip-xxxx.mmif
Then later, if we have a second MMIF file to store using the same pipeline, except now with whisper v2.0, the file is saved as
/some_data_root/swt/v5.0/doctr/v2.0/whisper/v2.0/cpb-aacip-xxxx.mmif
This will result in a file-system-based storage that looks like this at this point:
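Reconstructing the tree from the two example paths above, the storage would look like:

```
/some_data_root
└── swt
    └── v5.0
        └── doctr
            └── v2.0
                └── whisper
                    ├── v1.6
                    │   └── cpb-aacip-xxxx.mmif
                    └── v2.0
                        └── cpb-aacip-xxxx.mmif
```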
retrieval side
retriever API argument structure
Now, on the retrieval side, a retrieval API should expect two string arguments: a pipeline specification and a file name.
simple retrieval
Then, the retriever can convert the first argument into a directory path, and look for the second argument in the directory.
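A sketch of that lookup (the helper name is hypothetical; the `:`-separated pipeline string, as in `swt/v5.0:doctr/v2.0:whisper/v1.6`, follows the serialization scheme in this issue):

```python
from pathlib import Path

def retrieve(pipeline: str, filename: str, root: str = "/some_data_root") -> Path:
    """Resolve a ':'-separated pipeline spec to a stored MMIF file.

    Hypothetical helper: each 'shortname/version' piece of the pipeline spec
    becomes one level-pair of the directory path.
    """
    target = Path(root).joinpath(*pipeline.split(":")) / filename
    if not target.exists():
        raise FileNotFoundError(target)
    return target
```

For `pipeline=swt/v5.0:doctr/v2.0:whisper/v1.6` and `cpb-aacip-xxxx.mmif`, this resolves to the first path stored above.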
retrieval with rewind
However, in addition to simple file retrieval, we can dynamically "rewind" MMIFs if any descendant MMIF exists. For example, if the user asks for `pipeline=swt/v5.0:doctr/v2.0`, even though that file is not stored in the storage system, the retriever can continue to "walk down" the subdirectories until it finds the first MMIF, then use https://github.com/clamsproject/clams-python/issues/190 to return a partial MMIF that meets the user's request.

automatic garbage collection
Given the power of rewind, we can always delete any intermediate MMIF and keep only the files in the terminal subdirectories. This can be done via a cronjob (if using a file system) or more sophisticated DB management.
Related
Alternatives
No response
Additional context
No response