hanhou opened 1 week ago
From David:
I discussed our options with Jake from CO today. He proposed a variation on our architecture:
1. A script/capsule that queries for sessions of interest, creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits.
2. The analysis pipeline runs. Nextflow's cache will not re-run any jobs that have already been run on data within that combined data asset, which is the caching behavior we want.
3. The capsule in (1) captures a data asset per subfolder in the results, creates a combined data asset out of it, then deletes the run.
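The three steps above can be sketched as one watchdog cycle. Note that every client call below (`query_sessions`, `create_combined_asset`, `attach_asset`, `run_pipeline`, `capture_asset`, `delete_run`) is a hypothetical placeholder for whatever Code Ocean API wrapper we end up with, not the real API:

```python
def watchdog_cycle(client, query):
    """One pass of the proposed watchdog: bundle sessions, run, capture, clean up.

    `client` is a hypothetical Code Ocean client wrapper; none of these
    method names are the actual Code Ocean API.
    """
    # (1) Query for sessions of interest and bundle them into one combined asset
    sessions = client.query_sessions(query)
    combined = client.create_combined_asset([s.asset_id for s in sessions])

    # (2) Attach the combined asset and trigger the pipeline; Nextflow's cache
    #     should skip jobs already run on data inside the combined asset
    client.attach_asset(pipeline_id="analysis-pipeline", asset_id=combined.id)
    run = client.run_pipeline("analysis-pipeline")
    run.wait()

    # (3) Capture one data asset per result subfolder, combine them,
    #     then delete the run
    result_assets = [client.capture_asset(run.id, sub)
                     for sub in run.result_subfolders()]
    client.create_combined_asset(result_assets)
    client.delete_run(run.id)
```

This is only a shape sketch under those naming assumptions; the real client would also need error handling for failed runs.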
The only hard blocker on this now is that the Nextflow cache files currently expire after 30 days. Jake is looking into whether we can configure a custom cache location for this pipeline.
Another issue is that combined data assets only work with external assets right now. This should not be a problem: I was already planning to start capturing processed outputs to s3://aind-open-data, and we'll get the nwbs from s3 anyway (s3://aind-behavior-data/foraging_nwb_bonsai/). Combined data will also support internal assets in ~6 months.
Now that the RL MLE model fitting library is ready, I'd like to use it as the first MVP of our new analysis pipeline (doc here). I'm going to try what Jake proposed (see below).
Steps:
- Set up the analysis capsule to take nwbs (in the `data` folder) and model specs (using `named parameters` as capsule input)
- Save the outputs (`fitting_results`, json documents, and figures) to s3 (temporarily, until the 30-day cache limitation is fixed)
- Create the master pipeline with only the one analysis in it
- Test the cache behavior
- With the help of David, implement the watchdog capsule that
  > creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits.
  > captures a data asset per subfolder in the results, creates a combined data asset out of it, then deletes the run.

I changed my mind. Expecting that there might be many roadblocks messing with CO, I chose to do some fast prototyping of our initial idea: building our own hashing and job dispatching mechanism.
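The "own hashing and job dispatching" idea can be sketched as follows. This is a minimal prototype shape, not existing code; `compute_job_hash`, `dispatch_jobs`, and the completed-hashes store are all assumptions:

```python
import hashlib
import json


def compute_job_hash(nwb_name: str, analysis_spec: dict) -> str:
    """Deterministic hash of one (nwb, analysis spec) pair.

    json.dumps with sort_keys=True gives a canonical serialization, so the
    same inputs always produce the same hash.
    """
    payload = json.dumps({"nwb_name": nwb_name, "analysis_spec": analysis_spec},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def dispatch_jobs(nwb_names, analysis_spec, completed_hashes):
    """Return only (nwb_name, job_hash) pairs not seen before.

    `completed_hashes` stands in for whatever record of finished jobs we
    keep (e.g. a table on s3); jobs whose hash is already there are skipped,
    which replicates the caching behavior we wanted from Nextflow.
    """
    return [(nwb, h) for nwb in nwb_names
            if (h := compute_job_hash(nwb, analysis_spec)) not in completed_hashes]
```

Any persistent key-value store would work for `completed_hashes`; the hash itself is the cache key.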
Version control

Inputs:
- `nwb_name`
- `analysis_spec` = {`analysis_ver` (loose version) + `analysis_libs` + `analysis_args`}
- `job_hash` = hash of `nwb_name` and `analysis_spec`

Outputs:
- Versioned by `analysis_ver`, which is set manually.

Policy:
- Use `job_hash` to determine whether an analysis is new
- When the analysis changes, bump `analysis_ver` so that a new `analysis_hash` and `job_hash` will be generated
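To illustrate the policy, here is a minimal sketch assuming `job_hash` is a SHA-256 over `nwb_name` plus the full `analysis_spec` (the exact hashing scheme and the spec values are assumptions for illustration):

```python
import hashlib
import json


def job_hash(nwb_name: str, analysis_spec: dict) -> str:
    # Hash the nwb name together with the full analysis spec, so any change
    # to analysis_ver, analysis_libs, or analysis_args yields a new hash.
    blob = json.dumps([nwb_name, analysis_spec], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


# Hypothetical spec values, just to show the version-bump behavior
spec_v1 = {"analysis_ver": "1.0",
           "analysis_libs": ["some_rl_fitting_lib==0.1.0"],
           "analysis_args": {"model": "QLearning"}}
spec_v2 = {**spec_v1, "analysis_ver": "1.1"}  # manual bump of analysis_ver only

h1 = job_hash("session_001.nwb", spec_v1)
h2 = job_hash("session_001.nwb", spec_v2)
# h1 != h2, so bumping analysis_ver forces the job to be re-dispatched
```

The same mechanism covers library upgrades and argument changes, since they are part of `analysis_spec` too.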