hanhou opened 1 week ago
From David:
I discussed our options with Jake from CO today. He proposed a variation on our architecture:
1. A script/capsule that queries for sessions of interest, creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits.
2. The analysis pipeline runs. Nextflow's cache will not re-run any jobs that have already been run on data within that combined data asset, which is the caching behavior we want.
3. The capsule in (1) captures a data asset per subfolder in the results, creates a combined data asset out of it, then deletes the run.
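The three steps above can be sketched as one watchdog cycle. Note that every client call below (`query_sessions`, `create_combined_asset`, `attach_asset`, `run_pipeline`, `capture_asset`, `delete_run`) is a hypothetical placeholder for whatever Code Ocean API wrapper we end up with, not the real API:

```python
def watchdog_cycle(client, query):
    """One pass of the proposed watchdog: bundle sessions, run, capture, clean up.

    `client` is a hypothetical Code Ocean client wrapper; none of these
    method names are the actual Code Ocean API.
    """
    # (1) Query for sessions of interest and bundle them into one combined asset
    sessions = client.query_sessions(query)
    combined = client.create_combined_asset([s.asset_id for s in sessions])

    # (2) Attach the combined asset and trigger the pipeline; Nextflow's cache
    #     should skip jobs already run on data inside the combined asset
    client.attach_asset(pipeline_id="analysis-pipeline", asset_id=combined.id)
    run = client.run_pipeline("analysis-pipeline")
    run.wait()

    # (3) Capture one data asset per result subfolder, combine them,
    #     then delete the run
    result_assets = [client.capture_asset(run.id, sub)
                     for sub in run.result_subfolders()]
    client.create_combined_asset(result_assets)
    client.delete_run(run.id)
```

This is only a shape sketch under those naming assumptions; the real client would also need error handling for failed runs.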
The only hard blocker on this now is that the Nextflow cache files currently expire after 30 days. Jake is looking into whether we can configure a custom cache location for this pipeline.
Another issue is that combined data assets only work with external assets right now. This should not be a problem: I was already planning to start capturing processed outputs to s3://aind-open-data, and we'll get the nwbs from s3 anyway (s3://aind-behavior-data/foraging_nwb_bonsai/). Combined data will also support internal assets in ~6 months.
Now that the RL MLE model fitting library is ready, I'd like to use it as the first MVP of our new analysis pipeline (doc here). I'm going to try what Jake proposed (see below).
Steps:
- Set up the analysis capsule to take nwbs (in the `data` folder) and model specs (using `named parameters` as capsule input)
- Save the outputs (`fitting_results`, json documents, and figures) to s3 (temporarily, until the 30-day cache limitation is fixed)
- Create the master pipeline with only the one analysis in it
- Test the cache behavior
- With the help of David, implement the watchdog capsule that
  > creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits.
  > captures a data asset per subfolder in the results, creates a combined data asset out of it, then deletes the run.

I changed my mind. Expecting that there might be many roadblocks messing with CO, I chose to do some fast prototyping of our initial idea: building our own hashing and job dispatching mechanism.
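The "own hashing and job dispatching" idea can be sketched as follows. This is a minimal prototype shape, not existing code; `compute_job_hash`, `dispatch_jobs`, and the completed-hashes store are all assumptions:

```python
import hashlib
import json


def compute_job_hash(nwb_name: str, analysis_spec: dict) -> str:
    """Deterministic hash of one (nwb, analysis spec) pair.

    json.dumps with sort_keys=True gives a canonical serialization, so the
    same inputs always produce the same hash.
    """
    payload = json.dumps({"nwb_name": nwb_name, "analysis_spec": analysis_spec},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def dispatch_jobs(nwb_names, analysis_spec, completed_hashes):
    """Return only (nwb_name, job_hash) pairs not seen before.

    `completed_hashes` stands in for whatever record of finished jobs we
    keep (e.g. a table on s3); jobs whose hash is already there are skipped,
    which replicates the caching behavior we wanted from Nextflow.
    """
    return [(nwb, h) for nwb in nwb_names
            if (h := compute_job_hash(nwb, analysis_spec)) not in completed_hashes]
```

Any persistent key-value store would work for `completed_hashes`; the hash itself is the cache key.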
Version control

Inputs:
- `nwb_name`
- `analysis_spec` = {`analysis_ver` (loose version) + `analysis_libs` + `analysis_args`}
- `job_hash` = hash of `nwb_name` and `analysis_spec`

Outputs:
- Versioned by `analysis_ver`, which is set manually.

Policy:
- Use `job_hash` to determine whether an analysis is new
- When the analysis changes, bump `analysis_ver` so that a new `analysis_hash` and `job_hash` will be generated
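To illustrate the policy, here is a minimal sketch assuming `job_hash` is a SHA-256 over `nwb_name` plus the full `analysis_spec` (the exact hashing scheme and the spec values are assumptions for illustration):

```python
import hashlib
import json


def job_hash(nwb_name: str, analysis_spec: dict) -> str:
    # Hash the nwb name together with the full analysis spec, so any change
    # to analysis_ver, analysis_libs, or analysis_args yields a new hash.
    blob = json.dumps([nwb_name, analysis_spec], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


# Hypothetical spec values, just to show the version-bump behavior
spec_v1 = {"analysis_ver": "1.0",
           "analysis_libs": ["some_rl_fitting_lib==0.1.0"],
           "analysis_args": {"model": "QLearning"}}
spec_v2 = {**spec_v1, "analysis_ver": "1.1"}  # manual bump of analysis_ver only

h1 = job_hash("session_001.nwb", spec_v1)
h2 = job_hash("session_001.nwb", spec_v2)
# h1 != h2, so bumping analysis_ver forces the job to be re-dispatched
```

The same mechanism covers library upgrades and argument changes, since they are part of `analysis_spec` too.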