The max # of connections to our docDB cluster is 1000, so batching / limiting the parallel connections would be useful!
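For example, a minimal sketch of bounding parallel docDB queries (the connection-string env var and the database/collection names are placeholders, and TLS/CA options are omitted):

```python
import os
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient

MAX_PARALLEL = 20  # keep this well below the 1000-connection cluster limit

# One shared client with a capped pool, so we never hold more than MAX_PARALLEL sockets.
client = MongoClient(
    os.environ["DOCDB_CONNECTION_STRING"],  # placeholder env var
    maxPoolSize=MAX_PARALLEL,
)
collection = client["analysis"]["analysis"]  # placeholder database/collection names


def fetch_one(nwb_name: str):
    """Fetch a single analysis record; each call borrows a pooled connection."""
    return collection.find_one({"nwb_name": nwb_name})


def fetch_many(nwb_names):
    # The executor caps concurrency, so at most MAX_PARALLEL queries are in flight.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(fetch_one, nwb_names))
```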
We've deployed the analysis infrastructure to production. The bucket is aind-dynamic-foraging-analysis-prod-o5171v. You can also now use CO capsules directly (no need to fetch session tokens explicitly); just attach the aind-codeocean-user role to the capsule. Please see the example capsule.
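A minimal sketch of reading a result from the prod bucket inside a capsule, assuming boto3's default credential chain picks up the attached role (the object key layout below is a made-up example):

```python
import json

import boto3

BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"

# No explicit session tokens: credentials come from the role attached to the capsule.
s3 = boto3.client("s3")


def load_result(key: str) -> dict:
    """Download a JSON result object from the analysis bucket and parse it."""
    response = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(response["Body"].read())


# Hypothetical key layout, e.g. one folder per job hash:
# result = load_result("<job_hash>/results.json")
```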
[ ] Han: try the test code and document the steps in detail.
[ ] Han: ask Helen to transfer data from my testing db/bucket to prod (a hedged sketch of the transfer in code follows this list):
[ ] Move s3 contents from s3://aind-scratch-data/aind-dynamic-foraging-analysis/ and s3://aind-behavior-data/foraging_nwb_bonsai_processed/v2/ to s3://aind-dynamic-foraging-analysis-prod-o5171v
[ ] Move all DB records from behavior_analysis to analysis
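A hedged sketch of both transfer steps above. In practice `aws s3 sync` may be the simpler tool for the bucket move; this is the boto3/pymongo equivalent, with the connection-string env var, the db/collection layout, and the destination key layout all being assumptions (deletion of the source is intentionally left out):

```python
import os

import boto3
from pymongo import MongoClient

DEST_BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"


def copy_prefix(src_bucket: str, src_prefix: str, dest_prefix: str = "") -> None:
    """Server-side copy of every object under src_prefix into the prod bucket."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dest_key = dest_prefix + obj["Key"][len(src_prefix):]
            s3.copy({"Bucket": src_bucket, "Key": obj["Key"]}, DEST_BUCKET, dest_key)


def copy_db_records() -> None:
    """Copy every record from the testing collection into the prod `analysis` collection."""
    client = MongoClient(os.environ["DOCDB_CONNECTION_STRING"])  # placeholder env var
    src = client["behavior_analysis"]["behavior_analysis"]       # assumed db/collection layout
    dest = client["analysis"]["analysis"]
    batch = []
    for doc in src.find({}):
        batch.append(doc)
        if len(batch) >= 500:  # insert in batches to limit memory and round trips
            dest.insert_many(batch)
            batch = []
    if batch:
        dest.insert_many(batch)


if __name__ == "__main__":
    copy_prefix("aind-scratch-data", "aind-dynamic-foraging-analysis/")
    copy_prefix("aind-behavior-data", "foraging_nwb_bonsai_processed/v2/")
    copy_db_records()
```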
Data structure on docDB
We should agree on
[ ] how we organize collections
[ ] the data model in each collection
[ ] (first-order analysis) we want to store results from each job individually by its job_hash (such as model fitting)
[ ] (second-order analysis) we also want to easily aggregate results across sessions, like our previous df_sessions and df_trials (a sketch of both data models follows this list) @alexpiet
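To make the discussion concrete, here is one possible data model rather than a decision: a single small document per job, keyed by its job_hash, that points to the heavy outputs on S3 and carries only the summary values we want to query, plus an aggregation that flattens records into a df_sessions-like table. All field names are illustrative.

```python
import pandas as pd

# One possible data model (all field names are illustrative, not decided):
# a single small document per analysis job, keyed by job_hash, that points to
# the heavy outputs on S3 and carries only the summary values we want to query.
example_record = {
    "job_hash": "a1b2c3d4",                    # hash of (analysis spec, input session)
    "analysis_name": "model_fitting",
    "analysis_version": "0.1.0",
    "nwb_name": "123456_2024-01-01_10-00-00",  # the session the job ran on
    "s3_location": "s3://aind-dynamic-foraging-analysis-prod-o5171v/a1b2c3d4/",
    "results": {                               # small, queryable summary values only
        "model": "Q-learning",
        "log_likelihood": -123.4,
    },
}


def build_df_sessions(collection, analysis_name: str) -> pd.DataFrame:
    """Second-order use: flatten one record per session into a df_sessions-like table."""
    pipeline = [
        {"$match": {"analysis_name": analysis_name}},
        {
            "$project": {
                "_id": 0,
                "nwb_name": 1,
                "job_hash": 1,
                "log_likelihood": "$results.log_likelihood",
            }
        },
    ]
    # `collection` is a pymongo collection handle, as in the earlier snippets.
    return pd.DataFrame(list(collection.aggregate(pipeline)))
```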
I discussed our options with Jake from CO today. He proposed a variation on our architecture:
1. A script/capsule queries for sessions of interest, creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits (a hedged sketch of the trigger-and-wait part follows this list).
2. The analysis pipeline runs. Nextflow's cache will not re-run any jobs that have already been run on data within that combined data asset, which is exactly the caching behavior we want.
3. The capsule in (1) captures a data asset per subfolder in the results, creates a combined data asset out of them, then deletes the run.
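A hedged sketch of the trigger-and-wait part of step (1), using the Code Ocean REST API through `requests`. The endpoint paths, payload fields, and state names are assumptions to check against the CO API docs; the combined-data-asset creation and result-capture calls are left out because I have not confirmed their exact shape:

```python
import os
import time

import requests

CO_DOMAIN = "https://codeocean.allenneuraldynamics.org"  # assumed deployment URL
AUTH = (os.environ["CO_API_TOKEN"], "")                   # API token as basic-auth username


def run_pipeline_on_assets(pipeline_id: str, asset_ids: list) -> dict:
    """Trigger the analysis pipeline with the given data assets attached, then wait."""
    run = requests.post(
        f"{CO_DOMAIN}/api/v1/computations",
        auth=AUTH,
        json={
            "pipeline_id": pipeline_id,  # assumed field name; check the CO API docs
            "data_assets": [{"id": a, "mount": a} for a in asset_ids],
        },
    ).json()

    # Poll until the computation finishes; Nextflow's cache should skip jobs that
    # already ran on data inside the attached assets.
    while True:
        status = requests.get(
            f"{CO_DOMAIN}/api/v1/computations/{run['id']}", auth=AUTH
        ).json()
        if status.get("state") in ("completed", "failed"):  # assumed state names
            return status
        time.sleep(60)
```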
The only hard blocker on this proposal right now is that the Nextflow cache files currently expire after 30 days. Jake is looking into whether we can configure a custom cache location for this pipeline.
Another issue is that combined data assets only work with external assets right now. I was already planning to start capturing processed outputs to s3://aind-open-data anyway, so this will soon stop being an issue, and combined data assets will support internal assets in ~6 months anyway.
[ ] We should re-evaluate this alternative and see how it would affect our data structure decisions.
Per Dynamic Foraging Pipeline Upgrade and AIND Analysis Architecture for Discovery Science, we decided to store first-order analysis results in docDB (MongoDB) and an s3 bucket. In this issue, let's discuss and agree on the detailed data structure we want to use.
Accessing docDB and s3 bucket for analysis
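A hedged sketch of the intended access pattern, reusing the placeholder env var, db/collection names, and `s3_location` field from the snippets above: query docDB for a session's analysis records, then pull the full result files from the prod bucket.

```python
import os

import boto3
from pymongo import MongoClient

BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"

analysis = MongoClient(os.environ["DOCDB_CONNECTION_STRING"])["analysis"]["analysis"]
s3 = boto3.client("s3")


def results_for_session(nwb_name: str, local_dir: str = "/tmp") -> list:
    """Find all analysis records for one session and download their S3 outputs."""
    downloaded = []
    for record in analysis.find({"nwb_name": nwb_name}):
        # `s3_location` is the assumed pointer field from the record into the bucket.
        prefix = record["s3_location"].replace(f"s3://{BUCKET}/", "")
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                local_path = os.path.join(local_dir, os.path.basename(obj["Key"]))
                s3.download_file(BUCKET, obj["Key"], local_path)
                downloaded.append(local_path)
    return downloaded
```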
File structure on s3 bucket
Results from each job go under its `<job_hash>` in s3://aind-dynamic-foraging-analysis-prod-o5171v.
Alternatives
The above solutions are based on Han's prototype of the V2.0 pipeline. However, Jake also suggested an alternative plan that relies more on the CO ecosystem but may need further development. Here are notes from David (from AIND Analysis Architecture for Discovery Science):
Related docs
- Dynamic Foraging Pipeline Upgrade
- AIND Analysis Architecture for Discovery Science