The max # of connections to our docDB cluster is 1000, so batching / limiting the parallel connections would be useful!
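For example, a minimal sketch of bounding parallel docDB queries (the connection-string env var and the database/collection names are placeholders, and TLS/CA options are omitted):

```python
import os
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient

MAX_PARALLEL = 20  # keep this well below the 1000-connection cluster limit

# One shared client with a capped pool, so we never hold more than MAX_PARALLEL sockets.
client = MongoClient(
    os.environ["DOCDB_CONNECTION_STRING"],  # placeholder env var
    maxPoolSize=MAX_PARALLEL,
)
collection = client["analysis"]["analysis"]  # placeholder database/collection names


def fetch_one(nwb_name: str):
    """Fetch a single analysis record; each call borrows a pooled connection."""
    return collection.find_one({"nwb_name": nwb_name})


def fetch_many(nwb_names):
    # The executor caps concurrency, so at most MAX_PARALLEL queries are in flight.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        return list(pool.map(fetch_one, nwb_names))
```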
We've deployed the analysis infrastructure to production. The bucket is aind-dynamic-foraging-analysis-prod-o5171v. You can also now use CO capsules directly (no need to fetch session tokens explicitly); just attach the aind-codeocean-user role to the capsule. Please see the example capsule.
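A minimal sketch of reading a result from the prod bucket inside a capsule, assuming boto3's default credential chain picks up the attached role (the object key layout below is a made-up example):

```python
import json

import boto3

BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"

# No explicit session tokens: credentials come from the role attached to the capsule.
s3 = boto3.client("s3")


def load_result(key: str) -> dict:
    """Download a JSON result object from the analysis bucket and parse it."""
    response = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(response["Body"].read())


# Hypothetical key layout, e.g. one folder per job hash:
# result = load_result("<job_hash>/results.json")
```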
[ ] Han: try the test code and document the steps in detail.
[ ] Han: ask Helen to transfer data from my testing db/bucket to prod (a hedged sketch of the transfer in code follows this list):
[ ] Move s3 contents from s3://aind-scratch-data/aind-dynamic-foraging-analysis/ and s3://aind-behavior-data/foraging_nwb_bonsai_processed/v2/ to s3://aind-dynamic-foraging-analysis-prod-o5171v
[ ] Move all DB records from behavior_analysis to analysis
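A hedged sketch of both transfer steps above. In practice `aws s3 sync` may be the simpler tool for the bucket move; this is the boto3/pymongo equivalent, with the connection-string env var, the db/collection layout, and the destination key layout all being assumptions (deletion of the source is intentionally left out):

```python
import os

import boto3
from pymongo import MongoClient

DEST_BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"


def copy_prefix(src_bucket: str, src_prefix: str, dest_prefix: str = "") -> None:
    """Server-side copy of every object under src_prefix into the prod bucket."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dest_key = dest_prefix + obj["Key"][len(src_prefix):]
            s3.copy({"Bucket": src_bucket, "Key": obj["Key"]}, DEST_BUCKET, dest_key)


def copy_db_records() -> None:
    """Copy every record from the testing collection into the prod `analysis` collection."""
    client = MongoClient(os.environ["DOCDB_CONNECTION_STRING"])  # placeholder env var
    src = client["behavior_analysis"]["behavior_analysis"]       # assumed db/collection layout
    dest = client["analysis"]["analysis"]
    batch = []
    for doc in src.find({}):
        batch.append(doc)
        if len(batch) >= 500:  # insert in batches to limit memory and round trips
            dest.insert_many(batch)
            batch = []
    if batch:
        dest.insert_many(batch)


if __name__ == "__main__":
    copy_prefix("aind-scratch-data", "aind-dynamic-foraging-analysis/")
    copy_prefix("aind-behavior-data", "foraging_nwb_bonsai_processed/v2/")
    copy_db_records()
```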
Data structure on docDB
We should agree on
[ ] how we organize collections
[ ] the data model in each collection
[ ] (first-order analysis) we want to store results from each job individually by its job_hash (such as model fitting)
[ ] (second-order analysis) we also want to easily aggregate results across sessions, like our previous df_sessions and df_trials (a sketch of both data models follows this list) @alexpiet
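To make the discussion concrete, here is one possible data model rather than a decision: a single small document per job, keyed by its job_hash, that points to the heavy outputs on S3 and carries only the summary values we want to query, plus an aggregation that flattens records into a df_sessions-like table. All field names are illustrative.

```python
import pandas as pd

# One possible data model (all field names are illustrative, not decided):
# a single small document per analysis job, keyed by job_hash, that points to
# the heavy outputs on S3 and carries only the summary values we want to query.
example_record = {
    "job_hash": "a1b2c3d4",                    # hash of (analysis spec, input session)
    "analysis_name": "model_fitting",
    "analysis_version": "0.1.0",
    "nwb_name": "123456_2024-01-01_10-00-00",  # the session the job ran on
    "s3_location": "s3://aind-dynamic-foraging-analysis-prod-o5171v/a1b2c3d4/",
    "results": {                               # small, queryable summary values only
        "model": "Q-learning",
        "log_likelihood": -123.4,
    },
}


def build_df_sessions(collection, analysis_name: str) -> pd.DataFrame:
    """Second-order use: flatten one record per session into a df_sessions-like table."""
    pipeline = [
        {"$match": {"analysis_name": analysis_name}},
        {
            "$project": {
                "_id": 0,
                "nwb_name": 1,
                "job_hash": 1,
                "log_likelihood": "$results.log_likelihood",
            }
        },
    ]
    # `collection` is a pymongo collection handle, as in the earlier snippets.
    return pd.DataFrame(list(collection.aggregate(pipeline)))
```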
I discussed our options with Jake from CO today. He proposed a variation on our architecture:
1. A script/capsule queries for sessions of interest, creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits (a hedged sketch of the trigger-and-wait part follows this list).
2. The analysis pipeline runs. Nextflow's cache will not re-run any jobs that have already been run on data within that combined data asset, which is exactly the caching behavior we want.
3. The capsule in (1) captures a data asset per subfolder in the results, creates a combined data asset out of them, then deletes the run.
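A hedged sketch of the trigger-and-wait part of step (1), using the Code Ocean REST API through `requests`. The endpoint paths, payload fields, and state names are assumptions to check against the CO API docs; the combined-data-asset creation and result-capture calls are left out because I have not confirmed their exact shape:

```python
import os
import time

import requests

CO_DOMAIN = "https://codeocean.allenneuraldynamics.org"  # assumed deployment URL
AUTH = (os.environ["CO_API_TOKEN"], "")                   # API token as basic-auth username


def run_pipeline_on_assets(pipeline_id: str, asset_ids: list) -> dict:
    """Trigger the analysis pipeline with the given data assets attached, then wait."""
    run = requests.post(
        f"{CO_DOMAIN}/api/v1/computations",
        auth=AUTH,
        json={
            "pipeline_id": pipeline_id,  # assumed field name; check the CO API docs
            "data_assets": [{"id": a, "mount": a} for a in asset_ids],
        },
    ).json()

    # Poll until the computation finishes; Nextflow's cache should skip jobs that
    # already ran on data inside the attached assets.
    while True:
        status = requests.get(
            f"{CO_DOMAIN}/api/v1/computations/{run['id']}", auth=AUTH
        ).json()
        if status.get("state") in ("completed", "failed"):  # assumed state names
            return status
        time.sleep(60)
```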
The only hard blocker on this proposal right now is that the Nextflow cache files currently expire after 30 days. Jake is looking into whether we can configure a custom cache location for this pipeline.
Another issue is that combined data assets only work with external assets right now. I was already planning to start capturing processed outputs to s3://aind-open-data anyway, so this will soon stop being an issue, and combined data assets will support internal assets in ~6 months anyway.
[ ] We should re-evaluate this alternative and see how it would affect our data structure decisions.
Per Dynamic Foraging Pipeline Upgrade and AIND Analysis Architecture for Discovery Science, we decided to store first-order analysis results in docDB (MongoDB) and an s3 bucket. In this issue, let's discuss and agree on the detailed data structure we want to use.
Accessing docDB and s3 bucket for analysis
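A hedged sketch of the intended access pattern, reusing the placeholder env var, db/collection names, and `s3_location` field from the snippets above: query docDB for a session's analysis records, then pull the full result files from the prod bucket.

```python
import os

import boto3
from pymongo import MongoClient

BUCKET = "aind-dynamic-foraging-analysis-prod-o5171v"

analysis = MongoClient(os.environ["DOCDB_CONNECTION_STRING"])["analysis"]["analysis"]
s3 = boto3.client("s3")


def results_for_session(nwb_name: str, local_dir: str = "/tmp") -> list:
    """Find all analysis records for one session and download their S3 outputs."""
    downloaded = []
    for record in analysis.find({"nwb_name": nwb_name}):
        # `s3_location` is the assumed pointer field from the record into the bucket.
        prefix = record["s3_location"].replace(f"s3://{BUCKET}/", "")
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                local_path = os.path.join(local_dir, os.path.basename(obj["Key"]))
                s3.download_file(BUCKET, obj["Key"], local_path)
                downloaded.append(local_path)
    return downloaded
```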
File structure on s3 bucket
Results from each job go under its `<job_hash>` in s3://aind-dynamic-foraging-analysis-prod-o5171v.
Alternatives
The above solutions are based on Han's prototype of the V2.0 pipeline. However, Jake also suggested an alternative plan that relies more on the CO ecosystem but may need further development. Here are notes from David (from AIND Analysis Architecture for Discovery Science):
Related docs
- Dynamic Foraging Pipeline Upgrade
- AIND Analysis Architecture for Discovery Science