AllenNeuralDynamics / aind-analysis-arch-pipeine-dynamic-foraging

CO pipeline for dynamic foraging (V2.0)
https://codeocean.allenneuraldynamics.org/capsule/3509670/tree
MIT License

Planning: data structure for analysis #1

Open hanhou opened 16 hours ago

hanhou commented 16 hours ago

Per Dynamic Foraging Pipeline Upgrade and AIND Analysis Architecture for Discovery Science, we decided to store first-order analysis results in docDB (MongoDB) and an S3 bucket. In this issue, let's discuss and agree on the detailed data structure we want to use.

Accessing docDB and s3 bucket for analysis

Data structure on docDB

File structure on s3 bucket

Alternatives

The above solutions are based on Han's prototype of the V2.0 pipeline. However, Jake also suggested an alternative plan that relies more on the CO ecosystem but may need further development. Here are notes from David (from AIND Analysis Architecture for Discovery Science):

I discussed our options with Jake from CO today. He proposed a variation on our architecture:

  1. A script/capsule that queries for sessions of interest, creates a combined data asset, attaches it to the pipeline, then triggers the analysis pipeline and waits.
  2. Analysis pipeline runs. Nextflow's cache will not re-run any jobs that have already been run on data within that combined data asset, which is the caching behavior we want.
  3. The capsule in (1) captures a data asset per subfolder in the results, creates a combined data asset out of them, then deletes the run.

The only hard blocker on this right now is that the Nextflow cache files currently expire after 30 days. Jake is looking into whether we can configure a custom location for the cache for this pipeline.

Another issue is that combined data assets only work with external assets right now. I was already planning to start capturing processed outputs to s3://aind-open-data anyway, so soon this will not be an issue. And combined data assets will support internal assets in ~6 months.
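
To make the three steps Jake proposed easier to discuss, here is a rough sketch of the control flow in plain Python. Everything in it is an assumption for illustration: `CachedPipeline`, `orchestrate`, `job_key`, and the session IDs are hypothetical names, the dictionary cache is only a toy stand-in for Nextflow's cache (which keys tasks on more than a session ID), and nothing here calls the real Code Ocean API.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def job_key(session_id: str, analysis_version: str) -> str:
    """Hypothetical cache key for one (session, analysis version) job."""
    return hashlib.sha256(f"{session_id}:{analysis_version}".encode()).hexdigest()


@dataclass
class CachedPipeline:
    """Toy stand-in for the analysis pipeline plus a persistent job cache."""

    analysis_version: str
    cache: dict = field(default_factory=dict)     # job key -> stored result
    executed: list = field(default_factory=list)  # sessions actually computed

    def run(self, combined_asset: list[str]) -> dict:
        results = {}
        for session_id in combined_asset:
            key = job_key(session_id, self.analysis_version)
            if key not in self.cache:
                # Cache miss: this is the only place real analysis would run.
                self.cache[key] = {"session": session_id, "status": "analyzed"}
                self.executed.append(session_id)
            results[session_id] = self.cache[key]
        return results


def orchestrate(query_sessions, pipeline: CachedPipeline) -> dict:
    # Step 1: query sessions of interest and build the combined data asset.
    combined_asset = sorted(query_sessions())
    # Step 2: run the pipeline; cached jobs are skipped.
    run_results = pipeline.run(combined_asset)
    # Step 3: capture one asset per result subfolder, combine them,
    # then drop the run itself.
    per_session_assets = dict(run_results)
    del run_results
    return per_session_assets


pipeline = CachedPipeline(analysis_version="v2.0")
first = orchestrate(lambda: ["session_A", "session_B"], pipeline)
second = orchestrate(lambda: ["session_A", "session_B", "session_C"], pipeline)
print(pipeline.executed)  # ['session_A', 'session_B', 'session_C']
```

The second call only computes `session_C`, which is the caching behavior described in step 2. If the cache were cleared between the two calls (as with the 30-day expiry), all three sessions would be recomputed.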

Related docs

hanhou commented 15 hours ago
  • We should re-evaluate this alternative and see how it would affect our data structure decisions.

@dyf could you comment on this?