Motivation

As we start to add more analyses to the dynamic foraging analysis pipeline, we should improve how computation jobs are triggered, computed, saved, and aggregated.

This is my current design: a watchdog CO capsule runs indefinitely, monitoring the difference between two folders on AWS S3, /nwb and /nwb_processed. Once a new session is uploaded to /nwb, it triggers the CO pipeline.
In other words, the computation is triggered at the granularity of session.
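For concreteness, the watchdog's core diff logic looks roughly like the sketch below. This is a minimal illustration assuming boto3; the bucket name, key layout, and the trigger step are hypothetical placeholders, not the real ones.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder, not the real bucket name


def list_sessions(prefix: str) -> set:
    """Return the set of session names found under an S3 prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    sessions = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            # e.g. "nwb/<session>.nwb" -> "<session>.nwb"
            sessions.add(obj["Key"].removeprefix(prefix).split("/")[0])
    return sessions


# Sessions uploaded to /nwb but not yet present in /nwb_processed
new_sessions = list_sessions("nwb/") - list_sessions("nwb_processed/")
for session in new_sessions:
    # Trigger the CO pipeline for this session (e.g., via the Code Ocean
    # API). Note the V1 limitation: this runs *all* analyses on the session.
    ...
```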
An obvious limitation of this approach is that when you want to add a new analysis or amend an old one, you have to re-run all analyses on all sessions (see how I'm doing it now), because you cannot trigger computations at the granularity of session x analysis.
The problem worsens as both the volume of data and the complexity of each analysis increase, which is happening right now.
My plan
To alleviate this limitation, I can refactor pipeline V1. Briefly, in pipeline V2:
A computation job is defined at the level of {session}_{analysis}.
An analysis is defined by a function handle with its input arguments in the computation capsule.
A df_job_master table stores the job status for all {session}_{analysis} combinations. The status can be pending, success, or error.
The triggering capsule triggers computation for all pending jobs by sending a job list of {session}_{analysis}s to the computation capsule. Parallel computing is also handled here.
Once done, the aggregating and uploading capsule combines results, uploads them to /nwb_processed on S3, and updates the df_job_master table.
To add a new analysis, create a new function in the computation capsule and a new column in the df_job_master table.
To re-run an analysis, reset its status from success or error to pending in the df_job_master table (see the sketch after this list).
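Here is a minimal sketch of that bookkeeping, assuming pandas; the analysis names, function signatures, and session names are hypothetical placeholders, not the real ones.

```python
import pandas as pd


# --- Hypothetical analysis functions (stand-ins for the real ones) ---
def fit_logistic_regression(nwb_path, n_trials_back=15):
    ...


def compute_lick_stats(nwb_path):
    ...


# An analysis = a function handle plus its input arguments, registered in
# the computation capsule. Adding a new analysis means adding one entry
# here and one column to df_job_master.
ANALYSES = {
    "logistic_regression": (fit_logistic_regression, {"n_trials_back": 15}),
    "lick_analysis": (compute_lick_stats, {}),
}

# df_job_master: one row per session, one column per analysis, each cell
# holding "pending", "success", or "error".
sessions = ["session_001", "session_002"]  # placeholder session names
df_job_master = pd.DataFrame("pending", index=sessions, columns=list(ANALYSES))

# The triggering capsule collects all pending {session}_{analysis} jobs and
# sends the list to the computation capsule, which parallelizes them.
job_list = [
    (session, analysis)
    for session, row in df_job_master.iterrows()
    for analysis, status in row.items()
    if status == "pending"
]

# To re-run one analysis on all sessions, reset its column to "pending";
# to re-run everything for one session, reset that row instead.
df_job_master["logistic_regression"] = "pending"
df_job_master.loc["session_001"] = "pending"
```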
The final form
In the long run, to fully adopt the AIND infrastructure, we'll need to:
Migrate the df_job_master table to SessionDB (?)
Migrate all other tables such as df_session_xxx and df_trial_xxx to RedShiftDB (?)
Trigger computations via aind-data-transfer-service, rather than having a watchdog CO capsule running indefinitely.
Computations should be based on NWB data assets generated in CO (by a pipeline like this), rather than nwb folders on S3.
Store {session}_{analysis} results as (immutable) CO data assets with correct metadata, rather than nwb_processed folders on S3.
Individual analyses should be separated into different capsules in a CO pipeline, and the pipeline should still be flexible enough to add or re-run an individual analysis (with proper versioning).
Ideally, the pipeline should be backward compatible with legacy data.
Let the Streamlit app or any other dashboard app directly access data from CO data assets, SessionDB, and RedShiftDB.
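As an illustration of that last point, the dashboard could then read tables straight from the database rather than re-parsing files on S3. This is only a sketch under assumptions: the connection string is hypothetical, df_trial_xxx is the placeholder table name used above, and it relies on Redshift being Postgres-compatible.

```python
import pandas as pd
import sqlalchemy

# Hypothetical endpoint; Redshift speaks the Postgres wire protocol, so a
# standard SQLAlchemy engine with psycopg2 works for read-only queries.
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@my-cluster.redshift.amazonaws.com:5439/dev"
)

# Load one session's trial-level table directly into the Streamlit app
# ("df_trial_xxx" and "session_001" are placeholders).
df_trial = pd.read_sql(
    "SELECT * FROM df_trial_xxx WHERE session = 'session_001'",
    engine,
)
```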
Although most of the core analysis functions will be reusable (by calling various analysis libraries), this overhaul will require significant planning and development with high coding standards. We will need SWEs to help with this migration.
However, I'm not sure whether any SWE has the bandwidth at the moment.
Relationship to other AIND efforts

aind-physio-arch (https://github.com/AllenNeuralDynamics/aind-physio-arch/wiki): eventually, this issue is downstream of and should be triggered by Data Product (the "NWB bottleneck") in the above diagram. To some extent, it encompasses Quality Control and Operational Management.
Currently, this issue is specific to behavior-only analysis, but in the larger picture, we should put all exploratory data analysis for behavior + physiology data under the same umbrella (for example, the FIB analysis).