Motivation

As we start to add more analyses to the dynamic foraging analysis pipeline, we should improve how computation jobs are triggered, computed, saved, and aggregated.

This is my current design: a watchdog CO capsule runs indefinitely, monitoring the difference between two folders on AWS S3, /nwb and /nwb_processed. Once a new session is uploaded to /nwb, it triggers the CO pipeline.
In other words, the computation is triggered at the granularity of session.
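For concreteness, the watchdog's core diff logic looks roughly like the sketch below. This is a minimal illustration assuming boto3; the bucket name, key layout, and the trigger step are hypothetical placeholders, not the real ones.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder, not the real bucket name


def list_sessions(prefix: str) -> set:
    """Return the set of session names found under an S3 prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    sessions = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            # e.g. "nwb/<session>.nwb" -> "<session>.nwb"
            sessions.add(obj["Key"].removeprefix(prefix).split("/")[0])
    return sessions


# Sessions uploaded to /nwb but not yet present in /nwb_processed
new_sessions = list_sessions("nwb/") - list_sessions("nwb_processed/")
for session in new_sessions:
    # Trigger the CO pipeline for this session (e.g., via the Code Ocean
    # API). Note the V1 limitation: this runs *all* analyses on the session.
    ...
```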
An obvious limitation of this approach is that when you want to add a new analysis or amend an old one, you have to re-run all analyses on all sessions (see how I'm doing it now), because you cannot trigger computations at the granularity of session x analysis.
The problem worsens as both the volume of data and the complexity of each analysis increase, which is happening right now.
My plan
To alleviate this limitation, I can refactor pipeline V1. Briefly, in pipeline V2:
A computation job is defined at the level of {session}_{analysis}.
An analysis is defined by a function handle with its input arguments in the computation capsule.
A df_job_master table stores the job status for all {session}_{analysis} combinations. The status can be pending, success, or error.
The triggering capsule triggers computation for all pending jobs by sending a job list of {session}_{analysis}s to the computation capsule. Parallel computing is also handled here.
Once done, the aggregating and uploading capsule combines results, uploads them to /nwb_processed on S3, and updates the df_job_master table.
To add a new analysis, create a new function in the computation capsule and a new column in the df_job_master table.
To re-run an analysis, reset its status from success or error to pending in the df_job_master table (see the sketch after this list).
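Here is a minimal sketch of that bookkeeping, assuming pandas; the analysis names, function signatures, and session names are hypothetical placeholders, not the real ones.

```python
import pandas as pd


# --- Hypothetical analysis functions (stand-ins for the real ones) ---
def fit_logistic_regression(nwb_path, n_trials_back=15):
    ...


def compute_lick_stats(nwb_path):
    ...


# An analysis = a function handle plus its input arguments, registered in
# the computation capsule. Adding a new analysis means adding one entry
# here and one column to df_job_master.
ANALYSES = {
    "logistic_regression": (fit_logistic_regression, {"n_trials_back": 15}),
    "lick_analysis": (compute_lick_stats, {}),
}

# df_job_master: one row per session, one column per analysis, each cell
# holding "pending", "success", or "error".
sessions = ["session_001", "session_002"]  # placeholder session names
df_job_master = pd.DataFrame("pending", index=sessions, columns=list(ANALYSES))

# The triggering capsule collects all pending {session}_{analysis} jobs and
# sends the list to the computation capsule, which parallelizes them.
job_list = [
    (session, analysis)
    for session, row in df_job_master.iterrows()
    for analysis, status in row.items()
    if status == "pending"
]

# To re-run one analysis on all sessions, reset its column to "pending";
# to re-run everything for one session, reset that row instead.
df_job_master["logistic_regression"] = "pending"
df_job_master.loc["session_001"] = "pending"
```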
The final form
In the long run, to fully adopt the AIND infrastructure, we'll need to:
Migrate the df_job_master table to SessionDB (?)
Migrate all other tables such as df_session_xxx and df_trial_xxx to RedShiftDB (?)
Trigger computations via aind-data-transfer-service, rather than having a watchdog CO capsule running indefinitely.
Computations should be based on NWB data assets generated in CO (by a pipeline like this), rather than nwb folders on S3.
Store {session}_{analysis} results as (immutable) CO data assets with correct metadata, rather than nwb_processed folders on S3.
Individual analyses should be separated into different capsules in a CO pipeline, and the pipeline should still be flexible enough to add or re-run an individual analysis (with proper versioning).
Ideally, the pipeline should be backward compatible with legacy data.
Let the Streamlit app or any other dashboard app directly access data from CO data assets, SessionDB, and RedShiftDB.
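As an illustration of that last point, the dashboard could then read tables straight from the database rather than re-parsing files on S3. This is only a sketch under assumptions: the connection string is hypothetical, df_trial_xxx is the placeholder table name used above, and it relies on Redshift being Postgres-compatible.

```python
import pandas as pd
import sqlalchemy

# Hypothetical endpoint; Redshift speaks the Postgres wire protocol, so a
# standard SQLAlchemy engine with psycopg2 works for read-only queries.
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@my-cluster.redshift.amazonaws.com:5439/dev"
)

# Load one session's trial-level table directly into the Streamlit app
# ("df_trial_xxx" and "session_001" are placeholders).
df_trial = pd.read_sql(
    "SELECT * FROM df_trial_xxx WHERE session = 'session_001'",
    engine,
)
```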
Although most of the core analysis functions will be reusable (by calling various analysis libraries), this overhaul will require significant planning and development with high coding standards. We will need SWEs to help with this migration.
However, I'm not sure whether any SWE has the bandwidth at the moment.
Relationship to other AIND efforts

aind-physio-arch (https://github.com/AllenNeuralDynamics/aind-physio-arch/wiki): eventually, this issue is downstream of and should be triggered by Data Product (the "NWB bottleneck") in the above diagram. To some extent, it encompasses Quality Control and Operational Management.
Currently, this issue is specific to behavior-only analysis, but in the larger picture, we should put all exploratory data analysis for behavior + physiology data under the same umbrella (for example, the FIB analysis).