AllenNeuralDynamics / aind-foraging-behavior-bonsai-trigger-pipeline


Dynamic foraging analysis pipeline V2? #4

Open · hanhou opened this issue 6 months ago

hanhou commented 6 months ago

Motivation

As we start to add more analyses to the dynamic foraging analysis pipeline, we should improve how the computation jobs are triggered, computed, saved, and aggregated.

This is my current design:

[image: diagram of the current (V1) pipeline design]

where a watchdog Code Ocean (CO) capsule runs indefinitely, monitoring the differences between two folders on AWS S3, /nwb and /nwb_processed. Once a new session is uploaded to /nwb, it triggers the CO pipeline.

In other words, computation is triggered at the granularity of a session.
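For concreteness, the watchdog's check boils down to a set difference between the two S3 prefixes. Here is a minimal sketch assuming boto3; the bucket name, the session-naming convention, and the `trigger_co_pipeline` hook are illustrative placeholders, not the actual capsule code:

```python
import boto3

BUCKET = "aind-behavior-data"  # hypothetical bucket name
s3 = boto3.client("s3")


def list_sessions(prefix: str) -> set:
    """Return the set of session names found under an S3 prefix."""
    sessions = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            # e.g. "nwb/123456_2024-01-01.nwb" -> "123456_2024-01-01"
            name = obj["Key"][len(prefix):].split("/")[0]
            sessions.add(name.removesuffix(".nwb"))
    return sessions


def trigger_co_pipeline(session: str) -> None:
    """Hypothetical hook that kicks off the CO pipeline for one session."""
    print(f"Triggering CO pipeline for {session}")


# Sessions uploaded to /nwb but not yet present in /nwb_processed
new_sessions = list_sessions("nwb/") - list_sessions("nwb_processed/")
for session in sorted(new_sessions):
    trigger_co_pipeline(session)
```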

An obvious limitation of this approach is that when you want to add a new analysis or amend an old one, you have to re-run all analyses on all sessions (see how I'm doing it now), because you cannot trigger computations at the granularity of session × analysis.

The problem worsens as both the volume of data and the complexity of each analysis increase, which is happening right now.

My plan

To alleviate this limitation, I can refactor pipeline V1. Briefly, in pipeline V2 (see the sketch after this list):

  1. A computation job is defined at the level of {session}_{analysis}.
  2. An analysis is defined by a function handle with input arguments in the computation capsule.
  3. A df_job_master table stores the job status for all {session}_{analysis} combinations. The status can be pending, success, or error.
  4. The triggering capsule triggers computation for all pending jobs by sending a list of {session}_{analysis} jobs to the computation capsule. Parallel computing is also handled here.
  5. Once done, the aggregating and uploading capsule combines the results, uploads them to /nwb_processed on S3, and updates the df_job_master table.
  6. To add a new analysis, create a new function in the computation capsule and a new column in the df_job_master table.
  7. To re-run an analysis, reset its status from success or error back to pending in the df_job_master table.
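Here is a minimal sketch of this bookkeeping; the numbers in the comments refer to the items above. The analysis functions, session IDs, and helper names are illustrative assumptions, and in practice df_job_master would live in a shared database rather than in memory:

```python
import pandas as pd


def run_logistic_regression(session: str) -> None:
    """Placeholder for a real analysis function in the computation capsule."""


def run_lick_analysis(session: str) -> None:
    """Placeholder for another analysis function."""


# (2) Each analysis is just a function handle registered in the capsule.
ANALYSES = {
    "logistic_regression": run_logistic_regression,
    "lick_analysis": run_lick_analysis,
}

# (3) df_job_master: one row per session, one column per analysis;
#     each cell holds one of "pending", "success", or "error".
df_job_master = pd.DataFrame(
    "pending",
    index=["session_001", "session_002"],  # illustrative session IDs
    columns=list(ANALYSES),
)


def get_pending_jobs(df: pd.DataFrame):
    """(4) Collect all pending {session}_{analysis} jobs for the trigger capsule."""
    return [
        (session, analysis)
        for session in df.index
        for analysis in df.columns
        if df.loc[session, analysis] == "pending"
    ]


def run_job(session: str, analysis: str) -> str:
    """(1) Run a single {session}_{analysis} job and report its status."""
    try:
        ANALYSES[analysis](session)
        return "success"
    except Exception:
        return "error"


# (5) After computation, statuses are written back to the table.
for session, analysis in get_pending_jobs(df_job_master):
    df_job_master.loc[session, analysis] = run_job(session, analysis)

# (6) Adding a new analysis = a new function handle plus a new column:
#     df_job_master["new_analysis"] = "pending"
# (7) Re-running an analysis = resetting its statuses back to "pending".
```

With this layout, re-running a single analysis across all sessions becomes a one-line status reset on its column, rather than a full recompute of everything.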

[image: diagram of the pipeline V2 design]

The final form

In the long run, we'll need an overhaul to fully adopt the AIND infrastructure.

Although most of the core analysis functions will be reusable (by calling various analysis libraries), this overhaul will require significant planning and development with high coding standards. We will need SWEs to help with this migration.

However, I'm not sure whether any SWE has the bandwidth at the moment.

Relationship to other AIND efforts

hanhou commented 3 months ago

Using AIND database

hanhou commented 3 months ago

[image attachment]

hanhou commented 2 months ago

[image attachment]

hanhou commented 2 months ago

Started a doc here.