Support creation of derived datasets from post-processing of raw data

glopesdev commented 2 years ago

We have discussed for a while having a standard process for generating derived datasets for post-processing purposes, e.g. to pad or fill raw data, add new data streams from offline processing such as spike sorting, DLC tracking, etc.

More recently, it has become clear that this would be useful also for cleaning up or filling missing data for ingestion purposes (#130) or evolving data formats which turn out to be problematic (https://github.com/SainsburyWellcomeCentre/aeon_experiments/issues/99) so we would like to potentially bump up the priority of this issue.

Below are outlined a few of the requirements from earlier discussions that this process should satisfy:

1. acquired raw datasets should remain immutable (i.e. no in-place correction of data is allowed)
2. derived datasets should be interchangeable with raw datasets for ingestion or analysis purposes
3. it should be possible to track the provenance of each derived dataset (e.g. infer or find the raw dataset and version of processing procedure that generated it)

We have addressed 1. by restricting write access to raw datasets as much as possible, working towards an ideal where only acquisition machines can write to raw storage.

We have made preliminary attempts to address 2. by allowing the low-level API to receive a list of root objects for locating and loading data chunks. Essentially, the load function scans each path in order, returns the chunk found on the highest-priority path, and falls back to secondary paths only if the data is missing there. The goal was to allow both overriding raw data and providing fallbacks for missing data.
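To make the lookup order concrete, here is a minimal sketch of that priority scan. It is not the actual low-level API: the function name `resolve_chunk` and its arguments are hypothetical, and it assumes each root is a plain directory and each chunk is addressed by a path relative to the root.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_chunk(roots: Sequence[Path], relative_path: str) -> Optional[Path]:
    """Return the first existing copy of a chunk, scanning roots in priority order.

    Hypothetical helper: `roots` is ordered from highest to lowest priority,
    e.g. [processed_root, raw_root] to let processed data override raw data,
    or [raw_root, processed_root] to use processed data only as a fallback.
    """
    for root in roots:
        candidate = Path(root) / relative_path
        if candidate.is_file():
            return candidate
    # The chunk is missing from every root.
    return None
```

With this shape, the same function covers both use cases mentioned above: the order of `roots` alone decides whether derived data overrides the raw chunk or only fills in when the raw chunk is absent.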

For 3., we can easily version-control the scripts that generate derived datasets, and a general folder structure for qc / processed data has been specified, but we are still missing the following:

- a convention for naming folders inside the qc or processed folders in a way that reflects priority (is it enough to have a fixed number of priority levels?)
- helper modules that make it easier to find out which processed folders are available for low-level API use (ideally without relying on manual inputs)
- a simplified API and working examples for derived dataset construction (e.g. to make it easy to do tasks such as chunk data interpolation, correction for video / position data mismatch, correction for missing session data boundaries)
- hooks into the DJ ingestion pipeline and a process for seamlessly integrating derived data, e.g. for cleanup purposes
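As one possible shape for the first two items, the sketch below discovers processed folders automatically and orders them by priority. The naming convention is purely an assumption for illustration (a numeric prefix such as `0_interpolated`, where a lower number means higher priority); no such convention has been agreed, and `find_processed_roots` is a hypothetical name.

```python
import re
from pathlib import Path
from typing import List

# Assumed convention (not agreed anywhere): "<number>_<name>", lower number = higher priority.
_PRIORITY_PREFIX = re.compile(r"^(\d+)_")


def find_processed_roots(processed_dir: Path) -> List[Path]:
    """List subfolders carrying a numeric priority prefix, highest priority first."""
    ranked = []
    for child in Path(processed_dir).iterdir():
        match = _PRIORITY_PREFIX.match(child.name)
        if child.is_dir() and match:
            ranked.append((int(match.group(1)), child.name, child))
    # Sort numerically by prefix, then by name to keep the order deterministic.
    return [path for _, _, path in sorted(ranked)]
```

The returned list could be passed straight to the low-level API as the ordered list of roots, which would avoid manual inputs. An open numeric prefix also sidesteps the question of a fixed number of priority levels, at the cost of a less self-describing folder name.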

I have likely forgotten other details or questions, so feel free to comment below to suggest missing considerations or edits you would like to have made.

glopesdev commented 2 years ago

@jkbhagatio @ttngu207 @lochhh

lochhh commented 2 years ago

Related to #153 @JaerongA

jkbhagatio commented 1 year ago

generating derived datasets:
- reader (raw_file -> df) -> processor/spa(df -> df) -> writer (df -> processed_file)
- raw_file and processed_file comparison should be in tests (this is only a test for consistency)
- processor can be doing any form of processing (e.g. for tests, for video analysis, for ephys analysis)
- only add a processor function when a bug is noticed
- keep ingestion routine pristine - shouldn't have to worry about fixing bugs at ingestion-level
Parallels between Processor and Reader:
- one superclass, one processor per reader (per stream)
overall workflow pseudocode:
- load file:
- if error:
  - load() logs error info to err table (with status of "err") and sends to processor
  - run through processor, send back to load()
  - store processor calls
- if no error:
  - update status of error to "fixed"
  - log which processor was used to err table
could generate octagon metadata yamls via processor as a test case for this
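
A rough, runnable sketch of the reader -> processor -> writer chain and the error workflow above, under several assumptions: rows are held as a list of dicts standing in for a DataFrame (to keep the example standard-library only), the "error" is a failed consistency check after reading, the err table is a plain list, and all function names (`read_rows`, `validate`, `fill_missing`, `write_rows`, `load`) are hypothetical rather than existing aeon_mecha API.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]  # stand-in for a DataFrame in this sketch


def read_rows(path: Path) -> Rows:
    """Reader: file -> rows."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def write_rows(rows: Rows, path: Path) -> None:
    """Writer: rows -> processed file (the raw file is never written to)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def validate(rows: Rows) -> None:
    """Placeholder consistency check: raise if any field is empty."""
    for i, row in enumerate(rows):
        for key, value in row.items():
            if value == "":
                raise ValueError(f"row {i}: missing value for {key!r}")


def fill_missing(rows: Rows) -> Rows:
    """Example processor (rows -> rows): forward-fill empty fields."""
    fixed: Rows = []
    for row in rows:
        previous = fixed[-1] if fixed else {}
        fixed.append({k: (v if v != "" else previous.get(k, "")) for k, v in row.items()})
    return fixed


def load(raw_path: Path, processed_path: Path,
         processor: Callable[[Rows], Rows], err_table: List[dict]) -> Rows:
    """Load a file; on error, log it, run the processor, and load the processed copy."""
    rows = read_rows(raw_path)
    try:
        validate(rows)
        return rows  # no error: nothing to fix, nothing to log
    except ValueError as error:
        entry = {"file": str(raw_path), "status": "err",
                 "error": str(error), "processor": None}
        err_table.append(entry)
    # Send the data through the processor and write a derived file.
    write_rows(processor(rows), processed_path)
    rows = read_rows(processed_path)
    validate(rows)  # still raises if the processor did not fix the problem
    entry["status"] = "fixed"
    entry["processor"] = processor.__name__
    return rows
```

The raw file is only ever read, the fix lives in a separate processed file, and the err table records which processor produced it, so the ingestion routine itself stays free of bug-specific patches.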

SainsburyWellcomeCentre / aeon_mecha

Support creation of derived datasets from post-processing of raw data #132