SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License

Support creation of derived datasets from post-processing of raw data #132

Open glopesdev opened 2 years ago

glopesdev commented 2 years ago

We have been discussing for a while the need for a standard process for generating derived datasets from post-processing, e.g. to pad or fill raw data, or to add new data streams from offline processing such as spike sorting or DLC tracking.

More recently, it has become clear that this would also be useful for cleaning up or filling in missing data for ingestion purposes (#130), or for evolving data formats that turn out to be problematic (https://github.com/SainsburyWellcomeCentre/aeon_experiments/issues/99), so we would like to potentially bump up the priority of this issue.

Outlined below are a few of the requirements from earlier discussions that this process should satisfy:

  1. acquired raw datasets should remain immutable (i.e. no in-place correction of data is allowed)
  2. derived datasets should be interchangeable with raw datasets for ingestion or analysis purposes
  3. it should be possible to track the provenance of each derived dataset (e.g. infer or find the raw dataset and version of processing procedure that generated it)
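As an illustration of requirement 3 only, provenance could be recorded in a small sidecar file written next to each derived dataset. The sketch below is an assumption, not something that exists in the repository: the file name `provenance.json`, the `Provenance` fields, and both helper functions are hypothetical names chosen for the example.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

# Hypothetical file name; not an existing aeon_mecha convention.
PROVENANCE_FILE = "provenance.json"


@dataclass(frozen=True)
class Provenance:
    raw_root: str        # location of the immutable raw dataset
    script: str          # processing script that produced the derived data
    script_version: str  # e.g. a git commit hash or release tag


def write_provenance(derived_root: Path, record: Provenance) -> Path:
    """Write the provenance record into the derived dataset folder."""
    path = Path(derived_root) / PROVENANCE_FILE
    path.write_text(json.dumps(asdict(record), indent=2))
    return path


def read_provenance(derived_root: Path) -> Provenance:
    """Read back the record to find the raw dataset and processing version."""
    path = Path(derived_root) / PROVENANCE_FILE
    return Provenance(**json.loads(path.read_text()))
```

With a record like this, anyone holding a derived dataset could look up which raw dataset and which version of the processing script produced it, without the raw data ever being modified.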

We have addressed 1. by restricting write access to raw datasets as much as possible, working towards the ideal of only acquisition machines being able to write to raw storage.

We have made preliminary attempts to address 2. by allowing the low-level API to receive a list of root paths for locating and loading data chunks. Essentially, the load function scans each path in order, returns the chunk from the highest-priority path that contains it, and falls back to secondary paths only if the data is missing there. The goal was to allow both overriding raw data and providing fallbacks for missing data.
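The lookup order described above can be sketched roughly as follows. This is a minimal illustration and not the actual low-level API: `find_chunk` and its parameters are hypothetical names, and it assumes each chunk is a single file addressed by the same relative path under every root.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_chunk(roots: Sequence[Path], relative_path: str) -> Optional[Path]:
    """Return the first existing copy of a chunk, scanning roots in priority order.

    `roots` is ordered from highest to lowest priority, so a derived root
    listed before the raw root overrides raw data, while a derived root
    listed after it only fills in chunks that are missing from raw.
    """
    for root in roots:
        candidate = Path(root) / relative_path
        if candidate.is_file():
            return candidate
    # The chunk is missing from every root.
    return None
```

Under these assumptions, the ordering of the list alone decides between the two use cases: putting the derived root first overrides raw chunks, and putting it last provides fallbacks for missing ones.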

For 3. we can easily version control the scripts that generate derived datasets, and a general folder structure for qc / processed data has been specified, but we are still missing the following:

I have likely forgotten other details or questions, so feel free to comment below to suggest missing considerations or edits you would like to have made.

glopesdev commented 2 years ago

@jkbhagatio @ttngu207 @lochhh

lochhh commented 2 years ago

Related to #153 @JaerongA

jkbhagatio commented 1 year ago