glopesdev opened this issue 2 years ago
@jkbhagatio @ttngu207 @lochhh
Related to #153 @JaerongA
Generating derived datasets:
- `raw_file` and `processed_file` comparison should be in tests (this is only a consistency test)
- a processor can perform any form of processing (e.g. for tests, for video analysis, for ephys analysis)
- only add a processor function when a bug is noticed
- keep the ingestion routine pristine: it shouldn't have to worry about fixing bugs at the ingestion level
Parallels between Processor and Reader:

Overall workflow pseudocode:
- `load()`
- log error info to the err table (with status of "err") and send to the processor
- `load()` again
- we could generate octagon metadata YAMLs via a processor as a test case for this
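The workflow bullets above can be sketched in Python. This is a hypothetical illustration, not the aeon API: `reader`, `processor`, `log_error`, and the in-memory `err_table` are stand-in names for whatever the pipeline actually uses.

```python
# Minimal sketch of the load → log error → processor → retry-load workflow.
# All names here are illustrative stand-ins, not part of the aeon codebase.

err_table = []  # stand-in for the DataJoint err table


def log_error(path, status, info):
    """Record error info for a chunk (status "err")."""
    err_table.append({"path": path, "status": status, "info": info})


def load_with_processor(path, reader, processor=None):
    """Try to load a data chunk; on failure, log the error, hand the
    chunk to the processor, then retry the load."""
    try:
        return reader(path)
    except Exception as err:
        # log error info to the err table with status "err"
        log_error(path, status="err", info=str(err))
        if processor is None:
            raise
        processor(path)      # processor repairs/regenerates the chunk
        return reader(path)  # retry the load on the repaired data
```

The key design point this mirrors is that ingestion stays pristine: the reader never fixes data itself, and a processor is only attached once a bug is noticed.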
We have discussed for a while having a standard process for generating derived datasets for post-processing purposes, e.g. to pad or fill raw data, add new data streams from offline processing such as spike sorting, DLC tracking, etc.
More recently, it has become clear that this would also be useful for cleaning up or filling in missing data for ingestion purposes (#130), or for evolving data formats which turn out to be problematic (https://github.com/SainsburyWellcomeCentre/aeon_experiments/issues/99), so we would like to potentially bump up the priority of this issue.
Below are outlined a few of the requirements from earlier discussions that this process should satisfy:
1. Raw datasets must be protected from modification.
2. It should be possible to override raw data, or provide fallbacks for missing data.
3. Generation of derived datasets should be version controlled and reproducible.
We have addressed 1. by restricting write access to raw datasets as much as possible, towards an ideal where only acquisition machines are able to write to raw storage.

We have made preliminary attempts to address 2. by allowing the low-level API to receive a list of `root` objects for locating and loading data chunks. Essentially, the `load` function will scan each path in order, return the chunk from the priority path first, and fall back to secondary paths only if the data is missing. The goal was to allow both overriding raw data and providing fallbacks for missing data.

For 3. we can easily version control the derived dataset generation scripts, and a general folder structure for `qc` / `processed` data has been specified, but we are still missing the following:

- how to organize `qc` or `processed` folders in a way that reflects priority (is it enough to have a fixed number of priority levels?)
- how to make `processed` folders available for low-level API use (ideally avoiding having to rely on manual inputs)

I have likely forgotten other details or questions, so feel free to comment below to suggest missing considerations or edits you would like to have made.