Closed by fkiraly 1 day ago
Thanks for the feedback @fkiraly!
So, instead of having a dataset.py script for each project, should I have one for all projects in the bg_control/data folder? I was initially planning to have the script with classes and functions available to use in 0_meal_identification/meal_identification/meal_identification/dataset.py (with a similar file for each project), and the data was going to be stored in:
@andytubeee if you're interested.
@andytubeee @Phiruby @Tony911029 please check if there is anything that Franz mentioned here that might also be included in the data work as an enhancement.
Short review of data loaders, with the aim to ensure easy usability by other project members and maintainability.
Referring to the current state of `1.01-cjr-change-point-index-creator.ipynb`, top cell.

Main comments:
- The data loaders should be in `py` files, possibly somewhere in the `data` folder. I would also organize the repository so that all data concerns are separate, possibly also separating processed data from raw data. (More generally, raw data above a certain total size, perhaps 10 MB, should not be in GitHub repos unless they are dedicated data repos, as it clutters the repo.)
- Regarding the data loader `py` file, I would strongly recommend refactoring it. At the moment it is a monolithic end-to-end function with subroutines inside. I would suggest a design that separates (a) switching between file and in-memory from (b) the processing pipeline; the processing is also a loop over files, so I would make it "process a single file". More precisely, I would suggest to split things as follows (see the first sketch after this list for one possible such split).
- I would also recommend adding tests and input validation, e.g., with `pytest` and `pydantic` or similar (see the second sketch below for an illustrative example).
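
For illustration only, here is a minimal sketch of the kind of split meant in the second bullet, assuming pandas DataFrames and hypothetical function names (`load_raw`, `process_single_frame`, `save_processed`, `run_pipeline`); this is not the project's actual code:

```python
from pathlib import Path

import pandas as pd


def load_raw(path: Path) -> pd.DataFrame:
    """I/O concern only: read one raw file into memory."""
    return pd.read_csv(path)


def process_single_frame(df: pd.DataFrame) -> pd.DataFrame:
    """In-memory processing of a single dataset; no file handling here."""
    out = df.copy()
    # ... transformations, e.g. timestamp parsing or change-point index creation ...
    return out


def save_processed(df: pd.DataFrame, path: Path) -> None:
    """I/O concern only: write one processed dataset."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)


def run_pipeline(raw_dir: Path, processed_dir: Path) -> None:
    """Thin loop over files; all real logic lives in process_single_frame."""
    for raw_path in sorted(raw_dir.glob("*.csv")):
        processed = process_single_frame(load_raw(raw_path))
        save_processed(processed, processed_dir / raw_path.name)
```

With this shape, the in-memory path and the file path share the same `process_single_frame`, and tests can exercise the processing logic without touching disk.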
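And a minimal sketch of the `pytest`/`pydantic` idea from the last bullet, with a hypothetical row schema (`GlucoseRecord`) whose field names are purely illustrative, not the project's actual schema:

```python
from datetime import datetime

import pandas as pd
from pydantic import BaseModel, Field


class GlucoseRecord(BaseModel):
    """Expected shape of one processed row; field names are illustrative."""
    timestamp: datetime
    glucose_mg_dl: float = Field(gt=0)


def validate_frame(df: pd.DataFrame) -> None:
    """Raise a pydantic ValidationError if any row violates the schema."""
    for row in df.to_dict(orient="records"):
        GlucoseRecord(**row)


# pytest-style check, e.g. in tests/test_dataset.py
def test_processed_rows_match_schema():
    processed = pd.DataFrame(
        {"timestamp": ["2024-01-01T08:00:00"], "glucose_mg_dl": [110.0]}
    )
    validate_frame(processed)  # a malformed row would raise ValidationError
```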