BobAubouin / hypotension_pred

Use data-based approach to predict intra-operative hypotension.
GNU General Public License v3.0

reproducible dataset #8

Open AwePhD opened 4 months ago

AwePhD commented 4 months ago

The dataset_download.py script gets data from VitalDB and performs almost no processing or selection; it only roughly filters which cases are picked.

The final dataset we want to produce is a dataset of segment (= window) samples. Each segment carries some metadata such as static data, case_id, and a label (IOH). The point of this issue is to structure the (segment) dataset building so that it is easily reproducible and tweakable for future work. This issue is only a basic proposition: we will already set up the basic elements, but no "hard" interface will be defined since development is still too early.
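To make the segment/label idea concrete, here is a minimal sketch of windowing a mean arterial pressure (MAP) trace. Everything here is an assumption for illustration: the function name `make_segments`, the window/horizon/stride parameters, and the 65 mmHg threshold (a commonly used IOH definition, but not necessarily the one this repo will adopt).

```python
from typing import List, Tuple

def make_segments(
    map_signal: List[float],
    window: int,
    horizon: int,
    stride: int,
    ioh_threshold: float = 65.0,  # assumed threshold, not the repo's definition
) -> List[Tuple[List[float], int]]:
    """Slide a window over a MAP signal; label a segment 1 (IOH) if the
    signal drops below the threshold anywhere in the prediction horizon
    that follows the window, else 0."""
    segments = []
    for start in range(0, len(map_signal) - window - horizon + 1, stride):
        seg = map_signal[start:start + window]
        future = map_signal[start + window:start + window + horizon]
        label = int(min(future) < ioh_threshold)
        segments.append((seg, label))
    return segments
```

With a toy signal of 10 normal samples followed by 5 hypotensive ones, `make_segments(signal, window=5, horizon=5, stride=5)` yields two segments, the second labeled as IOH.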

Warning: this is probably a very temporary proposition. The issue will be kept up to date with any non-breaking changes.

Overview

*(overview diagram: DatasetBuilder → `<Dataset>` → notebooks/experiments)*

The DatasetBuilder is a functional class that builds the actual dataset's data. The <Dataset>* exists to statically type the elements of the dataset; it provides an iterator interface to access the data. Notebooks (or plain Python scripts) are then used to run the experiments.

*: The notation <name> denotes that name is generic.

DatasetBuilder

The dataset builder needs various arguments to build the dataset, noted BuilderConfig here even though it is not an actual Python class. The BuilderConfig holds the options for feature generation, the segment design technique, and so on. At first, the DatasetBuilder should not change much, so it will be used to generate a few variants of the same data. When the research goes further, for example by changing the labeling method, an interface might need to be built.
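One lightweight way to make the BuilderConfig an actual, serializable object is a frozen dataclass. This is only a sketch; the field names below (`window_s`, `horizon_s`, `features`, `label_method`) are hypothetical placeholders, not part of the repo.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BuilderConfig:
    """Hypothetical grouping of DatasetBuilder options (illustrative fields)."""
    window_s: int = 30                        # segment length, in seconds
    horizon_s: int = 300                      # prediction horizon before IOH
    features: tuple = ("mbp", "sbp", "dbp")   # waveform-derived features
    label_method: str = "threshold"           # how the IOH label is produced

config = BuilderConfig(horizon_s=600)
trace = asdict(config)  # plain dict, easy to dump next to the built dataset
```

Being frozen and trivially convertible to a dict makes the config both safe to pass around and easy to store as the deterministic trace mentioned below.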

The use of this class should be set up in scripts/dataset_build/. How a dataset is built should be deterministic, and these scripts should never break once they back published results.

The <data_folder>, the output of the DatasetBuilder, should provide all the information needed for any processing. For each case there are segment files, case<X>_s<Y>.parquet, and a meta file, case<X>_meta.parquet. In addition, as a trace of the DatasetBuilder, a dictionary of the BuilderConfig used is stored.

Dataset

Since the static data and features are stored in files, the content of the dataset is not statically typed on its own. The <Dataset> class is built to statically type the content of a specific <dataset_folder>. The <Dataset> structure might need to be abstracted soon, after a few more experiments. Roughly speaking, it is a handy wrapper around DataFrames.

This <Dataset> provides an iterable interface, via the __getitem__ dunder method, for the rest of the code.

Notebook and experiments

Those are the experiments of our research. They are tied to a <Dataset> and perform some work, which is why they should be located in `scripts//<exp_name/model_name>.ipynb`; they can also be plain Python scripts.

In a (possible) far future, a sort of Runner abstraction might be needed. It would provide the building blocks of the experiments and would only be justified if various kinds of experiments had to be supported, which is currently not the case at all.

Closing remarks

This basic issue aims to give some guidelines for the future structure of the repo. The first aim is simply to decouple the steps of the research work. We hope this structure can be the backbone of open research, coupled with efficiency and extensibility.

Also, note that the structure itself is very dependent on the "segment-based approach". If the formulation of the problem diverges too much from the current state, this issue will become obsolete.

AwePhD commented 4 months ago

Dataset is not built right now because there is no use for it yet. The models are fine manipulating DataFrames and/or NumPy arrays directly.