ML-Bioinfo-CEITEC / genomic_benchmarks

Benchmarks for classification of genomic sequences
Apache License 2.0

Add explicit numericalization of labels #15

Closed MartinekV closed 2 years ago

MartinekV commented 2 years ago

Currently, the numericalization of labels depends on the order of the folders in the filesystem. A possible improvement would be to explicitly define a file with the mapping information. For example:

{
  'positive':0,
  'negative':1,
}

Alternatively, the order of classes in the metadata.yaml file of each dataset could be used. This would allow the user to explicitly filter the data by label and would make the numericalization more consistent.
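
A minimal sketch of what an explicit mapping could look like in practice, assuming the usual layout of one subfolder per class. The names `LABEL_MAP` and `collect_samples` are hypothetical and are not part of genomic_benchmarks; the labels come from the mapping rather than from the folder order.

```python
from pathlib import Path

# Hypothetical explicit mapping; class names and numbers are illustrative.
LABEL_MAP = {"positive": 0, "negative": 1}


def collect_samples(split_dir, label_map):
    """Return (path, label) pairs, labelling by folder name via label_map."""
    split_dir = Path(split_dir)
    samples = []
    for class_name, label in label_map.items():
        class_dir = split_dir / class_name
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing class folder: {class_dir}")
        # Sort files so the sample order does not depend on the filesystem.
        for path in sorted(class_dir.iterdir()):
            samples.append((path, label))
    return samples
```

With this approach the label of a sample is fixed by the mapping, so it stays the same regardless of how the filesystem happens to list the folders.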

The following are code snippets that rely on the order of folders to label samples.

https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/data_check/info.py#L99-L105

https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/dataset_getters/pytorch_datasets.py#L38-L39

TensorFlow notebook demo:

CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)
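
One possible way to remove the dependence on `iterdir()` order in the snippet above is to supply the class order explicitly and only use the folders for validation. This is a sketch; `ordered_class_names` is a hypothetical helper, not an existing function of the package.

```python
from pathlib import Path


def ordered_class_names(split_dir, expected):
    """Return `expected` unchanged after checking it matches the class folders."""
    found = {p.name for p in Path(split_dir).iterdir() if p.is_dir()}
    if found != set(expected):
        raise ValueError(
            f"class folders {sorted(found)} do not match {sorted(expected)}"
        )
    return list(expected)
```

The returned list could then be passed as `class_names`, for example `CLASSES = ordered_class_names(SEQ_PATH / 'train', ['positive', 'negative'])`, so that the label indices follow the order given by the user.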

MartinekV commented 2 years ago

Update: We don't have to require explicit labels, but we need to give users the opportunity to explicitly numericalize labels. This means we should add an optional parameter to the PyTorch datasets, which will allow for explicit numericalization.
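
A rough sketch of how such an optional parameter might behave. To keep it runnable without PyTorch installed, it is written as a plain class that follows the map-style dataset protocol (`__len__` and `__getitem__`); in the package it would presumably subclass `torch.utils.data.Dataset`. The class name `FolderSequenceDataset` and the parameter name `class_to_idx` are assumptions made for illustration, as is the one-text-file-per-sequence layout.

```python
from pathlib import Path


class FolderSequenceDataset:
    """Map-style dataset: one subfolder per class, one text file per sequence."""

    def __init__(self, split_dir, class_to_idx=None):
        split_dir = Path(split_dir)
        if class_to_idx is None:
            # Fallback: sorted folder names give a deterministic default.
            names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
            class_to_idx = {name: i for i, name in enumerate(names)}
        self.class_to_idx = dict(class_to_idx)
        # Only classes present in the mapping are loaded.
        self.samples = [
            (path, label)
            for name, label in self.class_to_idx.items()
            for path in sorted((split_dir / name).iterdir())
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        return path.read_text().strip(), label
```

Omitting `class_to_idx` keeps the labels implicit but reproducible, while passing for example `{'positive': 0, 'negative': 1}` fixes them explicitly. In this sketch, a mapping that lists only some of the classes also filters the data down to those classes.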