Add explicit numericalization of labels

Currently, the numericalization of labels depends on the order of the folders in the filesystem. One possible improvement would be to define the mapping explicitly in a file, for example:

{
  'positive':0,
  'negative':1,
}

Alternatively, the order of classes in each dataset's metadata.yaml file could be used. An explicit mapping would allow the user to filter the data by label and would make the numericalization more consistent.
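As a rough sketch of the idea, the label ids could be derived from an explicit, ordered list of class names instead of from folder order. The function name `build_label_mapping` and the example class list are hypothetical, and how the list would actually be read from metadata.yaml is left open here.

```python
from typing import Dict, Iterable


def build_label_mapping(class_names: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer ids following the given class order."""
    mapping: Dict[str, int] = {}
    for name in class_names:
        if name in mapping:
            # A repeated name would make the numericalization ambiguous.
            raise ValueError(f"duplicate class name: {name!r}")
        mapping[name] = len(mapping)
    return mapping


# Hypothetical class order, e.g. as it might be listed in a dataset's metadata.
label_to_id = build_label_mapping(["positive", "negative"])
```

With such a mapping, the id of a class no longer depends on how the filesystem happens to list the folders.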

The following are code snippets that rely on the order of folders to label samples.

https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/data_check/info.py#L99-L105

https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/dataset_getters/pytorch_datasets.py#L38-L39

TensorFlow notebook demo :arrow_down:

CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)
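One possible way to make the snippet above independent of folder order is to start from an explicit class list and only check it against the folders on disk. This is a minimal sketch: `ordered_class_names` is a hypothetical helper, and the class names and temporary directory are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def ordered_class_names(split_dir: Path, expected: Sequence[str]) -> List[str]:
    """Return `expected` unchanged after checking it matches the class folders on disk."""
    found = {p.name for p in split_dir.iterdir() if p.is_dir()}
    missing = [name for name in expected if name not in found]
    unexpected = sorted(found - set(expected))
    if missing or unexpected:
        raise ValueError(
            f"missing folders: {missing}, unexpected folders: {unexpected}"
        )
    return list(expected)


# Demo on a throwaway directory tree standing in for SEQ_PATH / 'train'.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for name in ("negative", "positive"):
        (train_dir / name).mkdir(parents=True)
    classes = ordered_class_names(train_dir, ["positive", "negative"])
```

The returned list could then be passed as `class_names=` in the call above, so that label indices follow the explicit order rather than whatever `iterdir()` yields.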

