Closed MartinekV closed 2 years ago
Update: We don't have to require explicit labels, but we need to give users the opportunity to numericalize labels explicitly. This means we should add an optional parameter to the PyTorch datasets that allows for explicit numericalization.
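A minimal sketch of what such an optional parameter could look like. The class name `FolderLabelDataset` and the parameter name `label_mapping` are hypothetical and are not part of genomic_benchmarks. The class is written without importing torch so that it runs standalone; in practice it would subclass `torch.utils.data.Dataset`.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


class FolderLabelDataset:
    """Map-style dataset: one subfolder per class, one text file per sample.

    Hypothetical illustration only, not the genomic_benchmarks implementation.
    """

    def __init__(self, root: str, label_mapping: Optional[Dict[str, int]] = None):
        class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
        if label_mapping is None:
            # Fallback: implicit labels taken from the sorted folder names.
            label_mapping = {d.name: i for i, d in enumerate(class_dirs)}
        missing = [d.name for d in class_dirs if d.name not in label_mapping]
        if missing:
            raise ValueError(f"no label given for class folders: {missing}")
        self.label_mapping = dict(label_mapping)
        # Each sample is (path to file, integer label of its class folder).
        self.samples: List[Tuple[Path, int]] = [
            (f, self.label_mapping[d.name])
            for d in class_dirs
            for f in sorted(d.iterdir())
        ]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Tuple[str, int]:
        path, label = self.samples[idx]
        return path.read_text(), label
```

With this shape, `FolderLabelDataset(root)` keeps the implicit behaviour, while `FolderLabelDataset(root, {"positive": 0, "negative": 1})` pins the labels regardless of folder order.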
Currently, the numericalization of labels depends on the order of the folders in the filesystem. A possible improvement would be to explicitly define a file containing the label mapping. For example:
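As a hedged sketch of such a mapping file, assume a hypothetical JSON file (here called `label_mapping.json`) that maps class folder names to integer labels. The file name, the JSON format, and the helper `load_label_mapping` are illustrative assumptions, not an existing genomic_benchmarks convention.

```python
import json
from pathlib import Path
from typing import Dict

# Assumed contents of the hypothetical label_mapping.json:
# {"negative": 0, "positive": 1}


def load_label_mapping(mapping_file: str) -> Dict[str, int]:
    """Read a {class_name: integer_label} mapping and check it is one-to-one."""
    mapping = json.loads(Path(mapping_file).read_text())
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("each class must map to a distinct label")
    return {str(name): int(label) for name, label in mapping.items()}
```

Because the labels are read from the file, they stay the same no matter in which order the filesystem lists the class folders.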
Alternatively, the order of classes in each dataset's metadata.yaml file could be used. This would allow the user to explicitly filter the data by label and would make the numericalization more consistent.
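A rough sketch of the metadata-based alternative. It assumes the metadata.yaml file has already been parsed into a dict (for example with a YAML loader) and that it has a `classes` key whose entries appear in file order; that layout, the helper name `mapping_from_metadata`, and the `keep` parameter are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional


def mapping_from_metadata(
    metadata: Dict[str, Any], keep: Optional[Iterable[str]] = None
) -> Dict[str, int]:
    """Number classes by their order under the assumed 'classes' key.

    If `keep` is given, only those classes are retained and the remaining
    ones are renumbered from 0 in their original order.
    """
    # Iterating a dict yields its keys in insertion order (Python 3.7+).
    class_names = list(metadata["classes"])
    if keep is not None:
        wanted = set(keep)
        class_names = [name for name in class_names if name in wanted]
    return {name: index for index, name in enumerate(class_names)}
```

Renumbering after filtering is one possible design choice; keeping the original indices for the retained classes would be an equally valid one.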
The following code snippets rely on the order of folders to label samples:
https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/data_check/info.py#L99-L105
https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/75e3f190104625445f79363522fc8bf16f41590f/src/genomic_benchmarks/dataset_getters/pytorch_datasets.py#L38-L39
TensorFlow notebook demo :arrow_down: