Closed aditya0by0 closed 1 month ago
I tried running ChEBIOver50Partial, an implementation of ChEBIOverXPartial, with the following data configuration:
class_path: chebai.preprocessing.datasets.chebi.ChEBIOver50Partial
init_args:
  top_class_id: 70815
And the following run configuration:
fit
--trainer=configs/training/default_trainer.yml
--model=configs/model/electra.yml
--model.train_metrics=configs/metrics/micro-macro-f1.yml
--model.test_metrics=configs/metrics/micro-macro-f1.yml
--model.val_metrics=configs/metrics/micro-macro-f1.yml
--model.pretrained_checkpoint=G:/github-aditya0by0/electra_pretrained.ckpt
--model.load_prefix=generator.
--data=configs/data/chebi50_partial.yml
--model.out_dim=1511
--model.criterion=configs/loss/bce.yml
--data.init_args.batch_size=10
--trainer.logger.init_args.name=chebi50_bce_unweighted
--data.init_args.num_workers=9
--model.pass_loss_kwargs=false
--data.init_args.chebi_version=231
--data.init_args.chebi_version_train=200
--data.init_args.data_limit=100
However, I encountered the following error:
....
....
self._generate_dynamic_splits()
File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\base.py", line 874, in _generate_dynamic_splits
df_train, df_val, df_test = self._get_data_splits()
File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\chebi.py", line 408, in _get_data_splits
train_df_chebi_ver, df_test_chebi_ver = self.get_test_split(
File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\base.py", line 936, in get_test_split
train_indices, test_indices = next(msss.split(labels_list, labels_list))
File "G:\anaconda3\envs\env_chebai\lib\site-packages\sklearn\model_selection\_split.py", line 1843, in split
for train, test in self._iter_indices(X, y, groups):
File "G:\anaconda3\envs\env_chebai\lib\site-packages\iterstrat\ml_stratifiers.py", line 332, in _iter_indices
raise ValueError(
ValueError: Supported target type is: multilabel-indicator. Got 'binary' instead.
This issue occurs because we are using binary labels in the labels_list to indicate whether an instance is a descendant of the top class ID. However, the MultilabelStratifiedShuffleSplit expects the target labels to be in a multilabel format (multilabel-indicator).
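The target-type difference can be reproduced in isolation. The snippet below is a small sketch, assuming scikit-learn is installed; it calls type_of_target from sklearn.utils.multiclass, which I believe is the check behind this error message. The toy label arrays are made up for illustration and are not taken from the ChEBI data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One label column per sample (what a single surviving class produces).
single_label = np.array([[True], [False], [True], [False]])

# Two label columns per sample (the usual multilabel case).
multi_label = np.array([[True, False], [False, True], [True, True], [False, False]])

single_kind = type_of_target(single_label)
multi_kind = type_of_target(multi_label)

print("one column: ", single_kind)
print("two columns:", multi_kind)
```

A 2-D array with a single column is treated like a plain label vector, so it is reported as binary rather than multilabel-indicator.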
As far as I'm aware, this happens when a multilabel classification task is given only one label. Since you have chosen a top_class_id for which only one class in the resulting dataset surpasses the 50-SMILES threshold, the dataset has only one label, and the split function cannot handle that.
The fix would be to check before the split whether the number of labels is >1 and, if not, to split the data with some other function.
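A rough sketch of that check, not the actual change made in the repository: split_indices is a hypothetical helper name, and the choice of scikit-learn's StratifiedShuffleSplit (with ShuffleSplit as a last resort) for the single-label case is an assumption about what "some other function" could be. The default test_size and seed are placeholders as well.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit


def split_indices(labels_list, test_size=0.2, seed=42):
    """Hypothetical helper: return (train_indices, test_indices)."""
    labels = np.asarray(labels_list, dtype=bool)
    if labels.ndim == 1:
        labels = labels.reshape(-1, 1)

    if labels.shape[1] > 1:
        # More than one label: keep the multilabel stratified split.
        # Imported lazily so the single-label path does not need iterstrat.
        from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

        splitter = MultilabelStratifiedShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
        return next(splitter.split(labels, labels))

    # Exactly one label: stratify on that single binary column instead.
    y = labels[:, 0].astype(int)
    _, counts = np.unique(y, return_counts=True)
    if counts.min() < 2:
        # Too few samples in one class to stratify; use a plain random split.
        splitter = ShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    else:
        splitter = StratifiedShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
    return next(splitter.split(y.reshape(-1, 1), y))


# Usage with a made-up single-label dataset: 30 positives, 70 negatives.
toy_labels = [[True]] * 30 + [[False]] * 70
train_idx, test_idx = split_indices(toy_labels)
print(len(train_idx), len(test_idx))
```

Stratifying on the single column keeps the ratio of descendants to non-descendants roughly equal in both splits, which is the closest analogue to what the multilabel splitter does per label.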
I have implemented the suggested change. Please review.
Issue: #50