ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Refactor ChEBIOverXPartial, Add 1-label stratified splits #54

Closed aditya0by0 closed 1 month ago

aditya0by0 commented 2 months ago

I tried running ChEBIOver50Partial, an implementation of ChEBIOverXPartial, with the following data configuration:

class_path: chebai.preprocessing.datasets.chebi.ChEBIOver50Partial
init_args:
  top_class_id: 70815

And the following run configuration:

fit
--trainer=configs/training/default_trainer.yml
--model=configs/model/electra.yml
--model.train_metrics=configs/metrics/micro-macro-f1.yml
--model.test_metrics=configs/metrics/micro-macro-f1.yml
--model.val_metrics=configs/metrics/micro-macro-f1.yml
--model.pretrained_checkpoint=G:/github-aditya0by0/electra_pretrained.ckpt
--model.load_prefix=generator.
--data=configs/data/chebi50_partial.yml
--model.out_dim=1511
--model.criterion=configs/loss/bce.yml
--data.init_args.batch_size=10
--trainer.logger.init_args.name=chebi50_bce_unweighted
--data.init_args.num_workers=9
--model.pass_loss_kwargs=false
--data.init_args.chebi_version=231
--data.init_args.chebi_version_train=200
--data.init_args.data_limit=100

However, I encountered the following error:

 ....
 ....
    self._generate_dynamic_splits()
  File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\base.py", line 874, in _generate_dynamic_splits
    df_train, df_val, df_test = self._get_data_splits()
  File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\chebi.py", line 408, in _get_data_splits
    train_df_chebi_ver, df_test_chebi_ver = self.get_test_split(
  File "G:\github-aditya0by0\python-chebai\chebai\preprocessing\datasets\base.py", line 936, in get_test_split
    train_indices, test_indices = next(msss.split(labels_list, labels_list))
  File "G:\anaconda3\envs\env_chebai\lib\site-packages\sklearn\model_selection\_split.py", line 1843, in split
    for train, test in self._iter_indices(X, y, groups):
  File "G:\anaconda3\envs\env_chebai\lib\site-packages\iterstrat\ml_stratifiers.py", line 332, in _iter_indices
    raise ValueError(
ValueError: Supported target type is: multilabel-indicator. Got 'binary' instead.

This happens because labels_list holds a single binary label per instance, indicating whether the instance is a descendant of the top class ID. MultilabelStratifiedShuffleSplit, however, expects the targets in multilabel format (multilabel-indicator), so a single label column is rejected as 'binary'.
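
To illustrate the distinction, here is a minimal sketch using scikit-learn's `type_of_target`, which I assume is the check behind the error message above. The arrays are made-up toy data, not taken from the ChEBI dataset.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy data (hypothetical): one label column, as in the single-class case
single_label = np.array([[True], [False], [True], [False]])

# Toy data (hypothetical): two label columns, a genuine multilabel target
multi_label = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

single_kind = type_of_target(single_label)  # 'binary'
multi_kind = type_of_target(multi_label)    # 'multilabel-indicator'

print(single_kind)
print(multi_kind)
```

A 2D array with only one column is still classified as 'binary', which matches the type reported in the traceback.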

sfluegel05 commented 2 months ago

As far as I'm aware, this happens if we don't provide more than 1 label for a multilabel classification task. Since you have chosen a top_class_id where the resulting dataset has only 1 class surpassing the 50 SMILES threshold, the dataset only has 1 label. The split function does not like that. The fix would be to check before the split whether the number of labels is >1 and, if not, to split the data with some other function.
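
A rough sketch of what such a check could look like, assuming scikit-learn's `StratifiedShuffleSplit` as the fallback for the 1-label case. The function name `split_indices` and its parameters are hypothetical and not part of chebai; the actual fix in the repository may look different.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def split_indices(labels, test_size=0.2, seed=42):
    """Hypothetical helper: return (train_idx, test_idx), choosing the
    splitter based on how many label columns there are."""
    y = np.asarray(labels)
    if y.ndim == 1:
        y = y.reshape(-1, 1)

    if y.shape[1] > 1:
        # More than 1 label: multilabel stratification works as before.
        # Imported lazily so the 1-label path does not require iterstrat.
        from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

        splitter = MultilabelStratifiedShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
        target = y
    else:
        # Exactly 1 label: fall back to ordinary stratification on a 1D target.
        splitter = StratifiedShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
        target = y.ravel()

    # Only the number of samples matters for X here, so a placeholder is enough.
    placeholder_X = np.zeros(len(target))
    train_idx, test_idx = next(splitter.split(placeholder_X, target))
    return train_idx, test_idx


# Usage with made-up single-label data: 20 positives, 80 negatives
toy_labels = [[True]] * 20 + [[False]] * 80
train_idx, test_idx = split_indices(toy_labels, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Both branches keep the class proportions in the train and test parts, so the 1-label case remains a stratified split.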

aditya0by0 commented 1 month ago

I have implemented the suggested change. Please review.
