ersilia-os / zaira-chem

Automated QSAR based on multiple small molecule descriptors
GNU General Public License v3.0

Classification predictor fails when passing regression results #10

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 2 years ago

Describe the bug
When using a ZairaChem classification model, if the prediction dataset contains regression (continuous) values, ZairaChem tries to use a regressor model and crashes. I think this only happens when the model was trained on already-binarized data without specifying a cutoff. In that case, ZairaChem has no threshold it can use to convert the regression values into classes. I still need to confirm that part, though.

07:56:09 | DEBUG    | There is continuous data
07:56:09 | DEBUG    | Data is not simply a binary classification
Traceback (most recent call last):
  File "/home/gturon/anaconda3/envs/zairachem/bin/zairachem", line 33, in <module>
    sys.exit(load_entry_point('zairachem', 'console_scripts', 'zairachem')())
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/cli/commands/predict.py", line 44, in predict
    s.setup()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/prediction.py", line 162, in setup
    self._tasks()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/prediction.py", line 105, in _tasks
    os.path.join(self.output_dir, DATA_SUBFOLDER)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 437, in run
    reg = reg_tasks.as_dict()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 152, in as_dict
    res["reg_pwr_skip"] = self.pwr(raw)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 132, in pwr
    os.path.join(self._load_path, DATA_SUBFOLDER, "pwr_transformer.joblib")
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 650, in load
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/gturon/Desktop/cyps/cyp2d6_model/data/pwr_transformer.joblib'

To Reproduce
Steps to reproduce the behavior:

  1. Train a classifier using ZairaChem (pass already-binarized data)
  2. Use the trained classifier to predict a batch of molecules with associated regression data
  3. See error

Expected behavior
ZairaChem ignores the real results column and makes the predictions anyway. If it has a threshold, it can try to convert the regression values to binary labels and use those to produce the performance reports.
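The expected behavior could be sketched roughly as follows. This is only an illustration, not ZairaChem code: `to_binary_labels`, `cutoff`, and `active_if_higher` are hypothetical names, and the choice of `>=` for the threshold comparison is an assumption. Returning `None` stands for "no usable ground truth, so predict without evaluating".

```python
from typing import List, Optional, Sequence


def to_binary_labels(
    values: Sequence[float],
    cutoff: Optional[float] = None,
    active_if_higher: bool = True,
) -> Optional[List[int]]:
    """Return 0/1 labels for evaluation, or None if they cannot be derived."""
    # Already-binarized data can be used directly, with or without a cutoff.
    if all(v in (0, 1) for v in values):
        return [int(v) for v in values]
    # Continuous data and no stored cutoff: tell the caller to skip the
    # performance reports and only produce predictions.
    if cutoff is None:
        return None
    # Continuous data with a cutoff: binarize in the assumed direction.
    if active_if_higher:
        return [int(v >= cutoff) for v in values]
    return [int(v <= cutoff) for v in values]


labels = to_binary_labels([0.2, 5.0, 7.5], cutoff=5.0)
print(labels)
```

A caller would then run the classifier in every case and build the performance reports only when the returned labels are not `None`.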

GemmaTuron commented 2 years ago

More cases that we can improve. When the prediction input contains both binary and regression values, ZairaChem fails with the following error:

Traceback (most recent call last):
  File "/home/gturon/anaconda3/envs/zairachem/bin/zairachem", line 33, in <module>
    sys.exit(load_entry_point('zairachem', 'console_scripts', 'zairachem')())
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gturon/anaconda3/envs/zairachem/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/cli/commands/predict.py", line 44, in predict
    s.setup()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/prediction.py", line 158, in setup
    self._normalize_input()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/prediction.py", line 83, in _normalize_input
    f.process()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/files.py", line 363, in process
    df = self.normalize_dataframe()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/files.py", line 310, in normalize_dataframe
    resolved_columns = self.resolve_columns()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/schema.py", line 200, in resolve_columns
    ), "More than one values column found! {0}".format(values_column)
AssertionError: More than one values column found! ['exp', 'bin']

We should make the accepted input format clear in the docs.

GemmaTuron commented 2 years ago

For a classification model trained on binary data directly (no cutoff specified), at prediction time the input must be either:

For a classification model trained passing regression data and a specified cutoff, at prediction time you can pass:

Of course, passing the real results enables evaluation of the outputs.

miquelduranfrigola commented 1 year ago

Thanks @GemmaTuron - the issue is now solved. If a bin column is available, this is the preferred one.
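For readers who hit the "More than one values column found!" assertion, the preference described above could look something like the sketch below. It is not the code from the zaira-chem repository: `pick_values_column` and its `preferred` parameter are hypothetical names, and the column names are taken from the error message in this thread.

```python
from typing import List, Optional


def pick_values_column(candidates: List[str], preferred: str = "bin") -> Optional[str]:
    """Choose one values column from the candidates found in the input file."""
    # No values column: prediction can still run, just without evaluation.
    if not candidates:
        return None
    # A single candidate is unambiguous.
    if len(candidates) == 1:
        return candidates[0]
    # Several candidates: prefer the binary column when it is present.
    if preferred in candidates:
        return preferred
    # Still ambiguous: fail with the same message seen in the traceback.
    raise ValueError("More than one values column found! {0}".format(candidates))


print(pick_values_column(["exp", "bin"]))
```

With this rule, an input holding both `exp` and `bin` columns resolves to `bin` instead of raising.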