ersilia-os / zaira-chem

Automated QSAR based on multiple small molecule descriptors
GNU General Public License v3.0

Classifier identified as regressor #39

Closed GemmaTuron closed 5 months ago

GemmaTuron commented 7 months ago

Describe the bug: If a binary classification file is passed, with the activity column already binarised, and the command zairachem fit -i input.csv -m model_folder is run, ZairaChem interprets the task as a regression rather than a binary classification, as indicated by the data/parameters.json file:

{
    "time_budget": 120,
    "task": "regression",
    "presets": "standard",
    "augment": false,
    "assay_id": "ASSAY",
    "assay_type": null,
    "credibility_range": {
        "min": null,
        "max": null
    },
...

Desktop: Ubuntu 22.04 LTS

Additional context: This can be confusing, so we need to add clear instructions.

HellenNamulinda commented 7 months ago

Hello @GemmaTuron, did you use zairachem example --file_name input.csv to generate the input file? If so, that command generates data for a regression task by default, and we need to correct that in the README.md.

smiles,activity
COc1cc(CCC(C)=O)ccc1O,1.2945919608148597
COc1ccc(CCN)cc1OC,0.669207485449748
C(CN1CCOCC1)Oc1ccc(cc1)-c1cnc2c(cnn2c1)-c1ccccc1,1.1980759270462484
...

zairachem example --classification --file_name input.csv will generate an input file with classification data.

Otherwise, on my end, the task is correctly identified if the input file contains classification data.

{
    "time_budget": 120,
    "task": "classification",
    "presets": "standard",
    "augment": false,
    "assay_id": "ASSAY",
    "assay_type": null,
    "credibility_range": {
        "min": null,
        "max": null
    },
...

GemmaTuron commented 7 months ago

Hi @HellenNamulinda No, I am using a file I made myself. What command did you run to fit the classified data you got with the example command?

miquelduranfrigola commented 7 months ago

@GemmaTuron what is the column name of your file?

GemmaTuron commented 7 months ago

bin

miquelduranfrigola commented 7 months ago

Thanks. This is surprising and is probably a bug.

HellenNamulinda commented 7 months ago

> Hi @HellenNamulinda No, I am using a file I made myself. What command did you run to fit the classified data you got with the example command?

@GemmaTuron I used zairachem fit -i train.csv -m model, because I first ran zairachem split -i input.csv to get the train and test sets.

HellenNamulinda commented 7 months ago

Hi @GemmaTuron and @miquelduranfrigola, this is my observation. If the column name isn't activity (I tried changing it to another name), the split command will fail:

File "/home/hellenah/zaira-chem/zairachem/cli/commands/split.py", line 48, in check_dataset_minimum_size
    fold_num_positives = sum(df[df.fold == fold_id].activity)
  File "/home/hellenah/anaconda3/envs/zairachem/lib/python3.10/site-packages/pandas/core/generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'activity'
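As a possible workaround for the split failure above, the label column can be renamed to activity before running zairachem split. The following is a minimal sketch using only the Python standard library; rename_column is a hypothetical helper, not part of ZairaChem, and the file names in the usage note are placeholders.

```python
import csv


def rename_column(src_path, dst_path, old_name, new_name="activity"):
    """Copy a CSV file, renaming one header column (hypothetical helper)."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        if old_name not in header:
            raise ValueError(f"column {old_name!r} not found in {header}")
        # Replace only the matching header cell; data rows are copied unchanged.
        header = [new_name if name == old_name else name for name in header]
        rows = list(reader)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return header
```

For example, rename_column("input.csv", "input_activity.csv", "bin") would write a copy whose label column is named activity, which could then be passed to the split command.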

But for the fit command, if the cut-off value isn't specified, the first task assigned in the data/parameters.json file will be regression (regardless of the column name). If you check that file immediately, you will see regression as the task.

During the setup step, where data preparation is performed, this file gets updated. After all the checks and standardization (once "Descriptor calculation and LSH folding done" is logged), the right task is assigned and parameters.json is updated.

If the cut-off is specified (say, zairachem fit -i train.csv -c 0.1 -d low -m model), the parameters.json file will have classification as the default task before any checks on the data are performed. Otherwise, it is regression, which gets updated to classification by the end of the setup step.

So, before the Describe step (calculating the different descriptors), the correct task will be seen in the parameters.json file. See cli/commands/fit.py.
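To check which task is recorded once the setup step has finished, the "task" field can be read back from the parameters file. This is a minimal sketch; read_task is a hypothetical helper, and it assumes the file sits at data/parameters.json inside the model folder passed with -m.

```python
import json
import os


def read_task(model_dir):
    """Return the 'task' value recorded in the model's parameters file."""
    # Assumption: the file lives at <model_dir>/data/parameters.json.
    path = os.path.join(model_dir, "data", "parameters.json")
    with open(path) as f:
        params = json.load(f)
    # Returns None if the key is absent rather than raising.
    return params.get("task")
```

Calling read_task("model") right after the fit command starts and again after setup completes would show whether the value changes from regression to classification as described above.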

miquelduranfrigola commented 7 months ago

Thanks @HellenNamulinda

GemmaTuron commented 5 months ago

Hi @miquelduranfrigola

This issue persists, and I have a dataset only with classification data, which I cannot use as I get stuck while ZairaChem tries to do a regression:

Traceback (most recent call last):
  File "/home/gturon/anaconda3/envs/zairachem2/bin/zairachem", line 33, in <module>
    sys.exit(load_entry_point('zairachem', 'console_scripts', 'zairachem')())
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/cli/commands/fit.py", line 124, in fit
    s.setup()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 233, in setup
    self._tasks()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 175, in _tasks
    SingleTasks(os.path.join(self.output_dir, DATA_SUBFOLDER)).run()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 401, in run
    reg = reg_tasks.as_dict()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 116, in as_dict
    res["reg_raw_skip"] = self.raw(smoothen=True)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 82, in raw
    self._raw = self.smoothen(raw)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 71, in smoothen
    return SmoothenY(self.smiles_list, raw).run()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 88, in run
    boundaries = self.get_boundaries(y[idxs], repeats, lb, ub)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 69, in get_boundaries
    boundaries[r] = t
UnboundLocalError: local variable 't' referenced before assignment
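For readers unfamiliar with this error: in Python, an UnboundLocalError is raised when a local variable is read on a code path where it was never assigned, typically because the assignment sits inside a loop or condition that never ran. The sketch below is a generic illustration of that pattern and is not the ZairaChem get_boundaries code, which is not shown in this thread; both function names are made up.

```python
def last_below(values, limit):
    """Return the last value below limit (buggy: t may never be bound)."""
    for v in values:
        if v < limit:
            t = v  # only assigned when at least one value passes the test
    return t  # raises UnboundLocalError if the branch above never ran


def last_below_safe(values, limit):
    """Same logic, but t has a default so the read can never fail."""
    t = None
    for v in values:
        if v < limit:
            t = v
    return t
```

Note that comparisons against NaN always evaluate to False in Python, so an input such as [float("nan")] would also leave t unassigned in the first version.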

@HellenNamulinda what do you mean by this? "During the setup step, where data preparation is performed, this file gets updated. After all the checks and standardization (once Descriptor calculation and LSH folding done is logged), the right task is assigned and parameters.json is updated."

Did you successfully pass classification data (already binarised) and have ZairaChem train a model?

GemmaTuron commented 5 months ago

Hmm, I've been running tests and found a NaN that might be making the _is_a_simple_classification function fail. I think there are enough automated tests, provided the user does not have an unexpected value in the dataset, so we can close this issue.
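A quick pre-check of the label column can catch values like the NaN mentioned above before running zairachem fit. The following is a minimal sketch using only the standard library; find_unexpected_labels is a hypothetical helper and does not reproduce the logic of _is_a_simple_classification, and it assumes binary labels are encoded as 0 and 1.

```python
import csv
import math


def find_unexpected_labels(path, column, allowed=(0.0, 1.0)):
    """Return (row_number, raw_value) pairs that are not clean binary labels."""
    problems = []
    with open(path, newline="") as f:
        # Row numbers count the header as row 1, so data starts at row 2.
        for i, row in enumerate(csv.DictReader(f), start=2):
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                # Empty cells and non-numeric text end up here.
                problems.append((i, raw))
                continue
            if math.isnan(value) or value not in allowed:
                problems.append((i, raw))
    return problems
```

An empty list from find_unexpected_labels("input.csv", "bin") would suggest the column contains only 0 and 1; any returned rows point at the values worth inspecting.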

HellenNamulinda commented 5 months ago

It works just fine on my end. But we can look at it again if data issues cause the pipeline to break.