Closed GemmaTuron closed 5 months ago
Hello @GemmaTuron,
Did you use zairachem example --file_name input.csv to generate the input file?
If yes, by default, that command generates data for a regression task, and we need to correct that in the README.md.
smiles,activity
COc1cc(CCC(C)=O)ccc1O,1.2945919608148597
COc1ccc(CCN)cc1OC,0.669207485449748
C(CN1CCOCC1)Oc1ccc(cc1)-c1cnc2c(cnn2c1)-c1ccccc1,1.1980759270462484
...
Running zairachem example --classification --file_name input.csv will generate an input file with classification data.
Otherwise, on my end, the task is correctly identified if the input file contains classification data.
{
"time_budget": 120,
"task": "classification",
"presets": "standard",
"augment": false,
"assay_id": "ASSAY",
"assay_type": null,
"credibility_range": {
"min": null,
"max": null
},
...
Hi @HellenNamulinda, no, I am using a file I made myself. What command did you run to fit the classification data you got with the example command?
@GemmaTuron what is the column name of your file?
bin
Thanks. This is surprising and is probably a bug.
@GemmaTuron I used zairachem fit -i train.csv -m model, because I first ran zairachem split -i input.csv to get the train and test sets.
Hi @GemmaTuron and @miquelduranfrigola, this is my observation. If the column name isn't activity (I tried changing it to another name), the split command fails:
File "/home/hellenah/zaira-chem/zairachem/cli/commands/split.py", line 48, in check_dataset_minimum_size
fold_num_positives = sum(df[df.fold == fold_id].activity)
File "/home/hellenah/anaconda3/envs/zairachem/lib/python3.10/site-packages/pandas/core/generic.py", line 6204, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'activity'
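A possible workaround for the split failure above, assuming the check really does need a column literally named activity (as the traceback suggests), is to rename the values column before running zairachem split. The sketch below uses only the Python standard library; the helper name rename_column and the file names are hypothetical and not part of ZairaChem.

```python
import csv


def rename_column(src_path, dst_path, old_name, new_name="activity"):
    """Copy a CSV file, renaming one header column (hypothetical helper)."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        if old_name not in header:
            raise ValueError(f"column {old_name!r} not found in {header}")
        # Rename only the header cell; the data rows are copied unchanged.
        header = [new_name if name == old_name else name for name in header]
        rows = list(reader)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return header
```

For the file discussed in this thread, the call would look like rename_column("my_data.csv", "input.csv", "bin"), where my_data.csv stands in for the user's own file.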
For the fit command, however, if the cut-off value isn't specified, the task first written to the data/parameters.json file is regression, regardless of the column name. If you check that file immediately, you will see regression as the task.
During the setup step, where data preparation is performed, this file gets updated. After all the checks and standardization (once "Descriptor calculation and LSH folding done" is logged), the right task is assigned and parameters.json is updated.
If the cut-off is specified (say, zairachem fit -i train.csv -c 0.1 -d low -m model), parameters.json has classification as the default task before any checks on the data are performed. Otherwise it is regression, which gets updated to classification by the end of the setup step.
So, before the describe step (calculating the different descriptors), the correct task can be seen in the parameters.json file.
cli/commands/fit.py
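To see which task is recorded at a given moment, one could read the task field directly from parameters.json. This is a minimal sketch, assuming the file sits at data/parameters.json inside the folder passed with -m and that it contains a "task" key as in the excerpt earlier in this thread; read_task is a hypothetical helper, not a ZairaChem function.

```python
import json
import os


def read_task(model_dir):
    """Return the "task" value from <model_dir>/data/parameters.json.

    The location of the file inside the model folder is an assumption
    based on the paths mentioned in this thread.
    """
    params_path = os.path.join(model_dir, "data", "parameters.json")
    with open(params_path) as f:
        params = json.load(f)
    # Returns None if the file has no "task" key.
    return params.get("task")
```

Calling read_task("model") once right after starting the fit and again after the setup step has finished should show whether the value changes from regression to classification, as described above.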
Thanks @HellenNamulinda
Hi @miquelduranfrigola
This issue persists, and I have a dataset only with classification data, which I cannot use as I get stuck while ZairaChem tries to do a regression:
Traceback (most recent call last):
File "/home/gturon/anaconda3/envs/zairachem2/bin/zairachem", line 33, in <module>
sys.exit(load_entry_point('zairachem', 'console_scripts', 'zairachem')())
File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/cli/commands/fit.py", line 124, in fit
s.setup()
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 233, in setup
self._tasks()
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 175, in _tasks
SingleTasks(os.path.join(self.output_dir, DATA_SUBFOLDER)).run()
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 401, in run
reg = reg_tasks.as_dict()
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 116, in as_dict
res["reg_raw_skip"] = self.raw(smoothen=True)
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 82, in raw
self._raw = self.smoothen(raw)
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 71, in smoothen
return SmoothenY(self.smiles_list, raw).run()
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 88, in run
boundaries = self.get_boundaries(y[idxs], repeats, lb, ub)
File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 69, in get_boundaries
boundaries[r] = t
UnboundLocalError: local variable 't' referenced before assignment
@HellenNamulinda what do you mean by this: "During the setup step where data preparation is performed, this file will get updated. After all the checks and standardization (once this is logged: Descriptor calculation and LSH folding done), the right task is assigned and the parameters.json is updated."
Did you successfully pass classification data (already binarised) and have ZairaChem train a model?
Mmm, I've been doing tests. I found a NaN that might be making the _is_a_simple_classification function fail. I think there are enough automated tests, provided the user does not have an unexpected value in the dataset, so we can close this issue.
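Since a stray NaN appears to have been the culprit here, a quick pre-check of the input file before running zairachem fit may help. The sketch below is an assumption about what counts as a clean binary column (only 0 and 1) and is not ZairaChem's own validation; find_unexpected_values is a hypothetical helper using only the standard library.

```python
import csv


def find_unexpected_values(csv_path, column="activity", allowed=(0.0, 1.0)):
    """Return (row_number, raw_value) pairs that are not clean binary labels.

    Row numbers count the header as row 1.
    """
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row_number, row in enumerate(reader, start=2):
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                # Empty cells and non-numeric text end up here.
                problems.append((row_number, raw))
                continue
            # "nan" parses as a float but never equals 0 or 1,
            # so it is reported here as well.
            if value not in allowed:
                problems.append((row_number, raw))
    return problems
```

An empty list means every value in the column is 0 or 1. For the file in this thread the call would be find_unexpected_values("input.csv", column="bin").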
It works just fine on my end. But we can look at it again if data issues cause the pipeline to break.
Describe the bug
If a binary classification file is passed, with the activity column already in binary, and the following command is run:
zairachem fit -i input.csv -m model_folder
ZairaChem interprets it as a regression, not a binary classification, as indicated by the data/parameters.json file.
Desktop (please complete the following information): Ubuntu 22.04 LTS
Additional context
This can be confusing, so we need to add clear instructions.