MolecularAI / QSARtuna

QSARtuna: QSAR model building with the optuna framework

ChemProp classifier warnings #4

Open WoutVanEynde opened 3 months ago

WoutVanEynde commented 3 months ago

When I use the following code:

from optunaz.utils.preprocessing.splitter import Random
from optunaz.utils.preprocessing.deduplicator import KeepMedian
from optunaz.config.optconfig import ChemPropHyperoptClassifier
from optunaz.descriptors import SmilesBasedDescriptor, SmilesFromFile

config = OptimizationConfig(
    data=Dataset(
        input_column="Smiles",  # Smiles column.
        response_column="Activity",  # Activity column.
        training_dataset_file="reinvent4_preparation.csv",  # This will be split into train and test.
        split_strategy=Random(fraction=0.2),
        deduplication_strategy=KeepMedian(),
    ),
    descriptors=[
        SmilesFromFile.new(),
    ],
    algorithms=[
        ChemPropHyperoptClassifier.new(epochs=100, num_iters=2),  # num_iters>2: enable hyperopt within ChemProp trials
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.CLASSIFICATION,
        cross_validation=5,
        n_startup_trials=50,
        n_trials=300,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

# Set up basic logging.
import logging
from importlib import reload
reload(logging)
logging.basicConfig(level=logging.INFO)

# Avoid deprecation warnings from packages etc.
import warnings
warnings.simplefilter("ignore")
def warn(*args, **kwargs):
    pass
warnings.warn = warn

study = optimize(config, study_name="BPA_ChemProp_hyperopt")

I get the following warnings:

/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/home/wout/.anaconda/envs/Qptuna/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1497: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

I am not sure if this is a problem. I also wonder what the optimal number of epochs, trials, etc. is, because it feels like everything in the tutorial is kept small to save computation. I would think that the number of trials should be as high as possible, but when I use it for ChemProp, all trials have the same output value.

lewismervin1 commented 3 months ago

Thanks for the interest in QSARtuna (formerly Qptuna).

It seems like you may have a highly imbalanced dataset?

I would recommend trying the same run using the ChemPropClassifier (not the hyperopt version of ChemProp). The hyperopt version, ChemPropHyperoptClassifier, normally requires large values of num_iters, n_trials and epochs during training (see here), whereas the ChemPropClassifier will work quite well out of the box.

Can you try with the following instead in your config?:

algorithms=[
    ChemPropClassifier.new(epochs=100),
]

WoutVanEynde commented 3 months ago

Thank you for your quick response,

I still get the same warnings. Maybe it is because of the dataset? I have a dataset of 99 inactives and 33 actives, and I used 20% as test data. I use Boolean values for activity, and I get the same warnings when using integers as input.

Might the dataset be too small for 5-fold cross-validation?

I just checked the tutorial, and the same warnings are generated there when using the ChemProp classifier.

lewismervin1 commented 3 months ago

@WoutVanEynde yes, that may be quite a small dataset for 5-fold CV. ChemProp will also perform its own additional split during training.
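To make the size concern concrete, here is a rough back-of-the-envelope sketch using the counts given in this thread (99 inactives, 33 actives, 20% test data, 5 folds). The rounding of the test set size is an assumption; QSARtuna may round differently, and the extra split ChemProp performs internally is not modelled here.

```python
import math

n_inactive, n_active = 99, 33      # counts given in the thread
n_total = n_inactive + n_active    # 132 compounds in total
test_fraction, n_folds = 0.2, 5

# Assumption: the test set size is rounded up; the real splitter may differ.
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test

# Expected validation fold size and expected number of actives in it,
# assuming the active rate of the full dataset carries over to each fold.
active_rate = n_active / n_total
val_per_fold = n_train / n_folds
actives_per_fold = val_per_fold * active_rate

print(f"train={n_train}, test={n_test}")
print(f"validation compounds per fold: {val_per_fold:.0f}")
print(f"expected actives per validation fold: {actives_per_fold:.2f}")
```

Under these assumptions each validation fold holds roughly 21 compounds, of which only about 5 are active, so a model that predicts "inactive" for an entire fold is quite plausible.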

You could try to reduce the number of cross-validation folds from 5. Perhaps you could even try the default ChemProp parameters: to do that, just set the CV to 1 fold and the number of trials to 1. This would essentially apply the default (sensible) parameters to the ChemPropClassifier (as defined by the original ChemProp authors) and evaluate on the test fold.

WoutVanEynde commented 3 months ago

Thank you for your response. It seems that something goes wrong when increasing the number of trials, as my predictive values only go up to 0.5 when the warning occurs. This warning looks like something that is explained here: https://stackoverflow.com/questions/54150147/classification-report-precision-and-f-score-are-ill-defined

But I have rechecked my input data and I only have 0 and 1 as activity. When I keep the number of trials at 1 or another low value, it seems to predict correctly. As the number of trials increases, it starts to go wrong at some seemingly random point.
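For context, the warning text says "no predicted samples", which points at the predictions rather than the input labels. The sketch below is a hand-rolled precision function, not scikit-learn's implementation, and the `zero_division` argument only mimics the parameter named in the warning message. It shows that perfectly valid 0/1 labels still give an undefined precision when a model never predicts the active class.

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision = TP / (TP + FP) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # Nothing was predicted active, so the ratio would be 0/0.
        # Fall back to the chosen placeholder value instead.
        return zero_division
    return tp / (tp + fp)


y_true = [0, 0, 0, 1, 1]        # valid labels: only 0 and 1
all_inactive = [0, 0, 0, 0, 0]  # model never predicts the active class
mixed = [0, 1, 0, 1, 0]         # one true positive, one false positive

print(precision(y_true, all_inactive))  # undefined case, falls back to zero_division
print(precision(y_true, mixed))
```

If this is what happens in some trials, those trials would all receive the same fallback score, regardless of how clean the input data is.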

This might be a bit off track but are there any papers/literature you can recommend for beginners in machine learning/AI?

Thanks again for the information and help!

lewismervin1 commented 3 months ago

Yes, I see. I do not think it is your input data. This may be caused by your choice of splitting strategy, "Random". This does not ensure that both classes (0 and 1) are represented in the train and test splits. Can you try repeating with a stratified split, e.g. using "Stratified(fraction=0.2)"?
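The difference between the two strategies can be illustrated with a small standalone sketch. The two split functions below are simplified, hypothetical stand-ins written for this example, not the QSARtuna `Random` and `Stratified` implementations; they only show how the number of actives in the test set behaves across many random seeds for the 99/33 dataset described above.

```python
import random

labels = [0] * 99 + [1] * 33  # counts given in the thread


def random_split(labels, fraction, rng):
    """Shuffle all indices and cut off the first `fraction` as the test set."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = round(fraction * len(labels))
    return idx[n_test:], idx[:n_test]


def stratified_split(labels, fraction, rng):
    """Apply the same cut separately within each class."""
    train, test = [], []
    for cls in sorted(set(labels)):
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(cls_idx)
        n_test = round(fraction * len(cls_idx))
        test += cls_idx[:n_test]
        train += cls_idx[n_test:]
    return train, test


def count_actives_in_test(split_fn, seeds):
    """Number of actives that land in the test set, one value per seed."""
    counts = []
    for seed in seeds:
        _, test = split_fn(labels, 0.2, random.Random(seed))
        counts.append(sum(labels[i] for i in test))
    return counts


random_counts = count_actives_in_test(random_split, range(200))
stratified_counts = count_actives_in_test(stratified_split, range(200))
print("random:     min/max actives in test =", min(random_counts), max(random_counts))
print("stratified: min/max actives in test =", min(stratified_counts), max(stratified_counts))
```

In this sketch the stratified split puts the same number of actives in the test set for every seed, while the random split varies from seed to seed.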

WoutVanEynde commented 3 months ago

That seems to give the same problems, even with CV = 1.

lewismervin1 commented 3 months ago

It would be good to check your processed data. Can you set save_intermediate_files to True and check how many compounds the train and test CSV files produced by QSARtuna each contain?
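One possible way to do that check is sketched below with the standard library only. The helper `class_counts` is hypothetical, and the names and locations of the intermediate files are not given in this thread, so the demo runs on a throwaway CSV; the column name "Activity" is taken from the config above.

```python
import csv
import tempfile
from collections import Counter


def class_counts(csv_path, response_column="Activity"):
    """Count how many rows carry each value of the response column."""
    with open(csv_path, newline="") as handle:
        return Counter(row[response_column] for row in csv.DictReader(handle))


# Demo on a temporary file; point class_counts at the saved train and
# test CSVs instead once their paths are known.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["Smiles", "Activity"])
    writer.writerows([["CCO", "0"], ["c1ccccc1", "0"], ["CC(=O)O", "1"]])

demo_counts = class_counts(tmp.name)
print(dict(demo_counts))
```

Running this on both files would show the total number of compounds and whether both classes are present in each split.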