Very low r2 - Githubissues

jhjensen2 commented 3 years ago

Not sure if this is the best forum for this but here goes.

I have used chemprop for several IC50 data sets and run into the same problem each time: the RMSE is OK, but the R2 is very bad. I get better results using fingerprints and RF or something similar. I am wondering if I am doing something wrong or if it's just my small data set (500 points). The code is here

hesther commented 3 years ago

At first glance, judging from the plot of predicted vs. true labels, I would assume something went wrong here. I will look into it, try to reproduce your results, and report back to you here. Thanks for providing the full notebook, that will speed up troubleshooting considerably.

hesther commented 3 years ago

Here are my first observations. First of all, I could reproduce your findings using chemprop both within a Python script and via the command line, so you are not doing anything wrong. I then looked further into the poor performance of the model. When troubleshooting an ML model, it is immensely helpful to look at the distribution of the target data, as well as the training and test set errors:

For your pIC50 data, the mean is 5.14 and the standard deviation is 0.76. So a trivial model that predicts 5.14 regardless of the actual SMILES input has an RMSE of 0.76 (equal to the standard deviation) and, by definition, an R2 of 0. When I retrained the model:

training set: RMSE 0.71
test set: RMSE 0.67
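
The baseline comparison above can be sketched in a few lines of Python. This is only an illustration on synthetic numbers, not the data from the notebook: the helper names (rmse, r2_score, r2_from_rmse) and the randomly drawn targets are assumptions made for the example. It shows that always predicting the mean gives an RMSE equal to the standard deviation and an R2 of 0, and how an RMSE can be translated into an approximate R2.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def r2_from_rmse(model_rmse, target_std):
    """R2 implied by an RMSE, if target_std is the (population) std of the same set."""
    return 1 - (model_rmse / target_std) ** 2


# Synthetic stand-in for the targets (assumption: NOT the real pIC50 values).
random.seed(0)
y = [random.gauss(5.14, 0.76) for _ in range(500)]

# "Dumb" baseline: predict the mean for every molecule.
mean_pred = [statistics.fmean(y)] * len(y)
baseline_rmse = rmse(y, mean_pred)    # matches the population std of y
baseline_r2 = r2_score(y, mean_pred)  # 0 for a constant mean prediction

print(f"baseline RMSE = {baseline_rmse:.3f}, std = {statistics.pstdev(y):.3f}")
print(f"baseline R2   = {baseline_r2:.3f}")
print(f"implied R2    = {r2_from_rmse(0.67, 0.76):.2f}")
```

If the test set has roughly the same spread as the full data set (standard deviation 0.76, which is an assumption), a test RMSE of 0.67 corresponds to an R2 of only about 0.22.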

What looks like an acceptable RMSE is actually terrible when compared to the baseline of 0.76: the model basically learns nothing and predicts values close to the mean of 5.14 for all inputs. The training and test sets perform equally badly. As a rule of thumb (for any ML model, not only Chemprop):

Training error low, test error high: The model is overfitting. Add regularization (e.g. dropout, early stopping, a simpler model, ...) or add more data.
Training error high, test error high: The model is underfitting. Usually the representation of the input is not suitable, the chosen model is too simple, or there is too much regularization. Adding more data usually does not help much.
Training error low, test error low: Good model.
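
The rule of thumb above can be written down as a small helper. This is a rough sketch, not part of Chemprop: the function name diagnose and the cutoff high_ratio (an error counts as "high" when it is at least 80% of the mean-predictor baseline) are arbitrary assumptions chosen for illustration.

```python
def diagnose(train_rmse, test_rmse, baseline_rmse, high_ratio=0.8):
    """Classify a fit by comparing train/test RMSE to the mean-predictor baseline.

    high_ratio is a hypothetical cutoff: an error is called "high" when it is
    at least high_ratio * baseline_rmse. Tune it to your own judgment.
    """
    train_high = train_rmse >= high_ratio * baseline_rmse
    test_high = test_rmse >= high_ratio * baseline_rmse

    if train_high and test_high:
        return "underfitting"
    if not train_high and test_high:
        return "overfitting"
    if not train_high and not test_high:
        return "good"
    # Train error high but test error low is not covered by the rule of thumb.
    return "inconclusive"


# Numbers from the retrained model: train 0.71, test 0.67, baseline 0.76.
print(diagnose(0.71, 0.67, 0.76))
```

With the numbers reported for the retrained model, both errors sit close to the baseline, so the helper returns "underfitting".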

So in this case, we clearly have a quite severe case of underfitting. This happens when the chosen representation is not suitable, the model has too few degrees of freedom, or the architecture is not suitable. Chemprop has plenty of degrees of freedom with your chosen hyperparameters, so I would assume that the fault lies with the representation, i.e. the target is not a function of local atom environments alone (which is basically what the MPNN layers encode). To test my assumption, I added additional features to the MPNN output (via --features_generator rdkit_2d_normalized --no_features_scaling), which led to:

training set: RMSE 0.63
test set: RMSE 0.58

which is still not good, but at least we are moving in the right direction. I therefore think you would need to run a hyperparameter search over the parameters that change the representation (atomic/molecular features, aggregation, depth of message passing) if you want to use Chemprop, since its default representation is clearly not working for your data. Without knowing more about the data, it is difficult to give good advice here. You can start by asking yourself which properties of a molecule give it a high or low score according to your chemical intuition (e.g. certain charged groups, 3D structure, local atom environment, ...) and try to use a representation that incorporates some or all of those.
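
As a rough sketch of how such a comparison could be set up, the snippet below only assembles command lines for a few representation variants and prints them; it does not run Chemprop. The two feature flags are the ones mentioned above. Everything else is an assumption to check against your installed version: the chemprop_train entry point and the --data_path, --dataset_type, --save_dir, --depth and --aggregation options are as I remember them from the Chemprop v1 command line, and the file and directory names are hypothetical.

```python
import shlex


def build_train_command(data_path, save_dir, extra_flags=()):
    """Assemble a chemprop_train call as a list of arguments (assumed v1-style CLI)."""
    cmd = [
        "chemprop_train",
        "--data_path", data_path,         # hypothetical CSV with SMILES and pIC50
        "--dataset_type", "regression",
        "--save_dir", save_dir,           # hypothetical output directory
    ]
    cmd.extend(extra_flags)
    return cmd


# Representation variants to compare; the flags in "rdkit_features" come from
# the discussion above, the other two are assumptions about the v1 CLI.
variants = {
    "default": [],
    "rdkit_features": ["--features_generator", "rdkit_2d_normalized",
                       "--no_features_scaling"],
    "deeper_messages": ["--depth", "5"],
    "sum_aggregation": ["--aggregation", "sum"],
}

commands = {
    name: build_train_command("data.csv", f"ckpt_{name}", flags)
    for name, flags in variants.items()
}

for name, cmd in commands.items():
    print(f"# {name}")
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell. Comparing the resulting training and test RMSE against the 0.76 baseline then shows which change to the representation actually helps.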

jhjensen2 commented 3 years ago

Thank you very much for your answer! Very useful.

chemprop / chemprop

Very low r2 #146