GemmaTuron opened this issue 1 month ago
Hi @miquelduranfrigola and @GemmaTuron, please see the results detailed below, which investigate how Olinda behaves under various training conditions. There is a lot of detail, which I'll try to describe here, but we may also want to discuss some things further on a call.
**ZairaChem Models**

I performed all the tests on two H3D datasets, one larger and one smaller:

1) **H3D Plasmodium NF54** using a <0.5 uM cutoff. The training data to June 2023 was 5691 compounds and the date-split test data to Jan 2024 was 358 compounds. Prospective AUROC: 0.82
2) **H3D Caco** using a >10 uM cutoff. The training data to June 2023 was 314 compounds and the date-split test data to Jan 2024 was 29 compounds. Prospective AUROC: 0.82
Coincidentally, they had the same AUROC, but I've included the ROC curves at the end.
**Test Parameters**

The test conditions had three dimensions to them, which I detail below: a) training schema, b) weighting schema, c) number of public reference library compounds.
**Training Schema**

I experimented with different ideas in the Olinda pipeline to see how they affected model performance (always including the Grover reference library compounds).

1) **3 Epoch**: The surrogate model is given 3 epochs of training time to learn the ZairaChem scores.
2) **30 Epoch**: Training epochs increased to a maximum potential of 30, with no other changes. The remaining models also use 30 epochs.
3) **Zaira and Reference Validation**: Instead of using only the original ZairaChem training data for validation, here I also include the more general reference library compounds. The goal is to increase generality without losing performance on the original task.
4) **More Confident Zaira Preds**: Here I explored the effect of artificially adjusting the ZairaChem predictions in the range 0.3 to 0.7 to be more extreme, meaning that Olinda trained on predictions that were more confident (but with less dynamic range).
5) **Correct Wrong Zaira Preds**: Since we have both the original training data for the ZairaChem model and the corresponding predictions, we can correct the incorrect predictions before passing the values to Olinda for surrogate training.
6) **Zaira True Binary**: Here I explored the effect of training Olinda on the original true binary labels of the ZairaChem training set instead of the prediction scores.
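For concreteness, schema 5 (correcting wrong ZairaChem predictions against the known training labels) could be sketched roughly as below. This is an illustrative sketch only, not the actual Olinda code; the function name and the 0.5 threshold are assumptions.

```python
import numpy as np

def correct_wrong_predictions(y_true, y_pred, threshold=0.5):
    """Illustrative sketch of training schema 5: wherever the ZairaChem
    prediction disagrees with the known binary training label, replace the
    prediction score with the label before surrogate training."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A prediction is "wrong" if it falls on the other side of the threshold
    # from the true label.
    wrong = (y_pred >= threshold) != (y_true >= threshold)
    return np.where(wrong, y_true, y_pred)

# Example: the first prediction (0.2 for a true active) gets corrected to 1.0,
# the others are left untouched.
corrected = correct_wrong_predictions([1, 0, 1], [0.2, 0.1, 0.9])
```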
**Weighting Schema**

There are two sources of data (the ZairaChem training set and the Grover reference library) and two classes (active/inactive) for the binary classification models. These can be weighted relative to each other during training.

a) **Unweighted**: Data is not weighted.
b) **Train weighted**: The ZairaChem training set compounds are weighted inversely to the number of compounds used from the Grover reference library. This is to maintain emphasis on the original scope of the ZairaChem model.
c) **Class weighted**: The actives and inactives (based on ZairaChem prediction scores with a threshold of 0.5) are weighted according to the ratio of the class sizes, to address class imbalance.
d) **WeightedAdd**: Both train weighting and class weighting are applied, and the weights are combined additively.
e) **WeightedMultiply**: Both train weighting and class weighting are applied, and the weights are combined through multiplication.
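A minimal sketch of how these per-sample weights might be built and combined is shown below. The function name, mode strings, and the exact inverse-frequency formulas are assumptions for illustration; the real Olinda implementation may differ.

```python
import numpy as np

def build_sample_weights(is_train_set, scores, mode="unweighted", threshold=0.5):
    """Sketch of the five weighting schemas (names are illustrative).

    is_train_set: boolean array, True for ZairaChem training compounds,
                  False for Grover reference library compounds.
    scores:       ZairaChem prediction scores, thresholded to define classes.
    """
    is_train_set = np.asarray(is_train_set)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)

    # Train weighting: upweight the ZairaChem set inversely to the number
    # of reference compounds, so the original task keeps its emphasis.
    n_train = int(is_train_set.sum())
    n_ref = n - n_train
    train_w = np.where(is_train_set, max(n_ref, 1) / max(n_train, 1), 1.0)

    # Class weighting: inverse class frequency on the thresholded scores.
    is_active = scores >= threshold
    n_act = int(is_active.sum())
    class_w = np.where(is_active,
                       n / (2 * max(n_act, 1)),
                       n / (2 * max(n - n_act, 1)))

    if mode == "unweighted":
        return np.ones(n)
    if mode == "train_weighted":
        return train_w
    if mode == "class_weighted":
        return class_w
    if mode == "weighted_add":
        return train_w + class_w
    if mode == "weighted_multiply":
        return train_w * class_w
    raise ValueError(f"unknown mode: {mode}")
```

The resulting array could then be passed as `sample_weight` to most training APIs (e.g. Keras `Model.fit` or scikit-learn estimators).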
**Number of public reference library compounds**

Each combination above was tested with 1k, 10k, 50k, and 100k Grover reference library compounds.
**Results**

Models seemed to get the general ordering of compounds correct based on the ROC curves (good), so I've focused more on the R² correlation between ZairaChem and Olinda scores as a clearer way of deciding on the best initial setup.
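The R² between teacher and surrogate scores can be computed as the coefficient of determination, treating the ZairaChem scores as the reference values. A minimal sketch (equivalent to scikit-learn's `r2_score`):

```python
import numpy as np

def r_squared(zaira_scores, olinda_scores):
    """Coefficient of determination between ZairaChem (reference) and
    Olinda (surrogate) prediction scores."""
    z = np.asarray(zaira_scores, dtype=float)
    o = np.asarray(olinda_scores, dtype=float)
    ss_res = np.sum((z - o) ** 2)          # residual sum of squares
    ss_tot = np.sum((z - z.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect surrogate reproduces the teacher scores exactly and gives R² = 1; large disagreements can push R² below 0.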
**Summary**

I think it makes sense to go with training schema 5, where the incorrect ZairaChem predictions are corrected before being used in Olinda. Based on the metrics, it's not particularly clear which weighting scheme to use, so I'd either stick with the unweighted approach or a class-weighted setup.
**Future Work**

If we do agree on one of these two setups, then I'll run the TDC benchmarks with that paradigm and also test the H3D panel of models.
I think there are still two open questions to explore, but perhaps they should come after we settle on version 1.

1) **Generalisation**: Are these surrogates better than, worse than, or equal to the ZairaChem models on out-of-scope test data?
2) **Worth the effort**: How do these surrogates compare to a neural network trained directly on the original training set? (i.e. how much value is there in running ZairaChem and then Olinda, versus just a small scikit-learn multi-layer perceptron?)
Happy to hear your thoughts/questions. It's a lot of information, so hopefully I've captured it all, but we can organise a call too.
Thank you so much, @JHlozek, this is remarkable stuff, and thanks for nicely reporting on it. If you agree, let's discuss this on a call (Monday 21/10) and then I will reflect my thoughts in written form.
After our discussion, we have chosen to proceed with an initial configuration of:
Thanks @miquelduranfrigola and @JHlozek. I agree with the above; we will need to translate this into clear documentation for Olinda users: what the default setup is and where they could change it if needed. Do you think, @JHlozek, we should add a section on this to the Olinda README?
I've now reflected these choices at the end of the README and included relevant code snippets showing where the parameters are defined.
@JHlozek is looking into the best parameters for the ZairaChem distillation of Olinda