Hi @leilayesufu and @Richiio !
Super exciting news! Having completed a few model incorporations successfully, we are ready to move on to the next stage: a more data-science-oriented piece of work to ensure model reproducibility. Please see the project description. You'll need to work together on that project, which is also fantastic, and I'll help out as well as the other mentors! To start with, let's distribute the first tasks:
I know this is a very large project, the objective for the rest of your internship is just to set this going and get as far as we can, so that new contributors can take it up next.
I have prepared some structure in the repository already. This project will be team-driven, so it is important to abide by some general rules to avoid conflicts:
Hi @GemmaTuron,
Thank you for the clear instructions. I'll get started right away on reformatting the hERG validation. I'll begin by updating the README file to summarize our work, incorporating the following sections:
**Models Used:** I'll list the models used in this comparison along with their repository links. Additionally, I'll describe the output format of each model and how we interpret it, whether it's probabilities, pIC50 values, etc.

**Data Acquisition:** I'll detail how we obtained the data for the hERG validation.

**Combined Data:** I'll explain what the combined data is and how we merged the different datasets, if applicable.

**Data Pre-processing:** I'll outline the steps taken for data pre-processing, including converting the data to canonical SMILES and generating InChIKeys, the featurization technique employed, and any visualizations performed. (A rough sketch of the canonicalization step is below.)
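This is a minimal sketch of that step, assuming the combined file has a single smiles column (file and column names here are illustrative, not final):

```python
import pandas as pd
from rdkit import Chem

# hypothetical input file with a "smiles" column
df = pd.read_csv("herg_combined.csv")

def canonicalize(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return pd.Series([None, None])  # unparsable SMILES are dropped below
    return pd.Series([Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)])

df[["canonical_smiles", "inchikey"]] = df["smiles"].apply(canonicalize)
df = df.dropna(subset=["canonical_smiles"])
df.to_csv("herg_combined_processed.csv", index=False)
```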
After completing the README, I'll proceed to review each of the hERG publications to identify one specific example from each publication that we can reproduce for each model. Additionally, I'll work on the scripts in the src folder for any reusable tasks.
If I encounter any challenges or have questions during the process, I'll make sure to reach out for assistance.
Looking forward to collaborating with everyone on this project!
Leaving some fun reading material for both @leilayesufu and @Richiio here to dive deeper into model bias when you get a chance:
- https://www.baeldung.com/cs/learning-curve-ml
- https://towardsdatascience.com/generalization-regularization-overfitting-bias-and-variance-in-machine-learning-aa942886b870
Thanks @GemmaTuron @miquelduranfrigola @DhanshreeA
For my task, I want to find out how the initial 800 molecules obtained from this library (https://github.com/ersilia-os/groverfeat) were standardized. I have been able to process the first 800 molecules, but I want to ask about the 200 to be obtained from COCONUT. Were salts removed? Were tautomeric forms standardized? Were functional groups normalized? Or was a mix of these done to obtain the SMILES we end up with?
Or has no standardization been done on the first 800, and am I meant to decide how to standardize them?
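For reference, the operations I mean would look roughly like this RDKit sketch (an illustrative pipeline, not necessarily what was actually applied):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)         # normalize functional groups, fix charges
    mol = rdMolStandardize.FragmentParent(mol)  # keep the parent fragment (removes salts)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # canonical tautomer
    return Chem.MolToSmiles(mol)

print(standardize("CC(=O)Oc1ccccc1C(=O)O.[Na+].[Cl-]"))  # counterions stripped
```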
Hi @Richiio
Those were not standardised in any particular manner. The dataset is already in the model-validations repo under /data and contains ~200k molecules.
@GemmaTuron @miquelduranfrigola @DhanshreeA
I have curated a dataset of 1000 rows: 800 rows of data from (https://github.com/ersilia-os/groverfeat) and 200 rows from COCONUT. Compounds obtained from COCONUT were randomly sampled, all with a molecular weight under 800, and all were processable by RDKit.
Random 800 from Grover: sampled_smiles.csv
Random 200 from COCONUT: final_200.csv
The combined dataset: Validation.csv
We have three columns: the SMILES, the canonical SMILES, and their InChIKeys.
The notebook containing data preprocessing done on the data can be found here: https://colab.research.google.com/drive/1H7yVFSyIC5GnnwEiGtqzDnJLGBxKLNdm?usp=sharing
This will be uploaded to the model-validations repo folder.
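For reference, the COCONUT filtering and sampling step in the notebook looks roughly like this (file and column names are illustrative):

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# hypothetical input file with COCONUT SMILES
coconut = pd.read_csv("coconut_smiles.csv")

def keep(smi):
    mol = Chem.MolFromSmiles(smi)  # drop anything RDKit cannot parse
    return mol is not None and Descriptors.MolWt(mol) < 800

filtered = coconut[coconut["smiles"].map(keep)]
final_200 = filtered.sample(n=200, random_state=42)  # random_state only for repeatability
final_200.to_csv("final_200.csv", index=False)
```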
Hi @Richiio
Good start, but please work on the repository; I created a notebook specifically for that task already (/notebooks/00_...), and added the reference_library in the /data folder. Load it from there so we always have the source data in the same place. I did add the header to make it easier to read. So please move the code from the Colab to the repo and, once it is ready, open a PR. Before doing so, I suggest preparing the UMAP and PCA representations (functions of the umap and sklearn packages, respectively) to see how the molecules are distributed. It would be interesting to paint the synthetic molecules and the natural products in different colours. There are plenty of examples on the internet of how to do that, for example: https://blog.reverielabs.com/mapping-chemical-space-with-umap/ - give it a try and let us know!
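Something along these lines would work as a starting point (a sketch only; the file path, smiles column, and source labels are assumptions):

```python
import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from rdkit import Chem
from rdkit.Chem import AllChem

# assumed: the reference library with a column flagging synthetic vs natural compounds
df = pd.read_csv("data/reference_library.csv")

# featurize with Morgan fingerprints (library was curated to be RDKit-parsable)
fps = []
for smi in df["smiles"]:
    mol = Chem.MolFromSmiles(smi)
    fps.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)))
X = np.array(fps)

emb_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

colors = df["source"].map({"synthetic": "tab:blue", "natural": "tab:green"})
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, [emb_umap, emb_pca], ["UMAP", "PCA"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=colors, s=5)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("chemical_space.png", dpi=150)
```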
@GemmaTuron Good morning. For the reproducibility of eos2ta5: the authors tested the model on 3 different test sets against other models. Here's the result we could try to reproduce: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00541-z/tables/4
Links to the test sets: Test set-I, Test set-II, Test set-III
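To reproduce the table, we would compute the same metrics from the model's predictions on each test set, roughly like this (file names, column names and the 0.5 threshold are assumptions):

```python
import pandas as pd
from sklearn.metrics import matthews_corrcoef, accuracy_score, roc_auc_score

# hypothetical: eos2ta5 predictions already merged with the labelled test set
df = pd.read_csv("test_set_I_with_predictions.csv")
y_true = df["label"]                  # assumed 1 = hERG blocker, 0 = non-blocker
y_prob = df["probability"]            # assumed model output column
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold

print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
```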
@GemmaTuron
I featurized the data using Morgan fingerprints and obtained a vector representation of the SMILES, which was used for the statistical analysis with UMAP and PCA. The plot has been created and a pull request will be made before the end of today. My system is currently in for a check; I should have access to it again by the end of today.
Hi @Richiio I did not see the PR, let us know once it is ready. @leilayesufu that is a good suggestion, let's see if we can find the same for the other models
@leilayesufu can you share the nuances of each of the test sets you have cited above? E.g. whether a set has more negative instances or more positive instances.
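Something as simple as this would show the class balance (assuming each test set file has a label column):

```python
import pandas as pd

# hypothetical file/column names for one of the test sets
df = pd.read_csv("test_set_I.csv")
print(df["label"].value_counts())                # absolute counts per class
print(df["label"].value_counts(normalize=True))  # class proportions
```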
For eos4tcc: on pages 7-8 of the publication, the authors compared the proposed BayeshERG model with other models using the external datasets provided here. This could serve as the reproducibility test for eos4tcc.
@DhanshreeA, in response to this: https://github.com/ersilia-os/ersilia/issues/966#issuecomment-1952277187. The figure below shows the classification of the test sets.
Hi @Richiio I have reviewed your PR: https://github.com/ersilia-os/model-validations/pull/10. Please resolve the issues present in your work; it should not take very long, as they're mostly very small changes. Thanks.
Hi @leilayesufu , the publication link that you have provided is broken and I cannot view it. In any case, can you provide more insight into the data summary statistics, including the comparison tests the authors have performed? Also, please do the same for the remaining two models. Thank you.
Hi @leilayesufu !
Hi @Richiio !
Agree with all except eos43at -- @leilayesufu this should be incorporated into the hERG testing, as it is a hERG cardiotoxicity model - can you do that?
Sarima, I've re-set the workflows so that the models are updated and can be used - you can start by structuring the repository folder for /cytotox and the notebooks, data and so on that you will use.
Hi @Richiio
eos4cxk - this is an anti-COVID model, so I would not classify it as cytotox even though it includes some cytotoxicity endpoints
The others, yes, let's go for them as well. I've set up the updated workflows.
Thanks @GemmaTuron I'll get started on the eight models.
Hi @Richiio
We will first do the model bias check, which means running the 1000 compounds against all models and simply producing scatter plots; we want to see, for example, if there are very extreme values. For this we do not need the data. Next we will focus on model reproducibility - identify one example per model (only those that have publications attached; eos3le9, for example, does not) that we can reproduce, as Leila has done.
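The plotting part could be as simple as this sketch (assuming one predictions CSV per model, with the model's score in the last column):

```python
import glob
import pandas as pd
import matplotlib.pyplot as plt

# assumed layout: predictions/<model_id>.csv, one file per model
files = sorted(glob.glob("predictions/*.csv"))
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, path in zip(axes.ravel(), files):
    scores = pd.read_csv(path).iloc[:, -1]  # assume the last column holds the prediction
    ax.scatter(range(len(scores)), scores, s=4)
    ax.set_title(path.split("/")[-1].replace(".csv", ""))
    ax.set_xlabel("compound index")
    ax.set_ylabel("prediction")
plt.tight_layout()
plt.savefig("bias_check_scatter.png", dpi=150)
```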
@GemmaTuron eos43at was part of the hERG models that we used for validation, but we couldn't get the dataset for it and had to proceed without it. I can add it for the model reproducibility, though.
@GemmaTuron
Task update: I've been able to get the predictions.csv file for five of the models. For the remaining three, I had to install torch, which has been giving me an HTTPS connection error. I will retry the installation at night when I have a better connection, and then create a scatter plot of the eight models.
Hi @Richiio
Remember you can use Codespaces. That would solve the problems with downloading packages.
Hi @Richiio could you help me understand why you needed to install torch? The models should be self contained and should not require additional dependency installation from the user's end (that's you). If it happened, we should flag this. Also could you mention which three models you had to do this for?
@DhanshreeA
That was a fault on my end. I was trying to reproduce the environment locally as if the models hadn't been incorporated. I could simply have run the ersilia fetch command, which is what I did on Codespaces.
Hi @Richiio I have merged your PR mainly because I do not want it to grow uncontrollably large, but there is a lot more work to be done.
Here are my comments and suggestions, please open a new PR for incorporating these.
Bringing @GemmaTuron's suggestions from the PR to here: https://github.com/ersilia-os/model-validations/pull/11#issuecomment-1958851879 @leilayesufu let's do the following as Gemma has suggested:
For eos7kpb, we have the following column names in the output:

pf_nf54, pf_k1, mtb, cho, hepg2, clint_h, clint_m, clint_r, caco_2, aq_sol, cyp2c9, cyp2c19, cyp3a4, cyp2d6, pf_nf54_norm, pf_k1_norm, mtb_norm, cho_norm, hepg2_norm, clint_h_norm, clint_m_norm, clint_r_norm, caco_2_norm, aq_sol_norm, cyp2c9_norm, cyp2c19_norm, cyp3a4_norm, cyp2d6_norm

I will close this issue; if we work on model-validations, I suggest we continue the discussion in the appropriate repo.
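As a final aside on the eos7kpb output above: since every raw endpoint has a _norm counterpart, splitting the two groups for inspection is straightforward (the output file name is an assumption):

```python
import pandas as pd

# hypothetical output file from eos7kpb
df = pd.read_csv("eos7kpb_output.csv")

norm_cols = [c for c in df.columns if c.endswith("_norm")]
raw_cols = [c[: -len("_norm")] for c in norm_cols]

print(df[raw_cols].describe())   # raw endpoint predictions
print(df[norm_cols].describe())  # their normalized counterparts
```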
Summary
The Hub has grown fast in recent months, and as such we need to make sure the models are reproducible and give the expected results, as some small bugs might have been introduced inadvertently. In this project, we will use the /model-validations repository to make sure the models perform well. There are, at least, three levels of model performance that we can check for:
Objective(s)
This project will be done in several steps. The current objectives are:
Deliverables: