ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

hERG model validation #924

Closed GemmaTuron closed 8 months ago

GemmaTuron commented 9 months ago

Summary

The Ersilia Model Hub features several hERG models. We'd like to compare their performance using an unbiased dataset obtained from the literature.

Scope

Batch 🐕

Objective(s)

Team

Role & Responsibility Username(s)
Data Scientist @leilayesufu
Mentor @DhanshreeA
Project Manager @GemmaTuron

Timeline

When the open model tasks are completed

Documentation

No response

GemmaTuron commented 9 months ago

Hi @leilayesufu !

Following today's discussion, and given that you worked on the last hERG model, I think this can be a good continuation of the work. Have a look and let me know

leilayesufu commented 9 months ago

Thank you @GemmaTuron I'll start by identifying the hERG models in the Ersilia Model Hub.

DhanshreeA commented 9 months ago

Hey @leilayesufu an easy way to go about doing that is to visit the Model Hub page on the Ersilia website https://www.ersilia.io/model-hub and click on the hERG tag. That shows all the hERG models currently incorporated within the hub. From what I can see now, there are five such models.

leilayesufu commented 9 months ago

Thank you! @DhanshreeA I've identified the hERG models incorporated in the hub

eos2ta5 eos4tcc eos30f3 eos30gr eos43at

GemmaTuron commented 9 months ago

Great @leilayesufu

Steps now:

  1. Identify which datasets they use and whether they are open; compare them to see if some are redundant (remember SMILES can have different forms, so always use the canonical SMILES or, even better, the InChIKey)
  2. Fetch the models and make sure they work
  3. Create a test dataset that contains hERG+ and hERG- molecules (ideally, ones that are not in the training sets of the models)

Let me know for point 1 how many datasets you can find

leilayesufu commented 9 months ago

1) eos2ta5 - datasets used:
   Training dataset: https://github.com/Abdulk084/CardioTox/blob/master/data/train_validation_cardio_tox_data.tar.xz
   Positively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_pos.csv
   Negatively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_neg.csv
   Relatively larger negatively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_new.csv

2) eos4tcc - datasets used:
   Pretraining datasets: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Pretraining
   Finetuning datasets: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Finetuning
   External datasets the model was validated on: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/External

3) eos30gr - datasets used: checkpoints unavailable; the model was retrained.

4) eos30f3 - datasets used:
   Training dataset: https://github.com/AI-amateur/DMPNN-hERG/blob/main/Figure2/Cai_TableS3_fixed.csv

5) eos43at - the datasets in the original model were available for download, so I downloaded them here: https://github.com/leilayesufu/log_files/tree/master/dataset

GemmaTuron commented 9 months ago

Hi @leilayesufu

That's great. Can you create a small repo with the following structure? /data /notebooks

In data you can add all the datasets (maybe use the eosID for each one so we know what we are talking about), and in notebooks open a Jupyter notebook where you load and compare the different datasets; let's see if there are repeated SMILES or not. Remember to compare molecules using unique identifiers like the InChIKey - RDKit is the best package to convert SMILES.
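
For example, something along these lines (a rough sketch with RDKit; caffeine is just an illustrative molecule):

```python
from rdkit import Chem

# Two different SMILES spellings of the same molecule (caffeine)
smiles_a = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
smiles_b = "Cn1cnc2c1c(=O)n(C)c(=O)n2C"

for smi in (smiles_a, smiles_b):
    mol = Chem.MolFromSmiles(smi)
    print(Chem.MolToSmiles(mol))    # canonical SMILES: identical for both inputs
    print(Chem.MolToInchiKey(mol))  # InChIKey: identical for both inputs
```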

GemmaTuron commented 9 months ago

make sure to create a public repository so we can see it :)

leilayesufu commented 9 months ago

Good day @GemmaTuron https://github.com/leilayesufu/hERG_Datasets/tree/main

I get some warnings when running the notebook though, I'm trying to rectify that

GemmaTuron commented 9 months ago

Good start @leilayesufu !

I suggest:

leilayesufu commented 9 months ago

I've successfully done the first three points. https://github.com/leilayesufu/hERG_Datasets/tree/main

GemmaTuron commented 9 months ago

Great @leilayesufu can you give some numbers on the dataset comparison?

leilayesufu commented 9 months ago

Good morning @GemmaTuron

I combined all the datasets so that each Ersilia model has only one CSV dataset, and I also added a column for their InChIKeys.

Then I compared the datasets across all the models based on their InChIKeys and printed the duplicate entries across all datasets in this Jupyter notebook: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb

I then combined the 4 models' datasets together and removed all the duplicates here: https://github.com/leilayesufu/hERG_Datasets/blob/main/data/combined_dataset_eos.csv

I used ChEMBL and PubChem to create a test dataset: https://github.com/leilayesufu/hERG_Datasets/blob/main/data/test_dataset.csv

I compared the combined Ersilia datasets and the test dataset from ChEMBL and removed all the matching molecules so we can have an unbiased test set, as seen here: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_eos_and%20test.ipynb

Our test set has no molecules in common with the combined Ersilia dataset.
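
For reference, the de-duplication and train/test comparison follow roughly this pattern (a simplified sketch; I'm assuming an 'inchikey' column in both files):

```python
import pandas as pd

# Combined Ersilia training data and the candidate test set
train = pd.read_csv("../data/combined_dataset_eos.csv")
test = pd.read_csv("../data/test_dataset.csv")

# Drop duplicate molecules within each set, based on the InChIKey
train = train.drop_duplicates(subset="inchikey")
test = test.drop_duplicates(subset="inchikey")

# Keep only test molecules whose InChIKey is not in the training data
unbiased_test = test[~test["inchikey"].isin(train["inchikey"])]
print(len(test) - len(unbiased_test), "overlapping molecules removed")
```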

GemmaTuron commented 9 months ago

Hi @leilayesufu Good work, please provide here a report of how many molecules were duplicated within datasets, as well as where exactly you got the training sets from. Which files did you download from ChEMBL, and how many were redundant with our training sets? How many actives and inactives do we have in our test set? Were they all binary results, or did you apply a threshold?

leilayesufu commented 9 months ago

From the notebook, in cell 2, it can be seen that the number of molecules duplicated within datasets is 49319. The training sets were obtained from the original models' GitHub repositories. This is the test dataset I downloaded from ChEMBL: DOWNLOAD-roXaq-GSJT-Xw3YZ_HzRIzek-9lQbvPnOrDa-suZDkM=.csv

Most of them didn't have an activity column, so I did some digging into hERG models and their datasets on GitHub to find hERG compounds with activity labels. This model had hERG compounds and their activity, so I merged the training_set.csv and validation_set.csv, compared the result with the combined Ersilia dataset, and removed the molecules that were already in the combined_eos_dataset: https://github.com/ncats/herg-ml/tree/master/data/train_valid

GemmaTuron commented 9 months ago

Hi @leilayesufu !

Sorry, maybe I did not explain myself well enough. It would be interesting to know the overlap between the different models (for example, are all molecules from eos4tcc and eos43at repeated whereas none is repeated with eos2ta5?). This will help in understanding the results we get in the validation. Could you modify your code to obtain these numbers?

I think the test dataset is a good find, thanks! It would also be interesting to have a more refined understanding of the data - are there no duplicates even within the ChEMBL data? Also, it would be helpful to know the proportion of active and inactive (0 and 1).

Once this is complete, it should be just a matter of getting the different model predictions and comparing the results between models

leilayesufu commented 9 months ago

@GemmaTuron I've updated the code to do that. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb

Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos30f3/eos30f3.csv based on InChIKey: 7575 molecules
Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos4tcc/eos4tcc.csv based on InChIKey: 13515 molecules
Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos43at/eos43at.csv based on InChIKey: 2737 molecules
Overlap between ../data/eos30f3/eos30f3.csv and ../data/eos4tcc/eos4tcc.csv based on InChIKey: 8533 molecules
Overlap between ../data/eos30f3/eos30f3.csv and ../data/eos43at/eos43at.csv based on InChIKey: 2391 molecules
Overlap between ../data/eos4tcc/eos4tcc.csv and ../data/eos43at/eos43at.csv based on InChIKey: 5543 molecules

I also created a python script to check the number of 1s and 0s in the test dataset.

Number of 1s: 447
Number of 0s: 2582
Ratio of 1s: 14.76%
Ratio of 0s: 85.24%

GemmaTuron commented 9 months ago

perfect, thanks @leilayesufu ! Now let's see what model performances you get using our test dataset and the 4 models from Ersilia. I suggest saving the results in /data and then creating a notebook to evaluate the results (AUROC and contingency tables are probably the best?)
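
For example, something like this (a sketch with scikit-learn; the file and column names are just placeholders):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix

# 'activity' = experimental 0/1 label, 'prediction' = model output probability
df = pd.read_csv("../data/predictions_eos2ta5.csv")

auroc = roc_auc_score(df["activity"], df["prediction"])
# Contingency table at a 0.5 probability cut-off
table = confusion_matrix(df["activity"], df["prediction"] >= 0.5)

print("AUROC:", round(auroc, 3))
print(table)  # rows: true 0/1, columns: predicted 0/1
```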

leilayesufu commented 9 months ago

Good morning @GemmaTuron I have done this so far. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/evaluate_model.ipynb

The test dataset contains 3029 compounds and it took a long time to process, so I extracted 100 compounds, with 50 1s and 50 0s, to create a new test dataset.
https://github.com/leilayesufu/hERG_Datasets/blob/main/data/updated_test_dataset.csv
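
For reference, a rough sketch of how the balanced subset can be drawn (assuming the label column is called 'activity'):

```python
import pandas as pd

df = pd.read_csv("../data/test_dataset.csv")

# Draw 50 positives (1) and 50 negatives (0), with a fixed seed for reproducibility
subset = pd.concat([
    df[df["activity"] == 1].sample(n=50, random_state=42),
    df[df["activity"] == 0].sample(n=50, random_state=42),
])
subset.to_csv("../data/updated_test_dataset.csv", index=False)
```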

DhanshreeA commented 9 months ago

Hi @leilayesufu these are the next tasks you would need to do before we conclude the validation of the hERG models and can move them into the designated repository:

  1. Find and understand the experimental cutoffs used within each of the models that is part of this validation. Generally, just reading the README within the model repository should suffice. You are looking for something called pIC50 (see https://mscreen.lsi.umich.edu/mscreenwiki/index.php?title=What_is_pIC50_or_pAC50%3F); it could be a different metric as well, but I have generally seen this metric being used for determining the binary outcome within these hERG models. This is known as the use of domain expertise in machine learning: external knowledge is used to determine what the class label should be for a given sample.

  2. Get the percentage of overlap. Do this in two ways so that we are thorough. First, get the InChIKeys common to ALL the models' datasets. For example, if you have 100 molecules common between all the models (model a, model b, model c), and model a has 1500 molecules, model b has 1000 molecules, and model c has 200 molecules in their training datasets respectively, then the percentage of overlap between all three of them would be 100/1500, 100/1000, and 100/200 respectively. The other way I want you to do it is pairwise: for example, the percentage overlap between the model a and model b datasets, between model b and model c, and so on (see the sketch at the end of this comment).

  3. For the same set of test molecules (the test data that you have from pubmed), make a scatter plot of all the results you get from each of the models. You can look at this tutorial for doing that: https://realpython.com/visualizing-python-plt-scatter/

  4. AFTER you are done with the three points above, you can move your analysis and your results into the repository that Gemma created and posted in the Slack channel.

Let me know how it's going and if you have any questions.
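
For point 2, something along these lines should work (a rough sketch; I'm assuming each processed CSV has an 'inchikey' column, and the pairwise percentage here is taken relative to the smaller dataset of each pair - pick whichever convention you prefer):

```python
import pandas as pd
from itertools import combinations

paths = {
    "eos2ta5": "../data/eos2ta5/eos2ta5.csv",
    "eos30f3": "../data/eos30f3/eos30f3.csv",
    "eos4tcc": "../data/eos4tcc/eos4tcc.csv",
    "eos43at": "../data/eos43at/eos43at.csv",
}
keys = {name: set(pd.read_csv(path)["inchikey"]) for name, path in paths.items()}

# Molecules shared by ALL models, as a percentage of each model's own dataset
common_all = set.intersection(*keys.values())
for name, k in keys.items():
    print(f"{name}: {100 * len(common_all) / len(k):.2f}% of its molecules are shared by all models")

# Pairwise overlap, relative to the smaller dataset of each pair
for a, b in combinations(keys, 2):
    shared = keys[a] & keys[b]
    pct = 100 * len(shared) / min(len(keys[a]), len(keys[b]))
    print(f"{a} vs {b}: {pct:.2f}%")
```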

leilayesufu commented 9 months ago

Good afternoon.

For the eos2ta5 model, the cutoff for hERG blockade in the classifier was set at IC50 < 10 μM. In the case of eos30f3, the authors also opted for a 10 μM cutoff for hERG blocking activity. Similarly, eos4tcc used an IC50 threshold of 10 μM for classification into positives and negatives. For eos30gr, the experimental cutoff used in the training dataset was 80% inhibition.
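
For context, a small sketch of how a 10 μM cut-off translates into binary labels (the IC50 values below are made up):

```python
import numpy as np

ic50_um = np.array([0.5, 8.0, 12.0, 150.0])   # made-up IC50 values, in micromolar

# 10 uM cut-off: IC50 < 10 uM -> hERG blocker (1), otherwise non-blocker (0)
labels = (ic50_um < 10).astype(int)

# Equivalent on the pIC50 scale: pIC50 = -log10(IC50 in mol/L); 10 uM corresponds to pIC50 = 5
pic50 = -np.log10(ic50_um * 1e-6)
labels_from_pic50 = (pic50 > 5).astype(int)

print(labels, labels_from_pic50)  # both give [1 1 0 0]
```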

Regarding the overlap between models, the percentage overlap for each model with all others was calculated as follows:

Percentage overlap for ../data/eos2ta5/eos2ta5.csv with all models: 2.80%
Percentage overlap for ../data/eos30f3/eos30f3.csv with all models: 4.77%
Percentage overlap for ../data/eos4tcc/eos4tcc.csv with all models: 0.12%
Percentage overlap for ../data/eos43at/eos43at.csv with all models: 2.28%

Additionally, the percentage overlap between specific models was determined:

Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos30f3/eos30f3.csv: 89.61%
Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos4tcc/eos4tcc.csv: 87.00%
Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos43at/eos43at.csv: 6.22%
Percentage overlap between ../data/eos30f3/eos30f3.csv and ../data/eos4tcc/eos4tcc.csv: 90.05%
Percentage overlap between ../data/eos30f3/eos30f3.csv and ../data/eos43at/eos43at.csv: 5.43%
Percentage overlap between ../data/eos4tcc/eos4tcc.csv and ../data/eos43at/eos43at.csv: 9.41%

This can be seen in the notebook below: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb

3) Furthermore, scatter plots based on accuracy, precision, and F1 score for the five different models have been created. These visualizations can be accessed in the notebook: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/evaluate_model.ipynb

GemmaTuron commented 9 months ago

Hi @leilayesufu

That is helpful, thanks for the tips @DhanshreeA. The training sets of eos30f3, eos2ta5 and eos4tcc are practically the same, so hopefully the performances of these models should be the same as well. Cut-offs are all at 10 uM for IC50, which is good.

leilayesufu commented 9 months ago

For the NCATS test set, based on an excerpt from the article:

"Because it is hard to determine the hERG liability of a drug based on the IC50 value alone, without the knowledge of the peak serum concentration unbound to plasma proteins, we use both the activity thresholds (1 and 10 μM) to classify the drugs as blockers and non-blockers."

The scatter plots in the matrix below show the pairwise comparisons of probability values between different models for each molecule. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/scatter_plots.ipynb

GemmaTuron commented 9 months ago

Hi @leilayesufu !

Thanks for these plots. In order to see them better, it would be easier if they were square. If you collate your results in a single pandas DataFrame, where each column corresponds to the results of one model, you can probably use df.corr() to build a correlation matrix and then plot it with matplotlib automatically - like here: https://stackoverflow.com/questions/57306089/how-to-plot-correlation-matrix-with-python-like-in-r-libraryperformanceanalyti
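
Roughly something like this (just a sketch; the predictions file and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# One column of predicted probabilities per model, one row per test molecule
preds = pd.read_csv("../data/model_predictions.csv")

corr = preds.corr()

fig, ax = plt.subplots(figsize=(5, 5))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```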

hope this is helpful!

leilayesufu commented 9 months ago

Good morning. I've successfully completed the task of comparing the models and generating a correlation matrix. You can find the notebook with the analysis here.

I utilized pandas to concatenate the results of each model into a single DataFrame and then calculated the correlation matrix using df.corr(). The correlation matrix is visualized as a heatmap using matplotlib for ease of interpretation.

GemmaTuron commented 9 months ago

Hi @leilayesufu

That is a good start, but I was referring to a correlation matrix with scatter plots, like shown in the link I shared above. If you are unsure about how to build it, I am leaving here a link to a video tutorial as well: https://www.google.com/search?q=correlation+matrix+with+scatter+plots+python&sca_esv=596768218&ei=oAOdZazOKtOmkdUPod-R2Ak&ved=0ahUKEwisnuvE8c-DAxVTU6QEHaFvBJsQ4dUDCBA&uact=5&oq=correlation+matrix+with+scatter+plots+python&gs_lp=Egxnd3Mtd2l6LXNlcnAiLGNvcnJlbGF0aW9uIG1hdHJpeCB3aXRoIHNjYXR0ZXIgcGxvdHMgcHl0aG9uMgYQABgWGB4yCxAAGIAEGIoFGIYDMgsQABiABBiKBRiGAzILEAAYgAQYigUYhgMyCxAAGIAEGIoFGIYDSK0MUPcDWLIKcAF4AZABAJgBpAGgAdkGqgEDMS42uAEDyAEA-AEBwgIKEAAYRxjWBBiwA-IDBBgAIEGIBgGQBgg&sclient=gws-wiz-serp#fpstate=ive&vld=cid:a588ad4d,vid:oNiURDHh3uk,st:27
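
In pandas this is essentially a single call (again a sketch, with a placeholder predictions file):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

preds = pd.read_csv("../data/model_predictions.csv")  # one probability column per model

# Pairwise scatter plots off the diagonal, histograms on the diagonal
scatter_matrix(preds, figsize=(8, 8), diagonal="hist", alpha=0.5)
plt.tight_layout()
plt.show()
```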

leilayesufu commented 8 months ago

Thank you @GemmaTuron , it has been updated

GemmaTuron commented 8 months ago

Thanks @leilayesufu

What would be your conclusions based on this? I think you can also start working on cleaning up the code and adding it to the model-validations repository - make sure to create a /data folder and point all the notebook paths to that folder. Thanks!

leilayesufu commented 8 months ago

It is observed that a higher percentage of overlap between the models' training datasets corresponds to a higher correlation between their predicted probabilities. This is the link to the PR: https://github.com/ersilia-os/model-validations/pull/1

GemmaTuron commented 8 months ago

Hi @leilayesufu

I've merged the PR, and together with @DhanshreeA we will do some tweaks to the code that you can follow next time - hope this is helpful for learning. For the conclusions: what I'd like to know is, in your view, are the models performing similarly? Does it have anything to do with the datasets that were used to train the models?

GemmaTuron commented 8 months ago

Hi @leilayesufu

Reviewing the code, why is there no information for the eos30gr dataset? It is available in the model repo.

GemmaTuron commented 8 months ago

Hi @leilayesufu

I have been doing some work in the repository. The most important point is that ALL manipulations done on datasets are stored somewhere. Meaning, never ever manipulate a dataset directly in Excel, for example, and upload the manipulated file, as it will be impossible to reproduce. So, I'd need you to upload the original training set files you got from each model in hERG/data/model_datasets and process them one by one in the 00_data_processing.ipynb notebook. We want to have one dataset per model that contains only smiles, inchikey, activity as columns. I have left examples using the data from model eos30gr and from the NCATS data, and there is an empty cell with a placeholder for each training set.

Please start there and we will continue from it once this is done
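
For each placeholder cell, the processing should look roughly like this (the raw file name and original column names are illustrative and will differ per model):

```python
import pandas as pd
from rdkit import Chem

# Raw training file as downloaded from the model repo (name and columns are illustrative)
raw = pd.read_csv("model_datasets/eos2ta5/train_validation_cardio_tox_data.csv")

def to_inchikey(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol) if mol is not None else None

processed = pd.DataFrame({
    "smiles": raw["smiles"],
    "inchikey": raw["smiles"].apply(to_inchikey),
    "activity": raw["ACTIVITY"],  # rename whatever the original label column is to 'activity'
})
processed = processed.dropna(subset=["inchikey"]).drop_duplicates(subset="inchikey")
processed.to_csv("model_datasets/eos2ta5_processed.csv", index=False)
```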

leilayesufu commented 8 months ago

Good day. @GemmaTuron I've completed the initial data processing tasks for the training sets, as per your instructions. However, I observed that some datasets are missing the "activity" column. I've noted those instances for further investigation.

such as the pretraining datasets of eos4tcc (https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Pretraining) and the datasets of eos43at (https://github.com/leilayesufu/model-validations/blob/main/hERG/data/model_datasets/eos43at/CHEMBL3039491.csv).

GemmaTuron commented 8 months ago

Hi @leilayesufu Thanks for the work. Some comments to continue:

What we want are the datasets that were actually used to train each model, so anything that is external validation (for example, for eos2ta5) we don't care about, as the model will not have seen it.

For model eos4tcc, there is a two-step training: first, the model is pre-trained with a general SMILES list (without any associated activity) and then fine-tuned with hERG data - so we only want the test, train, and val files of the Finetuning data folder.

Where is the dataset for eos30f3 coming from? The "Cai_TableS3_fixed.csv" - I cannot find it in the original model repo.

For eos43at, where do you get the data from? I think you are mixing different assays. In the original publication, what I find reported is: "hERG Potassium Channel Inhibition. hERG inhibition is associated with the prolongation of the cardiac QT interval, which may lead to cardiac conditions such as arrhythmia. For this endpoint, data compiled by Sato et al. was used, among which 6993 compounds with reported activity (IC50 values) in the nanomolar range were selected. IC50 values were transformed into the pIC50 scale for numerical stability during model training." So the dataset should be taken from this publication: Sato, T.; Yuki, H.; Ogura, K.; Honma, T. Construction of an Integrated Database for hERG Blocking Small Molecules. PLoS One 2018, 13, No. e0199348.

I hope this helps, you did a good job to start with, so let's continue on this line!

leilayesufu commented 8 months ago

Hi @GemmaTuron

For the eos30f3.

This model leverages the ChemProp network (D-MPNN, see original Stokes et al, Cell, 2020 for more information) to build a predictor of hERG-mediated cardiotoxicity. The model has been trained using a dataset published by Cai et al, J Chem Inf Model, 2019, which contains 7889 molecules with several cut-offs for hERG blocking activity. The authors select a 10 uM cut-off. This implementation of the model does not use any specific featurizer, though the authors suggest the moe206 descriptors (closed-source) improve performance even further.

This is the location in the original model repo. https://github.com/AI-amateur/DMPNN-hERG/blob/main/Figure2/Cai_TableS3_fixed.csv

For eos43at, there are no datasets available in the repo, but I downloaded them using these commands as shown in the README.md:

wget https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/501185/data.tar.gz -O molgrad/data.tar.gz
tar -xf molgrad/data.tar.gz -C molgrad/

Those are the files I downloaded and used.

leilayesufu commented 8 months ago

Hello @GemmaTuron ,

I hope this message finds you well. Following my recent meeting with @DhanshreeA, I gained insights into the eos43at dataset. While attempting to download the dataset from their repository, I realized that it encompassed various assays, not solely hERG-related ones, as you had mentioned. Consequently, I referred to the publication by Sato et al. titled "Construction of an Integrated Database for hERG Blocking Small Molecules" (PLoS One 2018, 13, No. e0199348), available here.

Although direct access to the datasets from the publication wasn't feasible, we have been directed to a valuable database: https://drugdesign.riken.jp/hERGdb/.

GemmaTuron commented 8 months ago

Hi @leilayesufu,

If you have a look at the publication and database, I don't see any link to actually download the data, and the authors of MolGrad specify they used only a subset of 6000 molecules but do not share how they selected that subset. In light of this, let's proceed without the eos43at model. Please make sure to fill in the placeholder cells in the data processing notebook (I've left cells with # comments to be completed) before moving on to the actual comparisons. Once we are happy with the dataset for testing, we will need to predict the values for the test dataset again with the different models (not including eos43at) and evaluate the results from there.

leilayesufu commented 8 months ago

Hi @GemmaTuron I have created a pull request.

GemmaTuron commented 8 months ago

Hi @leilayesufu

It looks good, but I have noticed that for the comparison of test and train data you are comparing the datasets on the SMILES column, not the InChIKey column. Remember that SMILES can be different yet represent the same molecule, so make sure to change that to InChIKeys - with that dataset we can run the predictions again and update the data.

Next would be to take the other notebooks and compress them into one (I think one would suffice), with the comparison of model performances as a scatter plot, and we will be done with this!

leilayesufu commented 8 months ago

@GemmaTuron I've created a pull request with the changes

GemmaTuron commented 8 months ago

Hi @leilayesufu

Thanks, a few comments!

Can you summarise the results here? What are your thoughts?

leilayesufu commented 8 months ago

Good morning @GemmaTuron I have updated and created a new PR using InChIKeys for the comparison.

In the model datasets folder, the eos2ta5 folder contains the raw training datasets of the model before processing. eos2ta5_processed.csv is the processed dataset of eos2ta5, while training_set.csv is the combined training dataset of the models.

leilayesufu commented 8 months ago

@GemmaTuron Good morning. These are the predictions obtained from eos4tcc: prediction_eos4tcc.csv. Upon further inspection, I noticed the predicted values fall roughly within the range of 0.1 to 0.8, which is why the eos4tcc plot has that shape.

[image: distribution of eos4tcc predictions]

I fetched the eos4tcc model again and tested it on a SMILES dataset I have locally (not hERG), and I got the following results: eos4tcc_test.csv

[image: eos4tcc predictions on the local non-hERG dataset]

Those are the results based on the test dataset we use for the validation

leilayesufu commented 8 months ago

Good morning, I have started work on the DILI model validation. I have processed the training data of the two DILI models, divided them into smiles|inchikey|activity columns, and obtained the training set. However, while processing the test data to use for predictions, I found that the TDC DILI dataset has 473 of 475 InChIKeys in common with the training set, so I am looking for other DILI datasets we can use for testing. I have created a PR with the updates.

https://github.com/ersilia-os/model-validations/pull/8

miquelduranfrigola commented 8 months ago

Hey @leilayesufu - since this issue thread is related to hERG, can we move the DILI conversation elsewhere? Thanks!

GemmaTuron commented 8 months ago

I will actually close this issue