Closed: GemmaTuron closed this issue 8 months ago
Hi @leilayesufu !
Following today's discussion, and given that you worked on the last hERG model, I think this can be a good continuation of that work. Have a look and let me know.
Thank you @GemmaTuron I'll start by identifying the hERG models in the Ersilia Model Hub.
Hey @leilayesufu an easy way to go about doing that is to visit the Model Hub page on the Ersilia website https://www.ersilia.io/model-hub and click on the hERG tag. That shows all the hERG models currently incorporated within the hub. From what I can see now, there are five such models.
Great @leilayesufu
Steps now:
Let me know for point 1 how many datasets you can find
1) eos2ta5 Datasets used: Training Dataset: https://github.com/Abdulk084/CardioTox/blob/master/data/train_validation_cardio_tox_data.tar.xz The positively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_pos.csv The negatively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_neg.csv Relatively larger negatively biased test set: https://github.com/Abdulk084/CardioTox/blob/master/data/external_test_set_new.csv
2) eos4tcc Datasets used: Pretraining datasets: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Pretraining Finetuning Datasets: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Finetuning External datasets the model was validated upon: https://github.com/GIST-CSBL/BayeshERG/tree/main/data/External
3) eos30gr Datasets used: checkpoints unavailable; the model was retrained.
4) eos30f3 Datasets used: Training dataset: https://github.com/AI-amateur/DMPNN-hERG/blob/main/Figure2/Cai_TableS3_fixed.csv
5) eos43at The datasets in the original model were available for download, so I downloaded them here: https://github.com/leilayesufu/log_files/tree/master/dataset
Hi @leilayesufu
That's great. Can you create a small repo with the following structure? /data /notebooks
In /data you can add all the datasets (maybe use the eosID for each one so we know what we are talking about), and in /notebooks open a Jupyter notebook where you load and compare the different datasets - let's see whether there are repeated SMILES or not. Remember to compare molecules using unique identifiers like the InChIKey; RDKit is the best package for converting SMILES.
make sure to create a public repository so we can see it :)
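A minimal sketch of the SMILES-to-InChIKey conversion with RDKit (the helper name is illustrative; invalid SMILES are mapped to None):

```python
from rdkit import Chem

def smiles_to_inchikey(smiles: str):
    """Convert a SMILES string to an InChIKey; returns None for unparsable input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit could not parse the SMILES
        return None
    return Chem.MolToInchiKey(mol)
```

Comparing on InChIKeys rather than raw SMILES strings catches cases where two different SMILES encode the same molecule.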
Good day @GemmaTuron https://github.com/leilayesufu/hERG_Datasets/tree/main
I get some warnings when running the notebook, though; I'm trying to rectify that.
Good start @leilayesufu !
I suggest:
I've successfully done the first three points. https://github.com/leilayesufu/hERG_Datasets/tree/main
Great @leilayesufu can you give some numbers on the dataset comparison?
Good morning @GemmaTuron
I combined all the datasets so that each Ersilia model has a single CSV dataset, and I also added a column with the InChIKeys.
Then I compared the datasets across all the models based on their InChIKeys and printed the duplicate entries across all datasets in this Jupyter notebook: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb
I then combined the four models' datasets and removed all the duplicates here: https://github.com/leilayesufu/hERG_Datasets/blob/main/data/combined_dataset_eos.csv
I used ChEMBL and PubChem to create a test dataset: https://github.com/leilayesufu/hERG_Datasets/blob/main/data/test_dataset.csv
I compared the combined Ersilia dataset with the ChEMBL test dataset and removed all the matching molecules so we have an unbiased test set, as seen here: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_eos_and%20test.ipynb
Our test set shares no molecules with the combined Ersilia dataset.
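The combine-and-deduplicate step described above can be sketched with pandas (toy frames stand in for the real per-model CSVs; column names are assumptions):

```python
import pandas as pd

# Toy frames standing in for the per-model datasets (column names assumed)
frames = [
    pd.DataFrame({"inchikey": ["AAA", "BBB"], "activity": [1, 0]}),
    pd.DataFrame({"inchikey": ["BBB", "CCC"], "activity": [0, 1]}),
]

# Concatenate and keep one row per unique molecule, identified by InChIKey
combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="inchikey")
```

`drop_duplicates(subset="inchikey")` keeps the first occurrence of each molecule, so the order in which the per-model files are concatenated decides which activity value survives a conflict.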
Hi @leilayesufu Good work! Please provide here a report of how many molecules were duplicated within the datasets, as well as where exactly you got the training sets from. Which files did you download from ChEMBL, and how many were redundant with our training sets? How many actives and inactives do we have in our test set? Were they all binary results, or did you apply a threshold?
From the notebook, in cell 2, it can be seen that the number of molecules duplicated within the datasets is 49319. The training sets were taken from the original models' GitHub repositories. This is the test dataset I downloaded from ChEMBL: DOWNLOAD-roXaq-GSJT-Xw3YZ_HzRIzek-9lQbvPnOrDa-suZDkM=.csv
Most entries didn't have an activity column, so I did some digging into hERG models on GitHub and their datasets to find hERG compounds with activity values. This model had hERG compounds and their activity, so I merged training_set.csv and validation_set.csv, compared the result with the combined Ersilia dataset, and removed the molecules that were already in combined_eos_dataset: https://github.com/ncats/herg-ml/tree/master/data/train_valid
Hi @leilayesufu !
Sorry, maybe I did not explain myself well enough. It would be interesting to know the overlap between the different models (for example, are all molecules from eos4tcc and eos43at repeated, whereas none is repeated with eos2ta5?). This will help in understanding the results we get in the validation. Could you modify your code to obtain these numbers?
I think the test dataset is a good find, thanks! It would also be interesting to have a more refined understanding of the data - are there no duplicates even within the ChEMBL data? Also, it would be helpful to know the proportion of actives and inactives (1s and 0s).
Once this is complete, it should be just a matter of getting the different model predictions and comparing the results between models
@GemmaTuron I've updated the code to do that. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb
Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos30f3/eos30f3.csv based on InChIKey: 7575 molecules
Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos4tcc/eos4tcc.csv based on InChIKey: 13515 molecules
Overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos43at/eos43at.csv based on InChIKey: 2737 molecules
Overlap between ../data/eos30f3/eos30f3.csv and ../data/eos4tcc/eos4tcc.csv based on InChIKey: 8533 molecules
Overlap between ../data/eos30f3/eos30f3.csv and ../data/eos43at/eos43at.csv based on InChIKey: 2391 molecules
Overlap between ../data/eos4tcc/eos4tcc.csv and ../data/eos43at/eos43at.csv based on InChIKey: 5543 molecules
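Pairwise counts like the ones above can be reproduced with plain set intersections (toy InChIKey sets stand in for the real datasets):

```python
from itertools import combinations

# Toy InChIKey sets standing in for the real per-model datasets
inchikeys = {
    "eos2ta5": {"AAA", "BBB", "CCC"},
    "eos30f3": {"BBB", "CCC", "DDD"},
    "eos4tcc": {"CCC", "EEE"},
}

# Count molecules shared by each pair of models
overlaps = {}
for a, b in combinations(inchikeys, 2):
    overlaps[(a, b)] = len(inchikeys[a] & inchikeys[b])
```

Building a Python set per model first makes each pairwise intersection effectively linear in the smaller set's size.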
I also created a python script to check the number of 1s and 0s in the test dataset.
Number of 1s: 447
Number of 0s: 2582
Ratio of 1s: 14.76%
Ratio of 0s: 85.24%
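The class-balance script can be as short as a `value_counts` call; this sketch recomputes the numbers reported above from the stated counts:

```python
import pandas as pd

# Recomputing the reported class balance (447 actives, 2582 inactives)
activity = pd.Series([1] * 447 + [0] * 2582)
counts = activity.value_counts()
ratios = (activity.value_counts(normalize=True) * 100).round(2)
```

In the real notebook, `activity` would be the activity column of the test dataset rather than a synthetic series.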
perfect, thanks @leilayesufu ! Now let's see what model performances you get using our test dataset and the 4 models from Ersilia. I suggest saving the results in /data and then creating a notebook to evaluate the results (AUROC, and contingency tables, probably the best?)
Good morning @GemmaTuron I have done this so far. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/evaluate_model.ipynb
The test dataset contains 3029 compounds and took a long time to process, so I extracted 100 compounds (50 1s and 50 0s) to create a new test dataset.
https://github.com/leilayesufu/hERG_Datasets/blob/main/data/updated_test_dataset.csv
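The evaluation Gemma suggests (AUROC plus a contingency table) can be sketched with scikit-learn; the labels, probabilities, and 0.5 threshold below are illustrative:

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

# Toy true labels and predicted probabilities (real values would come from /data)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# AUROC is threshold-free: it uses the raw probabilities
auroc = roc_auc_score(y_true, y_prob)

# The contingency (confusion) table needs a hard threshold on the probabilities
y_pred = [int(p >= 0.5) for p in y_prob]
table = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
```

Reporting both is useful: AUROC ranks models independently of any cutoff, while the contingency table shows the actual error types at the chosen operating point.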
Hi @leilayesufu, here is the next set of tasks you would need to do before we conclude the validation of the hERG models and can move them into the designated repository:
Find and understand the experimental cutoffs used within each of the models that is part of this validation. Generally, just reading the README within the model repository should suffice. You are looking for something called pIC50 (see https://mscreen.lsi.umich.edu/mscreenwiki/index.php?title=What_is_pIC50_or_pAC50%3F). It could be a different metric as well, but I have generally seen this metric used for determining the binary outcome within these hERG models. This is known as the use of domain expertise in machine learning, where you use external knowledge to determine what the class label should be for a given sample.
Get the percentage of overlap. Do this in two ways so that we are thorough: First, get the InChIKeys common to ALL the models' datasets. For example, if you have 100 molecules common to all the models (model a, model b, model c), and model a has 1500 molecules, model b has 1000 molecules, and model c has 200 molecules in their training datasets, then the percentages of overlap for the three of them would be 100/1500, 100/1000, and 100/200 respectively. The other way I want you to do it is pairwise: for example, the percentage overlap between the model a and model b datasets, between model b and model c, and so on...
For the same set of test molecules (the test data that you have from pubmed), make a scatter plot of all the results you get from each of the models. You can look at this tutorial for doing that: https://realpython.com/visualizing-python-plt-scatter/
AFTER you are done with the three points above, you can move your analysis and your results in the repository that Gemma created and posted in the slack channel.
Let me know how it's going and if you have any questions.
Good afternoon.
For the eos2ta5 model, the cutoff for hERG blockade in the classifier was set at IC50 < 10 μM. In the case of eos30f3, the authors also established a cutoff of 10 μM for hERG blocking activity. Similarly, eos4tcc used an IC50 threshold of 10 μM for classification into positives and negatives. For eos30gr, the experimental cutoff used in the training dataset was 80% inhibition.
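These IC50 cutoffs amount to a simple thresholding rule; a minimal sketch (function name and unit convention are assumptions):

```python
def herg_label(ic50_um: float, cutoff_um: float = 10.0) -> int:
    """Binary hERG-blocker label: 1 (blocker) if IC50 is below the cutoff, else 0.

    A lower IC50 means less compound is needed to block the channel,
    i.e. stronger inhibition.
    """
    return int(ic50_um < cutoff_um)
```

The same rule with a different `cutoff_um` covers models that use a stricter threshold (e.g. 1 μM).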
Regarding the overlap between models, the percentage overlap for each model with all others was calculated as follows:
Percentage overlap for ../data/eos2ta5/eos2ta5.csv with all models: 2.80%
Percentage overlap for ../data/eos30f3/eos30f3.csv with all models: 4.77%
Percentage overlap for ../data/eos4tcc/eos4tcc.csv with all models: 0.12%
Percentage overlap for ../data/eos43at/eos43at.csv with all models: 2.28%
Additionally, the percentage overlap between specific models was determined:
Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos30f3/eos30f3.csv: 89.61%
Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos4tcc/eos4tcc.csv: 87.00%
Percentage overlap between ../data/eos2ta5/eos2ta5.csv and ../data/eos43at/eos43at.csv: 6.22%
Percentage overlap between ../data/eos30f3/eos30f3.csv and ../data/eos4tcc/eos4tcc.csv: 90.05%
Percentage overlap between ../data/eos30f3/eos30f3.csv and ../data/eos43at/eos43at.csv: 5.43%
Percentage overlap between ../data/eos4tcc/eos4tcc.csv and ../data/eos43at/eos43at.csv: 9.41%
This can be seen in the notebook below: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/compare_datasets.ipynb
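Both overlap measures (common-to-all and pairwise) can be sketched with set operations; the sets and the pairwise denominator below are illustrative, since the thread does not state which denominator was used:

```python
from functools import reduce
from itertools import combinations

# Toy InChIKey sets standing in for the real per-model training datasets
datasets = {
    "model_a": {"AAA", "BBB", "CCC", "DDD"},
    "model_b": {"BBB", "CCC"},
    "model_c": {"CCC", "DDD", "EEE"},
}

# Molecules common to ALL models, as a percentage of each model's dataset
common = reduce(set.intersection, datasets.values())
pct_all = {name: 100 * len(common) / len(keys) for name, keys in datasets.items()}

# Pairwise percentage overlap, here relative to the smaller dataset of each pair
pct_pair = {
    (a, b): 100 * len(datasets[a] & datasets[b]) / min(len(datasets[a]), len(datasets[b]))
    for a, b in combinations(datasets, 2)
}
```

Whatever denominator is chosen (smaller set, union, or a fixed reference), it should be stated alongside the percentages so the numbers are reproducible.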
3) Furthermore, scatter plots based on accuracy, precision, and F1 score for the five different models have been created. These visualizations can be accessed in the notebook: https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/evaluate_model.ipynb
Hi @leilayesufu
That is helpful, thanks for the tips @DhanshreeA . The training sets of eos30f3, eos2ta5 and eos4tcc are practically the same, so hopefully the performances of these models should be the same as well. Cut-offs are all at 10 uM for IC50, which is good.
For the NCATS testing set, based on an excerpt from the article:
Because it is hard to determine the hERG liability of a drug based on the IC50 value alone,41 without the knowledge of the peak serum concentration unbound to plasma proteins, we use both the activity thresholds (1 and 10 μM) to classify the drugs as blockers and non-blockers.
The scatter plots in the matrix below show the pairwise comparisons of probability values between different models for each molecule. https://github.com/leilayesufu/hERG_Datasets/blob/main/notebooks/scatter_plots.ipynb
Hi @leilayesufu !
Thanks for these plots. To see them better, it would be easier if they were square. If you collate your results in a single pandas DataFrame, where each column corresponds to the results of one model, you can probably use df.corr() to build a correlation matrix and then plot it with matplotlib automatically - like here: https://stackoverflow.com/questions/57306089/how-to-plot-correlation-matrix-with-python-like-in-r-libraryperformanceanalyti
hope this is helpful!
Good morning. I've successfully completed the task of comparing the models and generating a correlation matrix. You can find the notebook with the analysis here.
I utilized pandas to concatenate the results of each model into a single DataFrame and then calculated the correlation matrix using df.corr(). The correlation matrix is visualized as a heatmap using matplotlib for ease of interpretation.
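That pipeline might look like the following sketch (toy prediction columns; the Agg backend is set so the figure renders headlessly outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy prediction columns standing in for each model's collated results
preds = pd.DataFrame({
    "eos2ta5": [0.1, 0.5, 0.9, 0.3],
    "eos30f3": [0.2, 0.6, 0.8, 0.4],
    "eos4tcc": [0.9, 0.4, 0.2, 0.7],
})

# Pearson correlation between every pair of model prediction columns
corr = preds.corr()

# Visualise the matrix as a heatmap with labelled axes
fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```

Pinning the colour scale to [-1, 1] keeps heatmaps comparable across runs regardless of the observed correlation range.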
Hi @leilayesufu
That is a good start, but I was referring to a correlation matrix with scatter plots, like shown in the link I shared above. If you are unsure about how to build it, I am leaving here a link to a video tutorial as well: https://www.google.com/search?q=correlation+matrix+with+scatter+plots+python&sca_esv=596768218&ei=oAOdZazOKtOmkdUPod-R2Ak&ved=0ahUKEwisnuvE8c-DAxVTU6QEHaFvBJsQ4dUDCBA&uact=5&oq=correlation+matrix+with+scatter+plots+python&gs_lp=Egxnd3Mtd2l6LXNlcnAiLGNvcnJlbGF0aW9uIG1hdHJpeCB3aXRoIHNjYXR0ZXIgcGxvdHMgcHl0aG9uMgYQABgWGB4yCxAAGIAEGIoFGIYDMgsQABiABBiKBRiGAzILEAAYgAQYigUYhgMyCxAAGIAEGIoFGIYDSK0MUPcDWLIKcAF4AZABAJgBpAGgAdkGqgEDMS42uAEDyAEA-AEBwgIKEAAYRxjWBBiwA-IDBBgAIEGIBgGQBgg&sclient=gws-wiz-serp#fpstate=ive&vld=cid:a588ad4d,vid:oNiURDHh3uk,st:27
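A scatter-plot correlation matrix of the kind described can be sketched with pandas' built-in `scatter_matrix` (toy prediction values; in practice these would be the collated model results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy per-model prediction columns
preds = pd.DataFrame({
    "eos2ta5": [0.1, 0.5, 0.9, 0.3],
    "eos30f3": [0.2, 0.6, 0.8, 0.4],
    "eos4tcc": [0.15, 0.4, 0.7, 0.35],
})

# One square grid: histograms on the diagonal, pairwise scatter plots off it
axes = scatter_matrix(preds, figsize=(6, 6), diagonal="hist")
```

This gives the pairwise scatter plots and the per-model distributions in a single square figure, which is what makes disagreements between models easy to spot.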
Thank you @GemmaTuron , it has been updated
Thanks @leilayesufu
What would be your conclusions based on this? I think you can also start working on cleaning up the code and adding it to the model-validations repository - make sure to create a /data folder and point all the notebook paths to that folder. Thanks!
It is observed that a higher percentage of overlap between the models' training datasets corresponds to a higher correlation between their predicted probabilities. This is the link to the PR: https://github.com/ersilia-os/model-validations/pull/1
Hi @leilayesufu
I've merged the PR and, together with @DhanshreeA, we will do some tweaks to the code so you can follow along next time - I hope this is helpful for learning. For the conclusions, what I'd like to know is: according to you, are the models performing similarly? Does it have anything to do with the datasets that were used to train the models?
Hi @leilayesufu
Revising the code, why is there no information for dataset eos30gr? It is available even in the model repo.
Hi @leilayesufu
I have been doing some work in the repository. The most important point is that ALL manipulations done on datasets must be stored somewhere. Meaning: never manipulate a dataset directly in Excel, for example, and upload the manipulated file, as that is impossible to reproduce. So, I'd need you to upload the original training set files you got from each model in hERG/data/model_datasets and process them one by one in the 00_data_processing.ipynb notebook.
We want to have a dataset per each model that contains only smiles, inchikey, activity
as columns. I have left examples using the data from model eos30gr and from the NCATS data, and there is an empty cell with a placeholder for each training set.
Please start there and we will continue from it once this is done
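A sketch of that per-model processing, with a toy raw file and an assumed 10 μM IC50 cutoff (column names are illustrative):

```python
import pandas as pd

# Toy raw training file; the real files live in hERG/data/model_datasets
raw = pd.DataFrame({
    "smiles": ["CCO", "CCN"],
    "inchikey": ["KEY-A", "KEY-B"],
    "IC50_uM": [2.0, 50.0],
    "assay": ["patch-clamp", "patch-clamp"],
})

# Derive a binary activity label and keep only the three agreed columns
processed = (
    raw.assign(activity=(raw["IC50_uM"] < 10).astype(int))
       .loc[:, ["smiles", "inchikey", "activity"]]
)
```

Doing this in the notebook (rather than by hand in a spreadsheet) keeps every manipulation reproducible, which is the point Gemma raises above.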
Good day @GemmaTuron. I've completed the initial data processing tasks for the training sets, as per your instructions. However, I observed that some datasets are missing the "activity" column, and I've noted those instances for further investigation -
such as the pretraining datasets of eos4tcc (https://github.com/GIST-CSBL/BayeshERG/tree/main/data/Pretraining) and the datasets of eos43at (https://github.com/leilayesufu/model-validations/blob/main/hERG/data/model_datasets/eos43at/CHEMBL3039491.csv).
Hi @leilayesufu Thanks for the work. Some comments to continue:
1) What we want are the datasets that were used to train each model, so anything that is external validation (for example in eos2ta5) we don't care about, as the model will not have seen it.
2) For model eos4tcc, there is a two-step training: first, the model is pre-trained with a general SMILES list (without any associated activity) and then fine-tuned with hERG data - so we only want the test, train, and val files of the Finetuning data folder.
3) Where is the dataset for eos30f3 coming from? I cannot find the "Cai_TableS3_fixed.csv" in the original model repo.
4) For eos43at, where do you get the data from? I think you are mixing different assays. In the original publication, what I find reported is: hERG Potassium Channel Inhibition. hERG inhibition is associated with the prolongation of the cardiac QT interval, which may lead to cardiac conditions such as arrhythmia.57,58 For this endpoint, data compiled by Sato et al. was used, among which 6993 compounds with reported activity (IC50 values) in the nanomolar range were selected. IC50 values were transformed into the pIC50 scale for numerical stability during model training. So the dataset should be taken from this publication: Sato, T.; Yuki, H.; Ogura, K.; Honma, T. Construction of an Integrated Database for hERG Blocking Small Molecules. PLoS One 2018, 13, No. e0199348.
I hope this helps, you did a good job to start with, so let's continue on this line!
Hi @GemmaTuron
For the eos30f3.
This model leverages the ChemProp network (D-MPNN, see original Stokes et al, Cell, 2020 for more information) to build a predictor of hERG-mediated cardiotoxicity. The model has been trained using a dataset published by Cai et al, J Chem Inf Model, 2019, which contains 7889 molecules with several cut-offs for hERG blocking activity. The authors select a 10 uM cut-off. This implementation of the model does not use any specific featurizer, though the authors suggest the moe206 descriptors (closed-source) improve performance even further.
This is the location in the original model repo. https://github.com/AI-amateur/DMPNN-hERG/blob/main/Figure2/Cai_TableS3_fixed.csv
For eos43at, there are no datasets available in the repo, but I downloaded them using the commands below, as shown in the README.md:
```shell
wget https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/501185/data.tar.gz -O molgrad/data.tar.gz
tar -xf molgrad/data.tar.gz -C molgrad/
```
Those are the files I downloaded and used.
Hello @GemmaTuron ,
I hope this message finds you well. Following my recent meeting with @DhanshreeA , I gained insights into the eos43at dataset. While attempting to download the dataset from their repository, I realized that it encompassed various assays, not solely hERG-related, as you had mentioned. Consequently, I referred to the publication by Sato et al. titled "Construction of an Integrated Database for hERG Blocking Small Molecules" (PloS One 2018, 13, No. e0199348) available here.
Although direct access to the datasets from the publication wasn't feasible, we have been directed to a valuable database: https://drugdesign.riken.jp/hERGdb/.
Hi @leilayesufu,
If you have a look at the publication and database, I don't see any link to actually download the data, and the authors of MolGrad specify they use only a subset of 6000 molecules but do not share how they selected the subset. In light of this, let's proceed without the eos43at model. Please make sure to fill in the placeholder cells in the data processing notebook (I've left cells with # comments to be completed) before moving on to the actual comparisons. Once we are happy with the dataset for testing, we will need to predict the values for the test dataset again with the different models (not including eos43at) and evaluate the results from there.
Hi @GemmaTuron I have created a pull request.
Hi @leilayesufu
It looks good, but I have noticed that for the comparison of test and train data you are comparing the datasets on the SMILES column, not the INCHIKEY column. Remember that different SMILES can represent the same molecule, so make sure to change that to InChIKeys - with that dataset we can run the predictions again and update the data.
Next would be to take the other notebooks and compress them into one (I think that would suffice), with the comparison of model performances as a scatter plot, and we will be done with this!
@GemmaTuron I've created a pull request with the changes
Hi @leilayesufu
Thanks, a few comments!
```python
processed_test_dataset = test_dataset[~test_dataset['smiles'].isin(training_set['smiles'])]
```
Here you should use the InChIKey to compare, not the SMILES. Can you summarise the results here? What are your thoughts?
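The corrected filter might look like this sketch (toy frames; column names follow the snippet quoted above):

```python
import pandas as pd

# Toy frames; in the real notebook these are the training and test CSVs
training_set = pd.DataFrame({"inchikey": ["KEY-A", "KEY-B"]})
test_dataset = pd.DataFrame({"inchikey": ["KEY-B", "KEY-C"], "activity": [1, 0]})

# Filter on InChIKey, not SMILES, so equivalent molecules are caught
processed_test_dataset = test_dataset[
    ~test_dataset["inchikey"].isin(training_set["inchikey"])
]
```

Two different SMILES strings can canonicalise to the same InChIKey, so a SMILES-based filter can silently leave training molecules in the test set.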
Good morning @GemmaTuron, I have updated the notebook and created a new PR using InChIKeys for the comparison.
In the model datasets folder, the folder eos2ta5 contains the raw training datasets of the model before it is processed. eos2ta5_processed.csv is the processed dataset of eos2ta5 while training_set.csv is the combined training datasets of the models.
@GemmaTuron Good morning, these are the predictions obtained from eos4tcc: prediction_eos4tcc.csv. Upon further inspection, I noticed that the predicted probabilities all fall roughly within the 0.1-0.8 range, which is why the eos4tcc plot had that shape.
I fetched the eos4tcc model again and tested it on a SMILES dataset I have locally (not hERG-related), and I got the following results: eos4tcc_test.csv
Those are the results based on the test dataset we use for the validation
Good morning, I have started work on the DILI model validation. I have processed the training data of the two DILI models, split them into smiles|inchikey|activity columns, and obtained the training set. However, while processing the test data to use for predictions, I found that the TDC DILI dataset has 473 of 475 InChIKeys in common with the training set, so I am looking for other DILI datasets we can use for testing. I have created a PR with the updates.
Hey @leilayesufu - since this issue thread is related to hERG, can we move the DILI conversation elsewhere? Thanks!
I will actually close this issue
Summary
The Ersilia Model Hub features several hERG models. We'd like to compare their performances using an unbiased dataset obtained from the literature.
Scope
Batch 🐕
Objective(s)
Team
Timeline
When the open model tasks are completed
Documentation
No response