Hi @leilayesufu and @Richiio !
Super exciting news! Having completed a few model incorporations successfully, we are ready to move on to the next stage: a more data-science-oriented piece of work to ensure model reproducibility. Please see the project description. You'll need to work together on that project, which is also fantastic, and I'll help out as well as the other mentors! To start with, let's distribute the first tasks:
I know this is a very large project, the objective for the rest of your internship is just to set this going and get as far as we can, so that new contributors can take it up next.
I have prepared some structure in the repository already. This project will be team-driven, so it is important to abide by some general rules to avoid conflicts:
Hi @GemmaTuron,
Thank you for the clear instructions. I'll get started right away on reformatting the hERG validation. I'll begin by updating the README file to summarize our work, incorporating the following sections:
**Models Used:** I'll list the models used in this comparison along with their repository links. Additionally, I'll describe the output format of each model and how we interpret it, whether it's probabilities, pIC50 values, etc.

**Data Acquisition:** I'll detail how we obtained the data for the hERG validation.

**Combined Data:** I'll explain what the combined data is and how we merged the different datasets, if applicable.

**Data Pre-processing:** I'll outline the steps taken for data pre-processing, including converting the data to canonical SMILES and generating InChIKeys, the featurization technique employed, and any visualizations performed. (A rough sketch of the canonicalization step is below.)
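This is a minimal sketch of that step, assuming the combined file has a single smiles column (file and column names here are illustrative, not final):

```python
import pandas as pd
from rdkit import Chem

# hypothetical input file with a "smiles" column
df = pd.read_csv("herg_combined.csv")

def canonicalize(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return pd.Series([None, None])  # unparsable SMILES are dropped below
    return pd.Series([Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)])

df[["canonical_smiles", "inchikey"]] = df["smiles"].apply(canonicalize)
df = df.dropna(subset=["canonical_smiles"])
df.to_csv("herg_combined_processed.csv", index=False)
```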
After completing the README, I'll proceed to review each of the hERG publications to identify one specific example from each publication that we can reproduce for each model. Additionally, I'll work on the scripts in the src folder for any reusable tasks.
If I encounter any challenges or have questions during the process, I'll make sure to reach out for assistance.
Looking forward to collaborating with everyone on this project!
Leaving some fun reading material for both @leilayesufu and @Richiio here to dive deeper into model bias when you get a chance:
- https://www.baeldung.com/cs/learning-curve-ml
- https://towardsdatascience.com/generalization-regularization-overfitting-bias-and-variance-in-machine-learning-aa942886b870
Thanks @GemmaTuron @miquelduranfrigola @DhanshreeA
For my task, I want to find out how the initial 800 molecules obtained from this library (https://github.com/ersilia-os/groverfeat) were standardized. I have been able to process the first 800 molecules, but I want to ask about the 200 to be obtained from COCONUT. Were salts removed? Were tautomeric forms standardized? Were functional groups normalized? Or was a mix of these done to obtain the SMILES we end up with?
Or has no standardization been done on the first 800, and am I meant to decide how to standardize them?
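For reference, the operations I mean would look roughly like this RDKit sketch (an illustrative pipeline, not necessarily what was actually applied):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)         # normalize functional groups, fix charges
    mol = rdMolStandardize.FragmentParent(mol)  # keep the parent fragment (removes salts)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # canonical tautomer
    return Chem.MolToSmiles(mol)

print(standardize("CC(=O)Oc1ccccc1C(=O)O.[Na+].[Cl-]"))  # counterions stripped
```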
Hi @Richiio
Those were not standardised in any particular manner. The dataset is already in the model-validations repo under /data and contains ~200k molecules.
@GemmaTuron @miquelduranfrigola @DhanshreeA
I have curated a dataset of 1000 rows: 800 rows of data from (https://github.com/ersilia-os/groverfeat) and 200 rows from COCONUT. Compounds obtained from COCONUT were randomly sampled, all with a molecular weight under 800, and all were processable by RDKit.
Random 800 from Grover: sampled_smiles.csv
Random 200 from COCONUT: final_200.csv
The combined dataset: Validation.csv
We have three columns: the SMILES, the canonical SMILES, and their InChIKeys.
The notebook containing data preprocessing done on the data can be found here: https://colab.research.google.com/drive/1H7yVFSyIC5GnnwEiGtqzDnJLGBxKLNdm?usp=sharing
This will be uploaded to the model-validations repo folder.
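For reference, the COCONUT filtering and sampling step in the notebook looks roughly like this (file and column names are illustrative):

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# hypothetical input file with COCONUT SMILES
coconut = pd.read_csv("coconut_smiles.csv")

def keep(smi):
    mol = Chem.MolFromSmiles(smi)  # drop anything RDKit cannot parse
    return mol is not None and Descriptors.MolWt(mol) < 800

filtered = coconut[coconut["smiles"].map(keep)]
final_200 = filtered.sample(n=200, random_state=42)  # random_state only for repeatability
final_200.to_csv("final_200.csv", index=False)
```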
Hi @Richiio
Good start, but please work on the repository; I created a notebook specifically for that task already (/notebooks/00_...), and added the reference_library in the /data folder. Load it from there so we always have the source data in the same place. I did add the header to make it easier to read. So please move the code from the Colab to the repo and, once it is ready, open a PR. Before doing so, I suggest preparing the UMAP and PCA representations (functions of the umap and sklearn packages, respectively) to see how the molecules are distributed. It would be interesting to paint the synthetic molecules and the natural products in different colours. There are plenty of examples on the internet of how to do that, for example: https://blog.reverielabs.com/mapping-chemical-space-with-umap/ - give it a try and let us know!
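Something along these lines would work as a starting point (a sketch only; the file path, smiles column, and source labels are assumptions):

```python
import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from rdkit import Chem
from rdkit.Chem import AllChem

# assumed: the reference library with a column flagging synthetic vs natural compounds
df = pd.read_csv("data/reference_library.csv")

# featurize with Morgan fingerprints (library was curated to be RDKit-parsable)
fps = []
for smi in df["smiles"]:
    mol = Chem.MolFromSmiles(smi)
    fps.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)))
X = np.array(fps)

emb_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

colors = df["source"].map({"synthetic": "tab:blue", "natural": "tab:green"})
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, [emb_umap, emb_pca], ["UMAP", "PCA"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=colors, s=5)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("chemical_space.png", dpi=150)
```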
@GemmaTuron Good morning. For the reproducibility of eos2ta5: the authors tested the model on 3 different test sets against other models. Here's the result we could try to reproduce: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00541-z/tables/4
Links to the test sets: Test set-I, Test set-II, Test set-III
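To reproduce the table, we would compute the same metrics from the model's predictions on each test set, roughly like this (file names, column names and the 0.5 threshold are assumptions):

```python
import pandas as pd
from sklearn.metrics import matthews_corrcoef, accuracy_score, roc_auc_score

# hypothetical: eos2ta5 predictions already merged with the labelled test set
df = pd.read_csv("test_set_I_with_predictions.csv")
y_true = df["label"]                  # assumed 1 = hERG blocker, 0 = non-blocker
y_prob = df["probability"]            # assumed model output column
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold

print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
```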
@GemmaTuron
I featurized the data using Morgan fingerprints and obtained a vector representation of the SMILES, which was used for the statistical analysis with UMAP and PCA. The plot has been created and a pull request will be made before the end of today. My system is currently in for a check; I should have access to it again by the end of today.
Hi @Richiio I did not see the PR, let us know once it is ready. @leilayesufu that is a good suggestion, let's see if we can find the same for the other models
@leilayesufu can you share the nuances of each of the test sets you have cited above? E.g. whether a set has more negative instances or more positive instances.
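Something as simple as this would show the class balance (assuming each test set file has a label column):

```python
import pandas as pd

# hypothetical file/column names for one of the test sets
df = pd.read_csv("test_set_I.csv")
print(df["label"].value_counts())                # absolute counts per class
print(df["label"].value_counts(normalize=True))  # class proportions
```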
For eos4tcc: on pages 7-8 of the publication, the authors compared the proposed BayeshERG model with other models using the external datasets provided here. This could serve as the reproducibility test for eos4tcc.
@DhanshreeA, in response to this: https://github.com/ersilia-os/ersilia/issues/966#issuecomment-1952277187. The figure below shows the classification of the test sets.
Hi @Richiio I have reviewed your PR: https://github.com/ersilia-os/model-validations/pull/10. Please resolve the issues present in your work; it should not take very long, as they're mostly very small changes. Thanks.
Hi @leilayesufu , the publication link that you have provided is broken and I cannot view it. In any case, can you provide more insight into the data summary statistics, including the comparison tests the authors have performed? Also, please do the same for the remaining two models. Thank you.
Hi @leilayesufu !
Hi @Richiio !
Agree with all except eos43at -- @leilayesufu this should be incorporated into the hERG testing, as it is a hERG cardiotoxicity model - can you do that?
Sarima, I've re-set the workflows so that the models are updated and can be used - you can start by structuring the repository folder for /cytotox and the notebooks, data and so on that you will use.
Hi @Richiio
eos4cxk - this is an anti-COVID model, so I would not classify it as cytotox even though it includes some cytotoxicity endpoints
The others, yes, let's go for them as well. I've set up the updated workflows.
Thanks @GemmaTuron I'll get started on the eight models.
Hi @Richiio
We will first do the model bias check, which means running the 1000 compounds against all models and simply producing scatter plots; we want to see, for example, if there are very extreme values. For this we do not need the data. Next we will focus on model reproducibility - identify one example per model (only those that have publications attached; eos3le9, for example, does not) that we can reproduce, as Leila has done.
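The plotting part could be as simple as this sketch (assuming one predictions CSV per model, with the model's score in the last column):

```python
import glob
import pandas as pd
import matplotlib.pyplot as plt

# assumed layout: predictions/<model_id>.csv, one file per model
files = sorted(glob.glob("predictions/*.csv"))
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, path in zip(axes.ravel(), files):
    scores = pd.read_csv(path).iloc[:, -1]  # assume the last column holds the prediction
    ax.scatter(range(len(scores)), scores, s=4)
    ax.set_title(path.split("/")[-1].replace(".csv", ""))
    ax.set_xlabel("compound index")
    ax.set_ylabel("prediction")
plt.tight_layout()
plt.savefig("bias_check_scatter.png", dpi=150)
```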
@GemmaTuron eos43at was part of the hERG models that we used for validation, but we couldn't get the dataset for it and had to proceed without it. I can add it for the model reproducibility, though.
@GemmaTuron
Task update: I've been able to get the predictions.csv file for five of the models. For the remaining three, I had to install torch, which has been giving me an HTTPS connection error. I will retry the installation at night when I have a better connection, and then create a scatter plot of the eight models.
Hi @Richiio
Remember you can use Codespaces. That would solve the problems with downloading packages.
Hi @Richiio could you help me understand why you needed to install torch? The models should be self contained and should not require additional dependency installation from the user's end (that's you). If it happened, we should flag this. Also could you mention which three models you had to do this for?
@DhanshreeA
That was a fault on my end. I was trying to reproduce the environment locally as if the models hadn't been incorporated. I could simply have run the ersilia fetch command, which is what I did on Codespaces.
Hi @Richiio I have merged your PR mainly because I do not want it to grow uncontrollably large, but there is a lot more work to be done.
Here are my comments and suggestions, please open a new PR for incorporating these.
Bringing @GemmaTuron's suggestions from the PR to here: https://github.com/ersilia-os/model-validations/pull/11#issuecomment-1958851879 @leilayesufu let's do the following as Gemma has suggested:
For eos7kpb, we have the following column names in the output:

pf_nf54, pf_k1, mtb, cho, hepg2, clint_h, clint_m, clint_r, caco_2, aq_sol, cyp2c9, cyp2c19, cyp3a4, cyp2d6, pf_nf54_norm, pf_k1_norm, mtb_norm, cho_norm, hepg2_norm, clint_h_norm, clint_m_norm, clint_r_norm, caco_2_norm, aq_sol_norm, cyp2c9_norm, cyp2c19_norm, cyp3a4_norm, cyp2d6_norm

I will close this issue; if we work on model-validations, I suggest we continue the discussion in the appropriate repo.
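As a final aside on the eos7kpb output above: since every raw endpoint has a _norm counterpart, splitting the two groups for inspection is straightforward (the output file name is an assumption):

```python
import pandas as pd

# hypothetical output file from eos7kpb
df = pd.read_csv("eos7kpb_output.csv")

norm_cols = [c for c in df.columns if c.endswith("_norm")]
raw_cols = [c[: -len("_norm")] for c in norm_cols]

print(df[raw_cols].describe())   # raw endpoint predictions
print(df[norm_cols].describe())  # their normalized counterparts
```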
Summary
The Hub has grown fast in recent months, and as such we need to make sure the models are reproducible and give the expected results, as some small bugs might have been introduced inadvertently. In this project, we will use the /model-validations repository to make sure the models perform well. There are, at least, three levels of model performance that we can check for:
Objective(s)
This project will be done in several steps. The current objectives are:
Deliverables: