ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Saisri Vishwanath #1047

Closed saisri0102 closed 5 months ago

saisri0102 commented 6 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

saisri0102 commented 6 months ago

Hi,

I am writing to express my strong interest in the internship opportunity with Ersilia. My background in machine learning, deep learning, and Python programming, along with my previous project experience, makes me a strong candidate for this role.

During my previous projects, I have demonstrated my ability to work with state-of-the-art deep learning frameworks such as TensorFlow, PyTorch, and Hugging Face's Transformers library. In my recent role, I conducted a comprehensive benchmark of Hugging Face Generative Pre-trained Transformer models, evaluating their performance in generating text responses. This project not only honed my technical skills but also allowed me to develop a deep understanding of model evaluation methodologies and metrics.

Additionally, my experience in developing machine learning algorithms, such as the Grammar Error Corrector and the Brain Controlled Interface for Controlling Robotic Arm, has equipped me with a solid foundation in neural network architectures, attention mechanisms, and signal processing techniques. These skills will be invaluable in contributing to Ersilia's projects, especially in implementing advanced algorithms and models for natural language processing tasks.

Furthermore, I am deeply motivated by the opportunity to work on projects that have a meaningful impact on society. The Brain Controlled Interface project, in particular, allowed me to witness firsthand the transformative power of technology in improving the quality of life for individuals with disabilities. I am excited about the prospect of contributing to Ersilia's mission of developing innovative solutions that address real-world challenges.

Participating in the internship with Ersilia will not only advance my technical skills but also provide me with valuable industry experience and exposure to cutting-edge research and development practices. I am eager to collaborate with the talented team at Ersilia and contribute my expertise to meaningful projects.

Thank you for considering my application. I am enthusiastic about the opportunity to further discuss how my skills and experiences align with the goals of Ersilia. I am looking forward to the possibility of contributing to your team.

Sincerely, Saisri Vishwanath

saisri0102 commented 6 months ago

@DhanshreeA @Inyrkz
The Week 2 tasks are uploaded to the GitHub repo below: https://github.com/saisri0102/model-validation/tree/main/model-validation

Could you please review them and let me know whether I can start working on the Week 3 tasks?

saisri0102 commented 5 months ago

WEEK 2: Get Familiar with Machine Learning for Chemistry

Model selected: Prediction of hERG Channel Blockers with DMPNN - eos30f3

Model Eos30f3 Description. The hERG potassium ion channel regulates the flow of potassium ions that is essential for maintaining the heart's electrical activity. Certain compounds can block this channel, causing hERG-mediated cardiotoxicity. This model uses ChemProp, a directed message passing neural network (D-MPNN), to predict the cardiotoxicity potential of compounds by assessing their interaction with the hERG channel. It was trained on a dataset of 7,889 molecules with a concentration threshold of 10 μM.

Task 1 - Assessing Model Eos30f3 Bias. The 1,000-molecule datasets used in the bias task were downloaded from ChEMBL. The code can be found in the GitHub repo below: https://github.com/saisri0102/model-validation/tree/main/model-validation

Task 2 - Model Eos30f3 Reproducibility

Identify Results you want to reproduce

According to the publication, diverse classification models were trained using a neural network called directed message passing neural network (D-MPNN) on various datasets collected from multiple sources in order to identify compounds that inhibit hERG. The model that performed the best was the D-MPNN + moe206, achieving an AUC-ROC value of 0.956 ± 0.005. However, it's worth noting that the molecular descriptor moe206, utilized in this model, is proprietary, so the model implemented on Ersilia was trained without a molecule featurizer. We aim to replicate the original D-MPNN model, which achieved an AUC-ROC value of 0.947 ± 0.005, using a 5-fold cross validation with random splitting. This model was trained on a dataset comprising 7889 compounds with well-defined experimental data on hERG, encompassing diverse chemical structures and featuring 6 thresholds (10 μM, 20 μM, 40 μM, 60 μM, 80 μM, and 100 μM) for distinguishing hERG blockers from non-blockers. The author selected a 10 μM threshold for the model. This dataset was curated by Cai et al. and published in J Chem Inf Model, 2019.

Implement the model on your system as described by the authors

I cloned the model repository onto my Ubuntu 22.04 system using: git clone https://github.com/AI-amateur/DMPNN-hERG.git

The code can be found in the GitHub repo below: https://github.com/saisri0102/model-validation/tree/main/model-validation

DhanshreeA commented 5 months ago

Hi @saisri0102, good work so far! Glad to see Ersilia Compound Embeddings being used for 2D visualizations. Could you do the following and then move on to the final application?

  1. Summarize in this issue, in a table, results from Task 1?
  2. Same as above, summarize reproducibility results for Task 2, including from your implementation of the paper, the EMH implementation, as well as the results published in the paper. Thank you.

I will do a final review on Monday.

saisri0102 commented 5 months ago

Week 2 Task 1 Results:

I used the eos30f3 model to make predictions on the ChEMBL dataset.

The formulation of the problem involves using machine learning, specifically the ChemProp network (D-MPNN), to predict whether a molecule is a blocker of the hERG channel. The input to the model is a molecular structure represented in a format suitable for processing by the ChemProp network. This could include SMILES strings or other molecular representations. The output of the model is a prediction of the likelihood or probability that the molecule blocks the hERG channel.

Key Input Activity
HWGPBEQLDAATTP-UHFFFAOYSA-N N#CC1CCCN(C(=O)CCc2cccc(F)c2)C1 0.833848
VZEQMVMGOXXSDA-UHFFFAOYSA-N CCC(=O)c1cnc2ccc(-c3cc(Cl)c(O)c(OC)c3)cc2c1Nc1ccccc1Cl 0.737828
XPDWCQMOAYLTHH-CCVNUDIWSA-N C/C(=N\NC(=O)c1nc2c(c(=O)[nH]1)C1CCCN1C(=O)N2c1ccc(C(F)(F)F)cc1)N(C)C(C)C(C)C 0.596653
WWAFZFZKTQQHTL-ILRYNQFESA-N CC(=O)N[C@@H]1C@@HC@HC@@HO[C@H]1n1cnc2c(N)ncnc21 0.384757
BSKQAAYIGGYUAZ-VGOFMYFVSA-N Oc1[nH]c2ccccc2c1/C=N/c1nccs1 0.407854
... ... ...
UWRRUNGWGYWTLN-UHFFFAOYSA-N CC(C)C1CCC(N2CCC(N3c4ccccc4NS3(=O)=O)CC2)CC1 0.856711
AHOJWFAUNHFGRL-UHFFFAOYSA-N O=C(O)c1ccc2c(c1)N(C(=O)CNCc1ccc(F)cc1)CC(=O)N(C)C1CC1 0.686756
ZTJKDZJNWLALKP-UHFFFAOYSA-N COc1ccc2[nH]c(=O)c(-c3cc(C)cc(C)c3)c(OCC3CCCN(c4ncccc4C4CC4)CC3)c2c1 0.873780
AZLDEGGPCJQDTC-UHFFFAOYSA-N CCC(=O)Nc1cccc2c(OCC(O)C(C)NC(C)C)cccc12 0.784640
ZOQSXBXNFXEJQF-UHFFFAOYSA-N S=C1SCN(Cc2cccnc2)CN1Cc1cccnc1 0.791436
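
As a quick sanity check on the rows shown above, the sketch below counts how many of the ten visible activity scores fall at or above a 0.5 cutoff and reports their mean. The 0.5 cutoff and the variable names are assumptions made for illustration; the thread does not state a decision threshold, and the full 1,000-molecule output is not reproduced here.

```python
from statistics import mean

# Activity scores copied from the ten rows visible in the table above.
visible_scores = [
    0.833848, 0.737828, 0.596653, 0.384757, 0.407854,
    0.856711, 0.686756, 0.873780, 0.784640, 0.791436,
]

# Assumed decision cutoff; the thread does not say which one was used.
CUTOFF = 0.5

# Count the scores that would be labelled "blocker" under the assumed cutoff.
n_above = sum(score >= CUTOFF for score in visible_scores)
mean_score = mean(visible_scores)

print(f"{n_above}/{len(visible_scores)} visible scores are >= {CUTOFF}")
print(f"mean visible score: {mean_score:.4f}")
```

Running the same summary over the complete prediction file, rather than this ten-row excerpt, would give a more meaningful picture of how the scores are distributed.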
saisri0102 commented 5 months ago

Week 2 Task 2 Results:

Predictions generated by the D-MPNN+moe206 model for the test dataset during the first run of a 5-fold cross-validation:

smiles class
C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.9947536
CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.9365226
CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5)O4)CC2 0.9982115
Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5ccncn25 0.9781892
Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+]c7ccccc7nc6cc5C4)c8ccc(cc83)C(=O)N 0.9908385
... ...
OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5ccc(F)cc5)c4=O)ccn3c2 0.06717938
Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CCN+CC4 0.9655933
CC[C@H]1OC(=O)C@HC@@HC2(C)C)C@H[C@H]1OC 5.475458e-10
CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=CC(=O)N4CCNH+c5ccccc5)CC3c2 0.0004197311
COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.6115026

Evaluation results of the D-MPNN+moe206 model:

col_names roc prc Recall Precision f1 BA accuracy TN FP FN TP SP SE NPV MCC cohen_kappa
class_class 0.9624590577 0.9451516602 0.9166666667 0.8839285714 0.9 0.9141156463 0.9137254902 134 13 9 99 0.9115646259 0.9166666667 0.9370629371 0.8246034551 0.8241820233
class_class 0.9611992945 0.9408336261 0.8796296296 0.9134615385 0.8962264151 0.9092025699 0.9137254902 138 9 13 95 0.9387755102 0.8796296296 0.9139072848 0.8228747763 0.8224458792
class_class 0.9518770471 0.9059122899 0.8796296296 0.8636363636 0.871559633 0.8887944067 0.8901960784 132 15 13 95 0.8979591837 0.8796296296 0.9103448276 0.7757829052 0.7756833176
class_class 0.9524439405 0.8966710056 0.9166666667 0.8839285714 0.9 0.9141156463 0.9137254902 134 13 9 99 0.9115646259 0.9166666667 0.9370629371 0.8246034551 0.8241820233
class_class 0.9601284958 0.9250141165 0.9074074074 0.875 0.8909090909 0.9060846561 0.9058823529 133 14 10 98 0.9047619048 0.9074074074 0.9300699301 0.8086118298 0.8081985709
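
The threshold-based columns in these rows follow from the four confusion-matrix counts. The sketch below recomputes them for the first result row (TN=134, FP=13, FN=9, TP=99) using the standard textbook definitions. The variable names are illustrative, and this is not the evaluation script from the model repository.

```python
from math import sqrt

# Confusion-matrix counts taken from the first result row above.
tn, fp, fn, tp = 134, 13, 9, 99
total = tn + fp + fn + tp

recall = tp / (tp + fn)        # reported as Recall and SE (sensitivity)
specificity = tn / (tn + fp)   # reported as SP
precision = tp / (tp + fp)
npv = tn / (tn + fn)           # negative predictive value
accuracy = (tp + tn) / total
balanced_accuracy = (recall + specificity) / 2  # reported as BA
f1 = 2 * tp / (2 * tp + fp + fn)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Cohen's kappa: observed agreement corrected for chance agreement.
expected_agreement = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
cohen_kappa = (accuracy - expected_agreement) / (1 - expected_agreement)

for name, value in [
    ("Recall/SE", recall), ("Precision", precision), ("f1", f1),
    ("BA", balanced_accuracy), ("accuracy", accuracy), ("SP", specificity),
    ("NPV", npv), ("MCC", mcc), ("cohen_kappa", cohen_kappa),
]:
    print(f"{name}: {value:.10f}")
```

The roc and prc columns cannot be recovered this way, because they depend on the ranking of the predicted probabilities rather than on a single cutoff.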

Below are the predictions obtained from the Ersilia Model Hub implementation:

key input activity
UGELZTGBPPXJPE-OAHLLOKOSA-O C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.886925
YTYATOMQOOFRNA-UHFFFAOYSA-O CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.785938
NIYGLRKUBPNXQS-UHFFFAOYSA-O CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5... 0.850671
UDRWVFGKMDCPTL-UHFFFAOYSA-O Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5cc... 0.886333
JQEQULOLEPTBRS-UHFFFAOYSA-P Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+]... 0.774152
... ... ...
BQSSHYNPAGSOBT-LJQANCHMSA-O OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5c... 0.845857
CLQSZXFVERHVFU-UHFFFAOYSA-N Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CC[N+](CCN5CCNC5=... 0.875263
PJVYTFDJHSYNLB-QNPWSHAKSA-N CC[C@H]1OC(=O)C@H[C@@H](O[C@H]2CC@@(... 0.208751
DCRAPCRZDJGSOF-PXAZEXFGSA-M CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=C... 0.390227
QIQNNBXHAYSQRY-KZVJFYERSA-O COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.198635

Below are the author-predicted values:

smiles class
C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.9947536
CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.9365226
CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5... 0.9982115
Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5cc... 0.9781892
Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+]... 0.9908385
... ...
OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5c... 0.06717941
Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CC[N+](CCN5CCNC5=... 0.9655933
CC[C@H]1OC(=O)C@H[C@@H](O[C@H]2CC@@(... 5.47548e-10
CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=C... 0.0004197315
COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.6115029

Evaluation results for the author implementation:

col_names roc prc Recall Precision f1 BA accuracy TN FP FN TP SP SE NPV MCC cohen_kappa
class_class 0.9624590577 0.9451516602 0.9166666667 0.8839285714 0.9 0.9141156463 0.9137254902 134 13 9 99 0.9115646259 0.9166666667 0.9370629371 0.8246034551 0.8241820233
class_class 0.9611992945 0.9408336261 0.8796296296 0.9134615385 0.8962264151 0.9092025699 0.9137254902 138 9 13 95 0.9387755102 0.8796296296 0.9139072848 0.8228747763 0.8224458792
class_class 0.9518770471 0.9059122899 0.8796296296 0.8636363636 0.871559633 0.8887944067 0.8901960784 132 15 13 95 0.8979591837 0.8796296296 0.9103448276 0.7757829052 0.7756833176
class_class 0.9524439405 0.8966710056 0.9166666667 0.8839285714 0.9 0.9141156463 0.9137254902 134 13 9 99 0.9115646259 0.9166666667 0.9370629371 0.8246034551 0.8241820233
class_class 0.9601284958 0.9250141165 0.9074074074 0.875 0.8909090909 0.9060846561 0.9058823529 133 14 10 98 0.9047619048 0.9074074074 0.9300699301 0.8086118298 0.8081985709
saisri0102 commented 5 months ago

Week 2 Task 2 Results:

Predictions generated by the D-MPNN model for the test dataset during the first run of a 5-fold cross-validation:

Index SMILES Class
0 C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.985650
1 CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.086860
2 CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5 0.995024
3 Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5cc... 0.979510
4 Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+] 0.975559
... ... ...
250 OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5c... 0.746229
251 Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CC[N+](CCN5CCNC5=... 0.921453
252 CC[C@H]1OC(=O)C@H[C@@H](O[C@H]2CC@@(... 0.000008
253 CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=C... 0.012800
254 COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.448671

Evaluation results of the D-MPNN model:

col_names roc prc Recall Precision f1 BA accuracy TN FP FN TP SP SE NPV MCC cohen_kappa
class_class 0.9552784077 0.9369958198 0.9259259259 0.8403361345 0.8810572687 0.8983371126 0.8941176471 128 19 8 100 0.8707482993 0.9259259259 0.9411764706 0.7890569999 0.7860538827
class_class 0.9520030234 0.9317286147 0.8981481481 0.8584070796 0.8778280543 0.8946523054 0.8941176471 131 16 11 97 0.8911564626 0.8981481481 0.9225352113 0.7851123174 0.7844868063
class_class 0.9555303603 0.9344225018 0.8611111111 0.8942307692 0.8773584906 0.8931405896 0.8980392157 136 11 15 93 0.925170068 0.8611111111 0.9006622517 0.7905753739 0.7901633118
class_class 0.9642227261 0.9510559147 0.8425925926 0.91 0.875 0.8906840514 0.8980392157 138 9 17 91 0.9387755102 0.8425925926 0.8903225806 0.7907885536 0.7891221374
class_class 0.9503653313 0.9274934124 0.8796296296 0.8796296296 0.8796296296 0.8955971277 0.8980392157 134 13 13 95 0.9115646259 0.8796296296 0.9115646259 0.7911942555 0.7911942555
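
To compare these rows with the 0.947 ± 0.005 AUC-ROC quoted from the publication, the five roc values can be condensed into a mean and a spread. The sketch below uses the sample standard deviation, which is an assumption; the thread does not say whether the paper reports a sample or population standard deviation, and the variable names are illustrative.

```python
from statistics import mean, stdev

# AUC-ROC ("roc" column) from the five D-MPNN evaluation rows above.
roc_values = [
    0.9552784077,
    0.9520030234,
    0.9555303603,
    0.9642227261,
    0.9503653313,
]

mean_auc = mean(roc_values)
# Sample standard deviation (n - 1 in the denominator); an assumption,
# since the convention used in the publication is not stated in this thread.
std_auc = stdev(roc_values)

print(f"reproduced AUC-ROC: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Summarizing the D-MPNN+moe206 rows the same way would give the number to set against the 0.956 ± 0.005 figure.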

Below are the predictions obtained from the Ersilia Model Hub implementation:

key input activity
UGELZTGBPPXJPE-OAHLLOKOSA-O C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.886925
YTYATOMQOOFRNA-UHFFFAOYSA-O CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.785938
NIYGLRKUBPNXQS-UHFFFAOYSA-O CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5... 0.850671
UDRWVFGKMDCPTL-UHFFFAOYSA-O Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5cc... 0.886333
JQEQULOLEPTBRS-UHFFFAOYSA-P Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+]... 0.774152
... ... ...
BQSSHYNPAGSOBT-LJQANCHMSA-O OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5c... 0.845857
CLQSZXFVERHVFU-UHFFFAOYSA-N Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CC[N+](CCN5CCNC5=... 0.875263
PJVYTFDJHSYNLB-QNPWSHAKSA-N CC[C@H]1OC(=O)C@H[C@@H](O[C@H]2CC@@(... 0.208751
DCRAPCRZDJGSOF-PXAZEXFGSA-M CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=C... 0.390227
QIQNNBXHAYSQRY-KZVJFYERSA-O COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.198635

Below are the author predicted values:

Index SMILES Class
0 C[C@@H]1CCC[NH+]1CCc2oc3ccc(cc3c2)c4cncc(c4)C#N 0.985650
1 CCCCCCCN@@H+CCCCc1ccc(cc1)N+[O-] 0.086860
2 CS(=O)(=O)Nc1ccc2OC3(CCNH+C4CCc5cc(ccc5 0.995024
3 Fc1ccc(cc1)n2cc(C3CCNH+CC3)c5cc... 0.979510
4 Cc1ccc2c(cccc2n1)c3nnc(SCCC[NH+]4CCc5cc6[NH2+] 0.975559
... ... ...
250 OC[C@H]1CCC[NH+]1CCCOc2ccc3c(Nc4cnn(CC(=O)Nc5c... 0.746229
251 Clc1ccc2c(c1)c(cn2C3CCCCC3)C4CC[N+](CCN5CCNC5=... 0.921453
252 CC[C@H]1OC(=O)C@H[C@@H](O[C@H]2CC@@(... 0.000008
253 CNC@@H[C@@H]1CCN(C1)c2c(F)cc3C(=O)C(=C... 0.012800
254 COC(=O)[C@@H]1C@@HC[C@@H]2CC[C@H]1[N@H+]2C 0.448671
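
One way to compare the two sets of predictions is to binarize both at a cutoff and count how often they agree. The sketch below does this for the ten visible Ersilia Model Hub scores and the ten D-MPNN scores listed above for the same molecules. The 0.5 cutoff is an assumption for illustration, the list positions refer only to the rows shown here (not to the dataset index), and ten rows out of 255 are not enough to draw a conclusion about the full test set.

```python
# Scores for the same ten molecules, in the order shown above.
emh_scores = [
    0.886925, 0.785938, 0.850671, 0.886333, 0.774152,
    0.845857, 0.875263, 0.208751, 0.390227, 0.198635,
]
author_scores = [
    0.985650, 0.086860, 0.995024, 0.979510, 0.975559,
    0.746229, 0.921453, 0.000008, 0.012800, 0.448671,
]

# Assumed cutoff for calling a molecule a hERG blocker.
CUTOFF = 0.5

emh_labels = [score >= CUTOFF for score in emh_scores]
author_labels = [score >= CUTOFF for score in author_scores]

# Positions (within this ten-row excerpt) where the two labels differ.
disagreements = [
    position
    for position, (emh, author) in enumerate(zip(emh_labels, author_labels))
    if emh != author
]
n_agree = len(emh_scores) - len(disagreements)

# Largest absolute difference between the two probability estimates.
max_gap = max(abs(e - a) for e, a in zip(emh_scores, author_scores))

print(f"label agreement: {n_agree}/{len(emh_scores)}")
print(f"disagreeing positions: {disagreements}")
print(f"largest score gap: {max_gap:.6f}")
```

Applying the same comparison to the full prediction files would show whether the differences are limited to a few molecules or spread across the test set.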

Evaluation results for the author implementation:

col_names roc prc Recall Precision f1 BA accuracy TN FP FN TP SP SE NPV MCC cohen_kappa
class_class 0.9552784077 0.9369958198 0.9259259259 0.8403361345 0.8810572687 0.8983371126 0.8941176471 128 19 8 100 0.8707482993 0.9259259259 0.9411764706 0.7890569999 0.7860538827
class_class 0.9520030234 0.9317286147 0.8981481481 0.8584070796 0.8778280543 0.8946523054 0.8941176471 131 16 11 97 0.8911564626 0.8981481481 0.9225352113 0.7851123174 0.7844868063
class_class 0.9555303603 0.9344225018 0.8611111111 0.8942307692 0.8773584906 0.8931405896 0.8980392157 136 11 15 93 0.925170068 0.8611111111 0.9006622517 0.7905753739 0.7901633118
class_class 0.9642227261 0.9510559147 0.8425925926 0.91 0.875 0.8906840514 0.8980392157 138 9 17 91 0.9387755102 0.8425925926 0.8903225806 0.7907885536 0.7891221374
class_class 0.9503653313 0.9274934124 0.8796296296 0.8796296296 0.8796296296 0.8955971277 0.8980392157 134 13 13 95 0.9115646259 0.8796296296 0.9115646259 0.7911942555 0.7911942555
saisri0102 commented 5 months ago

@DhanshreeA I have summarised the results from Week 1 and Week 2. Could you please confirm whether I can go ahead and submit my final application?

DhanshreeA commented 5 months ago

Hi @saisri0102 looks great! Thanks for your efforts. Please go ahead and submit the final application.