Closed Hamidatmohd closed 7 months ago
MOTIVATION STATEMENT My name is Hamidat Mohammed. I am an aspiring data scientist and AI enthusiast, skilled in Python and machine learning. I have always been passionate about contributing to open source, and this passion led me to apply for the Outreachy project. After my initial application was approved, I went through all the available projects and this one caught my attention. I am not from a science background, but I have always wanted to work in the healthcare industry. This is what motivated me to apply to Ersilia after reading about what they do. This will really help my career because I will be able to learn more about machine learning and how to apply it to chemistry, science and healthcare. I am really excited about this opportunity. During the internship, I plan to dedicate my time to learning, contributing, building and growing with the Ersilia community. After the internship, I would like to switch to working in the healthcare industry, especially in my country, Nigeria, which is a developing country, and contribute to how ML and AI can help improve healthcare. Thank you so much for this opportunity.
Hello @GemmaTuron and @DhanshreeA Please find the link attached to my repo for task 1 week 2 for your review. Thank you so much
https://github.com/Hamidatmohd/Ersilia-Model-Evaluation/blob/main/notebooks/00_model_bias.ipynb
Hello @GemmaTuron and @DhanshreeA Please find the link attached to my repo for task 2 week 2 for your review. Thank you so much
Hi @Hamidatmohd
So that we can provide feedback, please explain what you have done and the conclusions you reached instead of just pasting a link here. Once this is explained and we have had a look, you can start working on your final application, thanks!
Hello @GemmaTuron Thank you so much, I will do just that.
WEEK 2 Model Selection: eos6oli
Model Description: This model predicts the aqueous solubility of compounds, an important property for drug discovery, using a molecule's SMILES representation as input.
Task 1: Assessing the bias of model eos6oli
STEPS TAKEN FOR TASK ONE
Model Selection and GitHub Repository After reviewing the provided list of models, I chose the eos6oli model because I understood the objectives and methodology of its publication well. I created a GitHub repository with the required structure, containing all the necessary files and the results.
Data Pre-Processing I standardized the SMILES representations using the RDKit library and the code provided in src, removed any null values, and then extracted the InChIKeys.
Model Predictions I then installed, fetched and served the model. Next, I ran the dataset of 1000 molecules through the Ersilia eos6oli model and converted the output into a DataFrame for further analysis.
Model Evaluation I plotted the predicted values as a histogram, which showed that most compounds had solubility values concentrated around -5. I also used Morgan fingerprints derived from the SMILES to generate a scatterplot of solubility values against a threshold of 0.5. The plot is colored by region: blue indicates low solubility (prediction probability < 0.5) and red indicates high solubility (prediction probability ≥ 0.5).
Result Interpretation In the model's publication, a soluble compound is defined as one with log S > -4, meaning it can form a 100 μM solution. The publication also suggests that the model is effective at screening out insoluble compounds. After calculating the base-10 logarithm of solubility, I found 998 compounds with log S values of -10 and only 2 compounds with log S > -4. Checking whether compounds could reach a 100 μM solution showed that only the 2 compounds with log S > -4 met this criterion.
Conclusion Comparing the model information with the predictions, I conclude from these results that the model shows no bias. The model identified 998 compounds with low solubility (log S of -10), and only 2 compounds had log S > -4, which supports this conclusion.
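The link between the log S > -4 cutoff and a 100 μM solution is plain unit arithmetic: 10^-4 mol/L equals 100 μmol/L. Here is a minimal sketch of that check, assuming log S is the base-10 logarithm of solubility in mol/L. The function names and the example values are hypothetical and are not taken from the notebook.

```python
def logs_to_micromolar(log_s: float) -> float:
    """Convert log10 molar solubility (mol/L) to a concentration in micromolar."""
    return (10 ** log_s) * 1e6


def is_soluble(log_s: float, threshold: float = -4.0) -> bool:
    """Apply the 'log S > -4' definition of a soluble compound."""
    return log_s > threshold


# Hypothetical predicted values, for illustration only.
example_predictions = [-10.0, -5.2, -3.1]
labels = [is_soluble(value) for value in example_predictions]
print(labels)
print(logs_to_micromolar(-4.0))
```

A compound sitting exactly at log S = -4 is not counted as soluble here because the comparison is strict, which follows the "> -4" wording quoted above.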
Task 2: Model Reproducibility The link to the model publication can be found here: Link. The molecule datasets used in the bias task were obtained from the author's GitHub repository. The GitHub repo created for this task can be found here. The notebook for the model reproducibility task can be found here; it contains further analysis and the steps taken to complete the task.
STEPS TAKEN FOR TASK TWO
Reviewing the publication and selecting results to reproduce I reviewed the publication and identified a result that I could replicate using the SC2 dataset referenced in the study.
Installation of the model After locating the GitHub repository referenced in the publication, I followed the installation and usage instructions outlined in Model_Reproducibility. I installed the dependencies and the Python package associated with the model, and used the SC2 dataset provided in the author's model repository.
Running Predictions with the SolTranNet Model and Figure Reproducibility I then ran predictions with the model and saved the output for analysis. To check reproducibility, I compared the SolTranNet model's predictions against the actual solubility values in the dataset. I then recreated a histogram from the publication and obtained identical results, confirming reproducibility. I also used the categorized datasets available in the repository to generate sensitivity results, which matched those presented in the publication.
Reproducibility in the Ersilia Hub For Model_Reproducibility, using the same SC2 dataset, I used the Ersilia model to generate predictions. After saving the results, I compared them with the dataset and replicated both the histogram and the sensitivity results. The outcomes obtained with the Ersilia model matched those obtained with the author's model, demonstrating consistency and reproducibility.
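For reference, sensitivity is the fraction of true positives that a model recovers, TP / (TP + FN). The sketch below is an assumption about how such a figure could be computed, not the publication's or the notebook's exact procedure: it treats "insoluble" (log S ≤ -4) as the positive class, reusing the solubility cutoff quoted in Task 1, and the input values are made up.

```python
def sensitivity(true_log_s, pred_log_s, threshold=-4.0):
    """Recall of the insoluble class: TP / (TP + FN).

    Assumption: a compound is 'insoluble' when log S <= threshold.
    """
    true_positive = 0
    false_negative = 0
    for true_value, predicted in zip(true_log_s, pred_log_s):
        if true_value <= threshold:  # experimentally insoluble
            if predicted <= threshold:
                true_positive += 1
            else:
                false_negative += 1
    if true_positive + false_negative == 0:
        raise ValueError("no insoluble compounds in the reference values")
    return true_positive / (true_positive + false_negative)


# Hypothetical experimental and predicted log S values.
true_vals = [-6.0, -5.0, -4.5, -2.0]
pred_vals = [-5.5, -3.5, -4.8, -1.0]
print(sensitivity(true_vals, pred_vals))
```

Running the same function on the author's predictions and on the Ersilia predictions gives two numbers that can be compared directly.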
Hi @Hamidatmohd
Thanks for the explanation, much clearer now! You did a good job, please go ahead and start working on your final application
Task 3: Model Validation The link to the model publication can be found here: Link. The Molecule Datasets used in the validation task. The Notebook for the Model Reproducibility can be found here; it contains further analysis and the steps taken to complete the task. STEPS TAKEN FOR TASK 3
Dataset Sourcing and Validation After a brief literature review, I identified a publication containing datasets of molecular properties, including experimentally determined solubility. I selected one dataset as an external reference for validation. I also obtained the training dataset for the SolTranNet model and compared its SMILES representations with those in the external dataset to ensure data integrity.
Data Cleaning and Standardization Of the 9955 SMILES entries in the external dataset, 368 overlapped with the training dataset and were removed so that the model would not be validated on molecules it had already seen. I then verified and standardized the remaining SMILES entries and saved the processed dataset.
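The overlap removal can be sketched as a set-membership filter. This is a minimal illustration, not the notebook's code: it assumes both lists already hold SMILES standardized in the same way, since two different strings for the same molecule would not match. The function name and the toy SMILES are hypothetical.

```python
def remove_overlap(external_smiles, training_smiles):
    """Drop external entries that also appear in the training set.

    Returns the kept entries (original order preserved) and the number removed.
    """
    training_set = set(training_smiles)  # constant-time membership checks
    kept = [smiles for smiles in external_smiles if smiles not in training_set]
    removed = len(external_smiles) - len(kept)
    return kept, removed


# Toy example with three external molecules, one of which is in the training set.
external = ["CCO", "c1ccccc1", "CC(=O)O"]
training = ["CCO", "CCN"]
kept, removed = remove_overlap(external, training)
print(kept, removed)
```

Comparing InChIKeys instead of SMILES strings is a common alternative when the two sources were standardized differently.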
Prediction and Metric Evaluation Predictions were generated for the training dataset using the eos6oli model, and the resulting output was saved. This output was then matched with the corresponding entries in the external dataset to obtain experimental values, which were saved in a CSV file. Model validation was conducted by calculating the R2 score (coefficient of determination), yielding a score of 0.51, meaning the predictions account for roughly half of the variance in experimental solubility.
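As a reminder of what the score measures, the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python. It is an illustration of the formula with made-up numbers, not the code or library call used in the notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical experimental and predicted log S values.
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]
print(r_squared(experimental, predicted))
```

A score of 1 means perfect agreement, 0 means the predictions do no better than always predicting the mean experimental value, and negative values are possible when they do worse.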
Principal Component Analysis (PCA) of Morgan Fingerprint A PCA plot was constructed using Morgan fingerprints for both the training and external validation datasets. The overlap observed between the two sets suggests the model's ability to generalize and perform effectively, even when applied to external datasets.
@DhanshreeA When you have time, and if it's not too much bother, please find my GitHub repo and task 3 explanation above for review. I will keep working on my final application while I await a review. Thank you so much.
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application