Closed KiitanTheAnalyst closed 7 months ago
MOTIVATION STATEMENT
My journey into machine learning started last year, and over the last 6 months I have had the opportunity to practice with different datasets, using Python and a couple of machine learning models to make predictions. Knowing that I can build models to make accurate predictions in any sector is mind-blowing to me, and this has built my interest in Data Science.
The moment I saw the requirements to join the Ersilia community for this contribution phase, I had no second thoughts about joining, because I had looked forward to building my skills in this specific field. The tasks in this phase have further opened me up to the possibilities of using machine learning and AI for drug discovery, which I find fascinating. I strongly believe that applying my skills in Python will give me an opportunity to make meaningful contributions to the health sector.
As someone who lost her mom to breast cancer due to late detection and the lack of a guaranteed cure, I have first-hand experience of the pain that individuals with diseases like this go through. Being given the chance to intern in this amazing community, which is committed to lowering the barriers to drug discovery and helping with the development of new medicines, would give me the privilege of contributing to solutions and possible cures for people experiencing different kinds of diseases. With my resilience, dedication and readiness to learn, I believe being a part of the Ersilia community as an intern will help build my skills towards my aspiration of becoming a Data Scientist.
Hello @GemmaTuron, please find the link to my notebook for Week 2, Task 1 here for review: https://github.com/KiitanTheAnalyst/Ersilia---Olaitan-Suru/blob/main/Notebooks/00_Model_Bias.ipynb
Thank you
Hello @GemmaTuron @DhanshreeA, please find the link to my notebook for Week 2, Task 2 here for review: https://github.com/KiitanTheAnalyst/Ersilia---Olaitan-Suru/blob/main/Notebooks/01_Model_Reproducibility.ipynb
Thank you
Hi @KiitanTheAnalyst
In order to provide feedback please:
Once this is done, you can start working on the final application, thanks!
Thank you for this feedback @GemmaTuron, I will work on it immediately.
Step 1 - Model Selection Going through the list of models provided, I chose the eos6oli model because I understood the aim and the methodology of the publication.
Step 2 - GitHub Repository I created a GitHub repository with an appropriate structure, containing the files used and the results generated.
Step 3 - Data Pre-Processing I explored the reference_library dataset provided by the mentor, which contains 1000 rows of SMILES. I then standardised the SMILES using the code provided in the src folder, which I imported into my notebook after cloning my GitHub repo. I also generated InChIKeys from the SMILES, using code that I saved in the smiles_processing.ipynb file in the src folder.
Step 4 - Model Predictions I installed the Ersilia Model Hub, then fetched and served the model to ensure it was working smoothly, and ran predictions for this sample dataset. I saved the predictions in the output folder inside the data folder of my repo.
Step 5 - Model Bias Evaluation I plotted a histogram of the predicted results, which reveals that most compounds have solubility values concentrated around -5. I also generated Morgan fingerprints from the SMILES and used them to produce a scatterplot of solubility values with a threshold of 0.5. In this plot, blue marks regions with prediction probability less than 0.5, which indicates low solubility, and red marks areas of high solubility, with prediction probability equal to or greater than 0.5.
Step 6 - Interpretation of Result Model Information In the eos6oli publication, a soluble compound is defined as a compound with log S > -4, i.e., one for which a 100 μM solution can be obtained. The authors also suggest that the model is useful for screening out insoluble compounds.
I calculated the logarithm of solubility (base 10), and the result shows 998 compounds with log S values of -10 and 2 compounds with log S > -4. I also checked whether the compounds are able to reach a 100 μM solution, and the results show that only the 2 compounds whose log S is > -4 can.
Step 7 - Conclusion Comparing the model information with the predictions generated, it can be inferred that the model is not biased. The model was able to effectively screen out 998 compounds that have low solubility, with log S of -10 and an inability to reach a 100 μM solution. Only 2 compounds have log S values > -4 and are able to reach a 100 μM solution.
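For reference, the log S check described in Step 6 can be sketched as below. This is a minimal illustration and not the notebook code: the function names and the sample values are hypothetical, and it assumes log S is the base-10 logarithm of solubility in mol/L, so that log S = -4 corresponds to 100 μM.

```python
def logs_to_micromolar(log_s: float) -> float:
    """Convert log10 solubility in mol/L to a concentration in micromolar."""
    # 1 mol/L = 1e6 micromolar, so add 6 to the exponent.
    return 10.0 ** (log_s + 6.0)


def is_soluble(log_s: float, threshold: float = -4.0) -> bool:
    """Apply the 'log S > -4' definition of a soluble compound."""
    return log_s > threshold


# Hypothetical predicted log S values, for illustration only.
predicted_log_s = [-10.0, -5.2, -3.5, -4.0]

for value in predicted_log_s:
    print(value, round(logs_to_micromolar(value), 4), is_soluble(value))

n_soluble = sum(is_soluble(v) for v in predicted_log_s)
print("soluble compounds:", n_soluble)
```

Because the threshold is a strict inequality, a compound predicted at exactly -4 is counted as insoluble in this sketch.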
The link to my repository here
Step 1 - Installation of Model
I installed rdkit, which is a requirement, before installing the soltrannet package in my notebook:
!pip install rdkit
!pip install soltrannet
Step 2 - Selecting Result to Reproduce I read the publication and identified a result I could reproduce using the SC2 dataset that was used in the study.
Step 3 - Running Predictions with the SolTranNet Model From the publication, I found the GitHub repository containing the installation and usage instructions and the data used. Using the SC2 dataset from that repository, in Section 1 I checked that the SMILES representations were valid using the rdkit library. I then made predictions with the SolTranNet model and saved the output, which can be found in the output folder inside the data folder of my repo.
Step 4 - Reproducing Figures To ascertain reproducibility, I compared the predictions generated by the SolTranNet model against the actual solubility values from the authors' results, and re-created a histogram and a bar plot of false discovery rates from the publication, where insolubility is defined as log S <= -5 and <= -6. The results were the same, indicating reproducibility. I also used the more categorized datasets found in the SolTranNet repository to generate the false discovery rate of insoluble compounds as shown in the publication, and the output was also the same.
Step 5 - Reproducibility in Ersilia Model Hub Using the same SC2 dataset, in Section 2 of Model Reproducibility, I generated predictions using the Ersilia model, saved the results and compared them with the authors'. I re-created a histogram and 2 plots of false discovery rates, where insolubility is defined as log S <= -5 and <= -6. The results generated using the Ersilia model and the authors' model were exactly the same, which signifies reproducibility.
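The false discovery rate used in Steps 4 and 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the publication or the notebook: it assumes a compound is "flagged" as insoluble when its predicted log S is at or below the cutoff, and counts it as a false discovery when its experimental log S is above the same cutoff. The publication's exact definition may differ, and the values below are made up.

```python
def false_discovery_rate(experimental, predicted, cutoff):
    """Fraction of compounds predicted insoluble that are actually soluble.

    Assumption: 'insoluble' means log S <= cutoff for both the predicted
    and the experimental values.
    """
    flagged = [(e, p) for e, p in zip(experimental, predicted) if p <= cutoff]
    if not flagged:
        return None  # nothing was predicted insoluble, so FDR is undefined
    false_positives = sum(1 for e, _ in flagged if e > cutoff)
    return false_positives / len(flagged)


# Hypothetical log S values, for illustration only.
experimental = [-6.5, -5.5, -4.2, -3.0, -7.1, -5.8]
predicted = [-6.2, -5.1, -5.3, -2.8, -6.8, -4.9]

fdr_5 = false_discovery_rate(experimental, predicted, cutoff=-5.0)
fdr_6 = false_discovery_rate(experimental, predicted, cutoff=-6.0)
print(fdr_5, fdr_6)
```

Running the same function on the SolTranNet predictions and on the Ersilia predictions would then give two sets of rates that can be compared directly.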
The link to my repo here
Hello @GemmaTuron @DhanshreeA, I have summarised my findings here and have also updated my README as advised. Kindly review; I will be glad to provide more explanation of my results if required.
Link to my README
Thank you.
Hello @KiitanTheAnalyst
Good job and very good explanations. Please move onto preparing your final application for Outreachy, as the mentors will only be focusing on these this last week of the contribution period.
Step 1 - Sourcing a Dataset with Sufficient Experimental Results After a short literature review, I found this publication with datasets on molecules and their experimentally determined solubility. I downloaded one to use as the external dataset in my validation process. I also downloaded the training dataset for the SolTranNet model and compared its SMILES with the ones in my external dataset to be sure there was no data leakage.
Step 2 - Cleaning and Standardizing Out of 9955 SMILES in my external data, I found 368 SMILES that were already part of the training dataset and deleted them from my external dataset. I checked and confirmed that the SMILES were standardised and proceeded to save the processed dataset.
Step 3 - Ran Predictions and Calculated Metrics I made predictions on the training dataset using the eos6oli model and saved the output. Then, I matched the output with the external dataset to get the experimental values and saved it as a csv file. To validate the model, I calculated the R2 score (coefficient of determination). The score is 0.51, meaning the model explains about half of the variance in experimental solubility.
Step 4 - PCA of Morgan Fingerprint I created a PCA plot of the training and external validation sets using Morgan fingerprints. The two sets of data overlap, which suggests that the external compounds fall within the chemical space covered by the training data and helps explain how the model is able to generalize to the external dataset.
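The leakage check in Step 2 and the R2 score in Step 3 can be sketched in plain Python as below. This is a simplified illustration with hypothetical names and toy values, not the notebook code. It assumes both datasets were standardised the same way, since matching on raw SMILES strings only catches duplicates when the strings are identical (comparing InChIKeys would be more robust).

```python
def remove_overlap(external_rows, training_smiles):
    """Drop external (smiles, value) rows whose SMILES appear in the training set."""
    seen = set(training_smiles)
    return [row for row in external_rows if row[0] not in seen]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, for illustration only: (SMILES, experimental log S).
training_smiles = ["CCO", "c1ccccc1"]
external_rows = [("CCO", -0.1), ("CCN", 0.5), ("CCCC", -2.0), ("CC(=O)O", 1.0)]

clean_rows = remove_overlap(external_rows, training_smiles)
print(len(external_rows) - len(clean_rows), "overlapping SMILES removed")

# Hypothetical model predictions keyed by SMILES.
predictions = {"CCN": 0.3, "CCCC": -1.5, "CC(=O)O": 0.7}
y_true = [value for _, value in clean_rows]
y_pred = [predictions[smiles] for smiles, _ in clean_rows]
print("R2:", r2_score(y_true, y_pred))
```

An R2 of 1.0 would mean perfect agreement with experiment, while 0.0 is what a model that always predicts the mean experimental value would score.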
Hello @DhanshreeA, if it is not too much trouble, I would like to get a review of my Task 3 on model validation while I work on my final application. I hope to get pointers from you that could help me improve how I validate models in the future.
I have summarised my findings in the comment above and in my README. Link to my repo
Thank you for the awesome work since the contribution phase!
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application