✍️ Contribution period: Ishita Pathak

IshitaPathak commented 3 months ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[X] Install Docker if needed, and test another model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

[x] Select a model from the list suggested in GitBook
[x] Download and serve the model via the Ersilia Model Hub to ensure it works
[x] Open a repository on your GitHub user with all the necessary files
[x] Select and clean a dataset of 1000 molecules (example notebook 1)
[x] Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

[x] Find a suitable dataset with sufficient experimental results
[x] Clean and standardize the dataset
[x] Run predictions and calculate metrics.

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

IshitaPathak commented 3 months ago

Motivation Letter

Hi, I am Ishita Pathak, currently a first-year student pursuing a Master of Computer Applications at Indira Gandhi Delhi Technical University for Women, Delhi, India. I am writing to express my genuine excitement about the opportunity to contribute to Ersilia's goal of ensuring that laboratories in less affluent countries have access to cutting-edge AI and ML tools for discovering drugs to treat infectious and neglected diseases.

As a computer science student, I have worked across various tech stacks. However, my current aspiration is to delve deeper into AI/ML, which is also part of my coursework, and Ersilia's project provides a chance to leverage my skills and knowledge to address real-world challenges. Being a quick learner, I'm ready to dedicate the time and effort needed to achieve these goals and learn new things along the way.

Six years back, I went through a tough time when someone very close to me passed away because they couldn't get the medical help they needed in time. It really affected me and sparked a strong desire to make a difference in healthcare. I believe that contributing to Ersilia with my technical skills is the best way for me to do that. I am confident that I can contribute positively to advancing healthcare solutions and ultimately saving lives.

Why me? My passion for open source and my never-give-up attitude set me apart from others. I’ve always felt that working in open source and helping others is my way of doing good for society, but through this project I’ll be able not only to give back to the community but also to potentially save lives. I am excited about the opportunity to work on this project and will work as hard as I have to in order to make it a grand success.

Thanks and regards, Ishita Pathak

IshitaPathak commented 3 months ago

Week 1 TASK ✅

After installing the Ersilia Model Hub, I tested it with a simple model:

ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api run -i "CCCC"

Output

Testing Ersilia with Docker

docker pull ersiliaos/eos4wt0:latest
ersilia serve eos4wt0 
ersilia -v api run -i "CCCC"

Output

While completing the task, I got stuck while testing Ersilia model eos3b5e, where the container was always in the exited status. I asked about this in the Slack channel, where a mentor helped me resolve the issue.

I truly appreciate the supportive environment within the community, where both mentors and peers are always ready to lend a helping hand.

GemmaTuron commented 3 months ago

Hi @IshitaPathak, please update here on the Week 2 tasks that you have marked as done, so we can provide feedback.

IshitaPathak commented 3 months ago

MY PROGRESS AND LEARNINGS

So far, I've learned valuable skills to contribute to Ersilia. It's been an exciting journey.

I learned Docker by Dockerizing a simple app (GitHub repo here) and learned about:
- Dockerfile
- Caching layers
- Publishing to Docker Hub.
Explored Docker Compose, understanding port mapping and managing environment variables.

I have a strong foundation in Python, but my exposure to libraries was somewhat limited. To address this, I've invested some time in learning libraries such as Pandas and NumPy (GitHub repo here). By today, I aim to complete my understanding of Matplotlib and other libraries essential for my current task. Following this, I will move forward with the next part of the Week 2 tasks.

IshitaPathak commented 3 months ago

Week 2 TASK ✅

Chose the hERG model "eos30gr" from the list of suggested models in GitBook
Read the publication to better understand the model.
Model Overview

The hERG channel is responsible for regulating the electrical signals in the heart. When certain drugs block this channel, it can cause a condition known as long QT syndrome, which can lead to dangerous heart rhythm abnormalities.

To identify which drugs might have this effect, the authors of the publication developed a computer-based model called deephERG, which is available in the Ersilia Model Hub as eos30gr. This model uses a type of artificial intelligence called deep neural networks to analyze large datasets containing information on thousands of chemicals. By studying the chemical structures and properties of these compounds, deephERG can predict their likelihood of blocking the hERG channel.

#

Ensured model functionality on my system by downloading, serving, and running it using the following commands:

ersilia -v fetch eos30gr
ersilia serve eos30gr
ersilia -v api run -i "CCCC"

After fetching the eos30gr model, I consistently got null output for the SMILES prediction. Since the models are regularly updated, I tried the command `ersilia -v fetch eos30gr --from_github` to fetch the latest code from GitHub, which resolved the issue.

Output

#

Next, I studied the repository structure from the provided example and created a GitHub repository containing all the necessary files.

GemmaTuron commented 3 months ago

Hi @IshitaPathak

Thanks for the explanation. I suggest the following timeline:

[ ] Finish week 2 tasks, including a good explanation of what you have done and your conclusions
[ ] Start working on your final application

As the application period is coming to an end and we want to ensure applicants have time to prepare strong applications please do not tackle Week 3 tasks and focus on the final application instead. Thanks!

IshitaPathak commented 3 months ago

Thank you so much @GemmaTuron for the guidance and timeline. I'm committed to finishing the Week 2 tasks and starting work on my final application right away.

IshitaPathak commented 3 months ago

I selected the list of 1000 molecules, reference_library.csv, shared in Slack (data channel). To make sure the data was consistent, I standardized the SMILES representations using the function from src. Three SMILES were flagged as invalid by RDKit, resulting in NaN values, so I removed those entries from the dataset.
Next, I obtained the InChIKey representation for all the standardized SMILES. This information was used to create a DataFrame with two columns, "smiles" and "InChI_key", containing the processed SMILES and their corresponding InChIKeys. I then saved this processed data as a CSV file named processed_input.csv.
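
The cleaning steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: `standardise` is a hypothetical stand-in for the function from src (not shown in this thread) and `to_inchikey` is a hypothetical stand-in for an RDKit-based conversion. Both are passed in as callables so the example runs without RDKit, and the toy versions below do no real chemistry.

```python
import csv
import io


def clean_smiles(smiles_list, standardise, to_inchikey):
    """Standardise SMILES, drop entries that fail, and attach InChIKeys.

    `standardise` and `to_inchikey` are hypothetical callables; `standardise`
    should return None when it cannot process its input.
    """
    rows = []
    for smi in smiles_list:
        std = standardise(smi)
        if std is None:  # invalid SMILES: skip instead of keeping a NaN row
            continue
        rows.append({"smiles": std, "InChI_key": to_inchikey(std)})
    return rows


def write_rows(rows, handle):
    """Write the cleaned rows as a two-column CSV."""
    writer = csv.DictWriter(handle, fieldnames=["smiles", "InChI_key"])
    writer.writeheader()
    writer.writerows(rows)


# Toy stand-ins for illustration only (NOT real chemistry).
def toy_standardise(smi):
    smi = smi.strip()
    return smi if smi and "?" not in smi else None


def toy_inchikey(smi):
    return "KEY-" + smi


cleaned = clean_smiles(["CCCC", " CCO ", "C?C"], toy_standardise, toy_inchikey)
buffer = io.StringIO()  # in practice this would be an open processed_input.csv
write_rows(cleaned, buffer)
csv_text = buffer.getvalue()
```

In a real run, the two toy functions would be replaced by the standardisation function from src and an RDKit InChIKey conversion.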

After cleaning the data and obtaining the corresponding InChIKeys, I ran the model on the processed dataset using the following commands:

ersilia -v fetch eos30gr --from_github
ersilia serve eos30gr
ersilia -v api run -i processed_input.csv -o output.csv

The output generated by the model is saved in the file output.csv
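
Given the null outputs seen earlier when first fetching eos30gr, a quick sanity check on the output file can be useful before plotting. The sketch below is only an illustration: the identifier column names (`key`, `input`) and the prediction column name (`activity`) are assumptions about the file layout, not something confirmed in this thread.

```python
import csv
import io

NULL_TOKENS = ("", "nan", "null", "none")


def count_null_rows(handle, id_columns=("key", "input")):
    """Return (total_rows, rows_with_only_null_predictions).

    `id_columns` is an assumption about which columns identify the molecule;
    every other column is treated as a prediction column.
    """
    reader = csv.DictReader(handle)
    total = 0
    nulls = 0
    for row in reader:
        total += 1
        values = [v for k, v in row.items() if k not in id_columns]
        if all((v or "").strip().lower() in NULL_TOKENS for v in values):
            nulls += 1
    return total, nulls


# Small in-memory example; a real check would open output.csv instead.
sample = "key,input,activity\nK1,CCCC,0.12\nK2,CCO,\nK3,CCN,nan\n"
total, nulls = count_null_rows(io.StringIO(sample))
```

If `nulls` equals `total`, the model is probably returning null for every molecule and is worth re-fetching before going further.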

I used the predictions I got from the Ersilia Model Hub to create the necessary plots and see how they are distributed...

From the scatter plot, we can see significant overlap between the two classes, which makes distinguishing between them challenging. This overlap suggests that the features used for classification may not be distinct enough. Without a clear separation between the classes, the model may struggle to differentiate effectively between hERG blockers and non-blockers.

#

Completed Week 2 Task 1. Here is the link to the notebook for this task: 00_model_bias.ipynb

WEEK2 TASK2

Selected Table 6 from this repo, provided in the publication on page 32, where the authors took 1,824 FDA-approved small-molecule drugs from the DrugBank database.

After standardising the SMILES and removing null and duplicate values, I ran the model on the dataset using the following commands:


ersilia -v fetch eos30gr 
ersilia serve eos30gr
ersilia -v api run -i input_week2_task2.csv -o output_week2_task2.csv


* Then I compared the results from the publication with those generated by the eos30gr model. The objective was to determine whether both sources produce similar results.
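
One way to make such a comparison concrete is a per-molecule agreement rate between the published labels and the model's labels. This is a hedged sketch rather than the method used in the notebook: the 0.5 probability threshold and the toy values are assumptions chosen for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn blocker probabilities into 0/1 labels (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def agreement_rate(published, predicted):
    """Fraction of molecules on which two label lists agree."""
    if len(published) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not published:
        raise ValueError("label lists must not be empty")
    matches = sum(1 for a, b in zip(published, predicted) if a == b)
    return matches / len(published)


# Toy values for illustration only, not taken from the dataset.
published_labels = [1, 0, 0, 1]
model_probs = [0.9, 0.2, 0.7, 0.4]
model_labels = to_labels(model_probs)
rate = agreement_rate(published_labels, model_labels)
```

A per-molecule rate like this complements the aggregate blocker percentages, since two sources can report similar class proportions while disagreeing on individual molecules.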

<div style="display: flex; justify-content: space-around;">
    <div>
        <img src="https://github.com/ersilia-os/ersilia/assets/75848598/ae950928-eb9d-413e-a0be-f757c03dbac5" alt="LineChart_-vePredictiveProbability" width="358" />
        <img src="https://github.com/ersilia-os/ersilia/assets/75848598/42f76c5a-4ebb-4bfc-88a5-bb9e73f148a4" alt="BarChart_-vePredictiveProbability" width="402" />
    </div>
    <div>
        <img src="https://github.com/ersilia-os/ersilia/assets/75848598/d43a9847-c70a-4819-abf3-09b2e7bd6295" alt="LineChart_+vePredictiveProbability" width="358" />
        <img src="https://github.com/ersilia-os/ersilia/assets/75848598/cfd3d6e1-9ad5-46b0-aa13-17b49e832260" alt="BarChart_+vePredictiveProbability" width="402" />
    </div>
</div>

From the above graphs, it's very clear that there's a difference between the results obtained from the publication and those from the Ersilia Model Hub. This inconsistency suggests that the eos30gr model may not be reproducible.

Percentage of hERG Blockers and Non-Blockers in Publication Result:

| Blockers               | Number | Percentage |
|------------------------|--------|------------|
| Yes (hERG Blockers)    | 513    | 29.79%     |
| No (Non-Blockers)      | 1209   | 70.21%     |

Percentage of hERG Blockers and Non-Blockers After Testing from the Model:

| Blockers               | Number | Percentage |
|------------------------|--------|------------|
| Yes (hERG Blockers)    | 411    | 23.87%     |
| No (Non-Blockers)      | 1311   | 76.13%     |

These percentages also show a discrepancy: the publication reports 29.79% hERG blockers, while testing the model gives 23.87%. This suggests potential issues with the reproducibility of the model, so I concluded that model `eos30gr` is not reproducible.
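
The percentages in the two tables can be recomputed from the raw counts. The short sketch below uses the counts reported above; the function and variable names are illustrative only.

```python
def class_percentages(counts):
    """Convert class counts to percentages rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts taken from the two tables above.
publication = class_percentages({"blocker": 513, "non_blocker": 1209})
model = class_percentages({"blocker": 411, "non_blocker": 1311})

# Gap in the share of predicted blockers, in percentage points.
difference = round(publication["blocker"] - model["blocker"], 2)
```

Both tables cover the same 1,722 molecules, so the gap in the blocker share corresponds to 102 molecules labelled differently in aggregate.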

Here is the link to the [GitHub repository](https://github.com/IshitaPathak/model-validation-eos30gr/tree/master)
# 

## WEEK3 TASK
Selected a suitable dataset with sufficient experimental results, named [external_dataset_Xiao_Li.csv](https://github.com/IshitaPathak/model-validation-eos30gr/blob/master/data/external_dataset_Xiao_Li.csv), in the data folder.

Here is the [reference for the data](https://weilab.math.msu.edu/DataLibrary/2D/#ref9); I have taken the Li 1092 test data.
####
![Screenshot 2024-04-03 003723](https://github.com/IshitaPathak/model-validation-eos30gr/assets/75848598/84078cdd-2df6-47ba-9448-3b8e1d163952)

GemmaTuron commented 3 months ago

Hi @IshitaPathak

Thanks for the explanations, much clearer now, and good job on doing a PCA as well! Please move on to preparing your final application, many thanks!

IshitaPathak commented 3 months ago

Thank you so much @GemmaTuron. I really appreciate your time and feedback. I have started working on the final application.

IshitaPathak commented 3 months ago

WEEK 4 TASK ✅

Created the final application and received feedback from my mentor.
Submitted the final application on the Outreachy website.
