Adhivp commented 7 months ago

Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Install Docker if needed, and test another model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

[x] Select a model from the list suggested in GitBook
[x] Download and serve the model via the Ersilia Model Hub to ensure it works
[x] Open a repository on your GitHub user with all the necessary files
[x] Select and clean a dataset of 1000 molecules (example notebook 1)
[x] Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

[x] Find a suitable dataset with sufficient experimental results
[x] Clean and standardize the dataset
[x] Run predictions and calculate metrics.

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Adhivp commented 6 months ago

Successfully fetched the first simple model.

Adhivp commented 6 months ago

Docker is installed and docker pull worked. After the docker pull, I successfully served the model eos30gr.

I use a Mac M1, which is ARM-based. Some models are not supported on it, which is sad to hear.

Ran the second model, eos30gr, successfully, and here is the result.

Adhivp commented 6 months ago

Motivation letter

Hi, my name is Adhithyan vp. I am a data science student from Kerala, India. The motivation that led me to choose data science is much the same as my motivation for joining this program.

It was during high school that I found my passion for computers and tech, and when I started learning Python my interest in tech grew hugely. After that, my old laptop became so slow that I couldn't use it, so on a friend's suggestion I changed my OS from Windows to Linux (Kubuntu). (I recently bought a Mac M1 Air.) That's when I first saw the amazing world of open source. I was really amazed to see people contributing to world-class software for free and maintaining this community. That's when I decided to choose IT as my career.

Then came the most difficult part: choosing a field within tech. There were many options in front of me: cybersecurity, app development, web development, data science/AI, etc. What I did was try every technology bit by bit. I took beginner hacking courses and went to some Web3 hackathons. While I was trying each technology, I stumbled upon DALL-E from OpenAI. ChatGPT was not famous at that time; it was just in its early stage. The ability of DALL-E to draw anything from scratch from plain text amazed me, and I decided to choose data science/AI/DL/ML as my career path.

I then chose data science as my degree in college and started following my dreams. I started participating in many events and hackathons; details can be found on my LinkedIn - https://www.linkedin.com/in/adhithyanvp/. I worked on some open-source projects, all of them Python-based software. After that, I really wanted to work on something that was both open source and ML-based, and these two criteria aligned perfectly with the Ersilia organisation. It also had clear documentation and guidelines on what to do and how to do it, and I found the Slack community to be very friendly. That's why I chose Ersilia.

To be honest, I don't like or want to study chemistry, or be perfect in it. But my love for ML and tech is so huge that I am willing to do the work. The Ersilia Model Hub really inspires me as it has a lot of models in it, and my mind wants to test all of them, although I know that is not possible because of the time constraint. I really want to keep working on Ersilia even after this Outreachy contribution period. Please try to make it possible @DhanshreeA.

I hope I can make as many contributions to Ersilia as possible. Looking forward to completing all the tasks. Thank you for having the patience to read my motivation letter. Have a nice day!

Adhivp commented 6 months ago

Got the output successfully

Adhivp commented 6 months ago

Successfully completed task_1 (model bias). Here is the link - https://github.com/Adhivp/Ersilia_Tasks

Adhivp commented 6 months ago

Output for reproducibility task

Adhivp commented 6 months ago

Completed the reproducibility tasks - https://github.com/Adhivp/Ersilia_Tasks @DhanshreeA. Took Table S7 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769

Was unable to reproduce the probability values in the paper
Was able to reproduce 22 molecules as hERG blockers, while the paper identified 49 molecules as hERG blockers
Check the notebook for detailed analysis

Adhivp commented 6 months ago

@DhanshreeA Please give me your feedback so that I can fix anything that is wrong, and please suggest how to find a new dataset so that I can move on to the next week. Thank you @DhanshreeA for your valuable time.

GemmaTuron commented 6 months ago

Thanks @Adhivp We will provide feedback today and you can then proceed :)

DhanshreeA commented 6 months ago

Completed the reproducibility tasks - https://github.com/Adhivp/Ersilia_Tasks @DhanshreeA Took Table S7 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769
* Was unable to reproduce the probability values in the paper
* Was able to reproduce 22 molecules as hERG blockers, while the paper identified 49 molecules as hERG blockers
* Check the notebook for detailed analysis

Thank you for your work so far, good job! It appears that the model we have retrained may not have been trained correctly, which would explain the discrepancies between the results you have obtained and the results in the paper.

Adhivp commented 6 months ago

OK, thank you @DhanshreeA for looking into the reproducibility problem. Can I get guidance on what to do next?

Adhivp commented 6 months ago

I really wanted to do the third task from the task list and even had the time to do so, but I did not start it because I respect the words of @GemmaTuron in the Slack channel, who said not to do it. As both of my tasks were already finished without any additional changes needed, I decided to run one more dataset, Table S6, for the second task, and also to improve the tasks as much as I can.

Took Table S6 from the dataset of the original paper https://doi.org/10.1021/acs.jcim.8b00769

Adhivp commented 6 months ago

Then the model eos30gr started showing issues: it started giving me null outcomes. I tried everything: standardising the input, giving simple input, and running other models. Everything worked fine for the other models.

Adhivp commented 6 months ago

I then searched the whole Slack channel and the GitHub issues. Finally, in a thread, @GemmaTuron said to use fetch with the --from_github flag. I tried that as well, still with no result.

Adhivp commented 6 months ago

Instead of giving up, I ran the model in Google Colab. It took a whole 4 hours to get the output, and a bug in my code wasted another 4 hours, so in total I got the output after 8 hours (don't worry, I just set it running before sleep). Here are the results.

Screenshot 2024-03-24 at 8 00 58 AM

Screenshot 2024-03-24 at 8 01 08 AM

Adhivp commented 6 months ago

Then I did the analysis as usual, and here are the conclusions.

The predicted values don't match the values in the research paper
The values are entirely different from the paper, as the graph above shows
With a threshold of probability greater than 0.5, 410 molecules are predicted as blockers and 1318 as non-blockers
In the original research paper, out of 1,728 molecules, 526 are positive and the remaining 1,202 are negative
324 molecules match as blockers in both datasets
So the probability values could not be reproduced
410 molecules are predicted as blockers (324 is the real number, as the model gave many false positives)
More details with charts can be seen in this notebook (https://github.com/Adhivp/Ersilia_Contributions/blob/main/notebooks/eos30gr%20(main)/01_model_reproducibility(Table%20S6).ipynb)
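
The totals in the list above are enough to work out a full confusion matrix. Below is a minimal sketch of that arithmetic. It assumes the paper's 526 positives are treated as the ground truth and that "match as blockers" means molecules that are positive both in the paper and in the model output. The function names are hypothetical and are not taken from the notebook.

```python
def label_blockers(probabilities, threshold=0.5):
    """Return 1 (blocker) when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_from_totals(n_total, n_predicted_pos, n_actual_pos, n_both_pos):
    """Derive TP, FP, FN and TN from the four totals reported in the thread."""
    tp = n_both_pos                      # blockers in both the paper and the model
    fp = n_predicted_pos - n_both_pos    # predicted blockers the paper calls negative
    fn = n_actual_pos - n_both_pos       # paper blockers the model missed
    tn = n_total - tp - fp - fn          # everything else
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Totals reported for Table S6 in the comment above.
table_s6 = confusion_from_totals(
    n_total=1728, n_predicted_pos=410, n_actual_pos=526, n_both_pos=324
)
print(table_s6)
```

Under these assumptions, 86 of the 410 predicted blockers are false positives and 202 of the paper's 526 blockers are missed by the model.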

GemmaTuron commented 6 months ago

Hi @Adhivp

Thanks for your conclusions, which are right: there is a slight mismatch between the results in the paper and the model used in the Ersilia implementation, which we are currently fixing.

As we are in the last week of the contribution period, please go ahead and start preparing your final application since mentors will only be reviewing those this week.

Adhivp commented 6 months ago

Thanks @GemmaTuron

Adhivp commented 6 months ago

As I was told not to do Task 3 and I had enough time, I built and deployed a Streamlit app highlighting all of my contribution work. It provides features such as fully interactive graphs (which are not possible in a Jupyter notebook), an easily navigable interface, a full summary of what I have done, and background research on the model and the hERG gene. It took me some time to build this app, and I had many issues while deploying it. Anyway, after those hardships my hard work paid off, as I got a fully working app.

Adhivp commented 6 months ago

I tried my best to make the app visually appealing and to make the graphs easy to find for mentors or anybody using my app. The minor issues I faced while building the app can be seen in the fix commit messages in my original repo.

Adhivp commented 6 months ago

This is the link to my app - https://ersilia-contributions.onrender.com (it is hosted on a free service, Render, which is why it may occasionally show some lag)

This is the link to the subfolder of my repo with app files - https://github.com/Adhivp/Ersilia_Contributions/tree/main/streamlit_app

Adhivp commented 6 months ago

@DhanshreeA and @GemmaTuron, please review my final work before I submit the final application. Please give me your feedback so that I can fix anything that is wrong. Your words are also an inspiration for me and help me work on new, innovative ideas like this.

Adhivp commented 6 months ago

The graphs are fully interactive. Please feel free to play around with them and give me any suggestions for the app.

DhanshreeA commented 6 months ago

Hi @Adhivp what can I say, the app looks fun, I hope it was equally fun to build it. I am going to reiterate Gemma's words, please start working on your final application. You will not be penalized for not finishing task 3 due to delayed feedback.

Adhivp commented 6 months ago

Submitted the Final Application

Thank you for the review done by @DhanshreeA before submitting the application

Adhivp commented 6 months ago

Task 3

External dataset

https://www.nature.com/articles/s41598-019-47536-3#Sec18
filename - 41598_2019_47536_MOESM2_ESM.xlsx
Has 87,367 molecules; I will use a random 500 positive and 1000 negative for testing (1500 in total)
After removing the molecules in common (to avoid leakage), the 1500 molecules become 1287, of which 360 are positive and 927 are negative
The first 2967 rows are positive and the rest are negative in the large dataset
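
The sampling and leakage-removal steps above can be sketched as follows. This is an illustration only, not the code from the notebook: it assumes labels are assigned by row position (positives first), that "common" molecules are those already known from another set such as the training data, and that plain string comparison is good enough to spot them. In practice standardized SMILES or InChIKeys would be a safer key. The function name, arguments and toy identifiers are all hypothetical.

```python
import random


def build_validation_set(smiles, n_positive_rows, known_smiles,
                         n_pos=500, n_neg=1000, seed=42):
    """Sample positives and negatives by row position, then drop known molecules."""
    rng = random.Random(seed)                # fixed seed keeps the sample repeatable
    positives = smiles[:n_positive_rows]     # first rows are the positive class
    negatives = smiles[n_positive_rows:]     # all remaining rows are negative
    sampled = [(s, 1) for s in rng.sample(positives, n_pos)]
    sampled += [(s, 0) for s in rng.sample(negatives, n_neg)]
    known = set(known_smiles)                # set lookup keeps the filter fast
    return [(s, label) for s, label in sampled if s not in known]


# Toy data: placeholder identifiers, not real SMILES strings.
toy_smiles = [f"MOL{i}" for i in range(30)]
subset = build_validation_set(
    toy_smiles, n_positive_rows=10, known_smiles={"MOL0", "MOL29"},
    n_pos=5, n_neg=10,
)
print(len(subset), "molecules kept after removing overlaps")
```

For the dataset in this comment the call would use n_positive_rows=2967 with the default n_pos=500 and n_neg=1000, and the filter is what shrinks the 1500 sampled molecules to 1287.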

Adhivp commented 6 months ago

Did the model evaluation in Google Colab

Screenshot 2024-03-31 at 2 52 45 AM Screenshot 2024-03-21 at 6 53 54 PM

Took 3 hours to process 1287 molecules in Google Colab

Adhivp commented 6 months ago

Conclusion of Task3

Accuracy:
- The accuracy of the model is 74.90%, indicating that it correctly predicts the class labels for nearly three-quarters of the observations.
Sensitivity (True Positive Rate):
- The sensitivity of the model is 64.72%, indicating that it correctly identifies 64.72% of the actual positive cases.
Specificity (True Negative Rate):
- The specificity of the model is 78.86%, indicating that it correctly identifies 78.86% of the actual negative cases.
Precision (Positive Predictive Value):
- The precision of the model is 54.31%, indicating that when it predicts a positive case, it is correct 54.31% of the time.
Recall (Same as Sensitivity):
- The recall of the model is 64.72%, indicating the same as sensitivity.
Negative Predictive Value:
- The negative predictive value of the model is 85.20%, indicating that when it predicts a negative case, it is correct 85.20% of the time.
Balanced Accuracy:
- The balanced accuracy of the model is 71.79%, which is the average of sensitivity and specificity, providing a balanced view of the model's performance.
Matthews Correlation Coefficient:
- The Matthews correlation coefficient of the model is 0.41, indicating a moderate level of correlation between the predicted and true binary classifications.
F1 Score:
- The F1 score of the model is 59.06%, which is the harmonic mean of precision and recall, providing a balance between the two metrics.
AUROC (Area Under the Receiver Operating Characteristic Curve):
- The AUROC of the model is 71.79%, indicating the model's ability to distinguish between the positive and negative classes across various threshold values.
R2 Value:
- The R-squared value of the model is -0.25. A negative value means the predictions fit worse than a horizontal line at the mean (which would have an R2 value of 0), suggesting that the model does not fit the data well in the context of regression analysis.
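
The metrics above can all be recomputed from a single confusion matrix. The sketch below uses counts inferred from the reported percentages and the class sizes (360 positive, 927 negative): TP=233, FN=127, TN=731, FP=196. These counts are an assumption worked back from the numbers in this comment, not values copied from the notebook, and the function name is hypothetical.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute the reported metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)             # recall, true positive rate
    specificity = tn / (tn + fp)             # true negative rate
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # R2 of hard 0/1 predictions against 0/1 labels:
    # every misclassified molecule contributes a squared error of 1.
    positive_rate = (tp + fn) / n
    ss_res = fp + fn
    ss_tot = n * positive_rate * (1 - positive_rate)
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "r2": 1 - ss_res / ss_tot,
    }


# Counts inferred from the reported percentages (assumption, see above).
metrics = classification_metrics(tp=233, fp=196, fn=127, tn=731)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

When AUROC is computed from hard 0/1 predictions rather than from probabilities, it reduces to the balanced accuracy, which would explain why both are reported as 71.79%. Whether the notebook used labels or probabilities for that metric is an assumption here.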

Adhivp commented 6 months ago

Completed Task 3. Here is the link - https://github.com/Adhivp/Ersilia_Contributions/blob/main/notebooks/eos30gr%20(main)/02_external_validation.ipynb

Adhivp commented 6 months ago

As per your availability, please review my last task @DhanshreeA @GemmaTuron

Adhivp commented 6 months ago

https://ersilia-contributions.onrender.com - Added Task 3 to my app. Please feel free to check all the graphs and tables, as everything is interactive and easy to use.

Adhivp commented 6 months ago

I am delighted to have completed all my tasks, done extra work, and made an interactive app to show my results. Thank you @DhanshreeA @GemmaTuron for your support. Also, big thanks to the community, as I could help many people and get help from them.

This journey has been really memorable.

ersilia-os / ersilia

✍️ Contribution period: Adhithyan vp #1025
