ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Taliqa Muhib #993

Closed Talikamuhib closed 3 months ago

Talikamuhib commented 4 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

Week 1 - Get to know the community

Channels:

Package Plan

environment location: /home/taliqamuhib/miniconda3/envs/ersilia

added / updated specs:

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
  ca-certificates    pkgs/main/linux-64::ca-certificates-2023.12.12-h06a4308_0
  certifi            pkgs/main/linux-64::certifi-2022.12.7-py37h06a4308_0
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1
  libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_0
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
  libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
  ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
  openssl            pkgs/main/linux-64::openssl-1.1.1w-h7f8727e_0
  pip                pkgs/main/linux-64::pip-22.3.1-py37h06a4308_0
  python             pkgs/main/linux-64::python-3.7.16-h7a1cb2a_0
  readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
  setuptools         pkgs/main/linux-64::setuptools-65.6.3-py37h06a4308_0
  sqlite             pkgs/main/linux-64::sqlite-3.41.2-h5eee18b_0
  tk                 pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0
  wheel              pkgs/main/linux-64::wheel-0.38.4-py37h06a4308_0
  xz                 pkgs/main/linux-64::xz-5.4.6-h5eee18b_0
  zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_0

Proceed ([y]/n)? y

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#

To activate this environment, use

#

$ conda activate ersilia

#

To deactivate an active environment, use

#

$ conda deactivate

- Installed [Docker](https://docs.docker.com/desktop/wsl/) to ensure seamless integration with Ersilia's environment.

(ersilia) taliqamuhib@Taliqa-Muhib:~$ pip install docker
Collecting docker
  Using cached docker-6.1.3-py3-none-any.whl (148 kB)
Collecting requests>=2.26.0
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting packaging>=14.0
  Downloading packaging-24.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.5/53.5 kB 278.7 kB/s eta 0:00:00
Collecting urllib3>=1.26.0
  Using cached urllib3-2.0.7-py3-none-any.whl (124 kB)
Collecting websocket-client>=0.32.0
  Using cached websocket_client-1.6.1-py3-none-any.whl (56 kB)
Requirement already satisfied: certifi>=2017.4.17 in ./miniconda3/envs/ersilia/lib/python3.7/site-packages (from requests>=2.26.0->docker) (2022.12.7)
Collecting idna<4,>=2.5
  Using cached idna-3.6-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (136 kB)
Installing collected packages: websocket-client, urllib3, packaging, idna, charset-normalizer, requests, docker
Successfully installed charset-normalizer-3.3.2 docker-6.1.3 idna-3.6 packaging-24.0 requests-2.31.0 urllib3-2.0.7 websocket-client-1.6.1
(ersilia) taliqamuhib@Taliqa-Muhib:~$ wsl.exe -l -v
  NAME STATE VERSION

# clone from github
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
# install with pip (use -e for developer mode)
pip install -e .

Talikamuhib commented 4 months ago

Hello World! 👋

ABOUT ME!

Hi! I am Taliqa Muhib, a curly-haired Pakistani girl from the Karakorum mountains who often codes, sings, and writes, and is a part-time gamer. 💻

HOW I GOT INTO CS AND AI?

Being a girl growing up in a rural area with limited resources, surrounded by stereotypical, discouraging views about women's education, it was always my dream to expand my expertise to an international level. Breaking with convention, I left home and got admission to a public university 500 miles away. During my bachelor's degree I took some online courses where I built my machine learning basics, and I applied them in my research, for which I received a distinction. With that zeal, I am currently pursuing my master's in computer science. My fascination with AI and research has led me into the realms of machine learning, deep learning, and generative AI. I'm eager to apply these skills in real-world scenarios, and I believe Outreachy is the best platform. 🌍💻📚

WHY ERSILIA IN OUTREACHY?

After I received the email confirming that my initial application had been selected, I came across Ersilia's search for interns, and it felt like the perfect opportunity to align my interests and expertise. Their focus on Open Source Artificial Intelligence for Neglected Diseases resonated deeply with me. Having experienced firsthand the challenges of limited access to healthcare services and the devastating impact of neglected diseases in rural areas, I felt a strong connection to their mission. 🤝🏥💡

While reading up on @Ersilia, I watched a video that truly struck a chord: @GemmaTuron, the CEO of Ersilia, described the disparity in the treatment of diseases between rural areas in developing countries and developed nations. This inequality in access to healthcare and treatment options matched my own experiences growing up in a rural area with limited resources. 💔🌐

When I was in the last year of my bachelor's, one of my friends experienced the unbearable pain of renal stones. Witnessing her suffering was heartbreaking, especially knowing that she had been misdiagnosed and was taking painkillers, which eventually led to stomach problems. This experience pushed me to research the early detection of renal stones from CT images using vision transformers (NOVEL at that time). I have the passion to stay up and work through sleepless nights with coffee. My goal is to bring innovation to healthcare.

WHY ME?

My background and skills make me uniquely suited to contribute to Ersilia's work. I have hands-on experience in prompting and, as a former ML engineer, solid foundations in ML, DL, and generative AI. By participating in the Outreachy internship, I hope not only to expand my horizons but also to make a tangible difference in the lives of those affected by neglected diseases. I am driven by the belief that by combining technology and community-driven efforts, we can save lives and improve access to healthcare services for all.

Talikamuhib commented 4 months ago

Use model through CLI

My 1/2 work of WEEK 1 with @ersilia-os

Recap

I read about @ersilia-os and checked the website. I am truly AMAZED by the great minds working there. After that, I went through the HANDBOOK they provide as a contribution guide. As per the instructions, I downloaded and installed Ubuntu. After that, I installed Ersilia and all its dependencies, and installed Docker as well. Along the way I opened this issue, where I shared why I really want to contribute and be part of Ersilia as an Outreachy intern.

Testing Ersilia with Docker

To test Ersilia with Docker, I installed Docker for Windows, and a funny thing happened: I was fetching a model while Docker was installing. At the end there was an option to restart. I thought it would just restart the program, but the laptop restarted! THE MOMENT I REALISED I .....

Talikamuhib commented 4 months ago

Week 2 - Get Familiar with Machine Learning for Chemistry

Recap

Before starting Week 2 we had a meeting with the Ersilia team. They introduced themselves and guided us on how to contribute during the current contribution period! They shared the papers behind the models with us so we could get our hands dirty and get to know ML for chemistry.

Select a model from the list suggested in GitBook

This week's tasks were pretty interesting: we had to select a model that makes us want to dive deeper into it.

I selected Ersilia eos6oli, the SolTranNet model from the paper "SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction" by Francoeur and Koes. The reason is simple! I JUST WANT TO KNOW WHETHER SUGAR DISSOLVES FASTER THAN SALT! :) I really love to play with transformer-based models.

SolTranNet is a molecule attention transformer (MAT) that predicts aqueous solubility from a molecule's SMILES representation. It is a regression model: it predicts LogS (the log of the solubility), which can be used to filter out insoluble compounds! It is fine-tuned from the pretrained MAT model, and it applies self-attention to a molecular graph representation of the molecule.

image

SolTranNet's dependencies are RDKit (2017.09.1+), NumPy (1.19.3), PyTorch (1.7.0+), and pathlib (1.0+). SolTranNet achieves a sensitivity of 94.8% on the Second Challenge to Predict Aqueous Solubility (SC2) data set and is competitive with the other methods submitted to the competition.

Data sets. AqSolDB is the data set the authors used for training SolTranNet, as it was the largest publicly available set. The ESOL data, shown in orange in the graph below, was used while training the MAT model.

Screenshot 2024-03-10 214848

Ersilia model eos6oli implementation

download (2)

Here we can see that most of the compounds, almost 65% (shown in red), have LogS < -4 and are therefore insoluble in the human body. Model bias will be evaluated once I run SolTranNet and compare the two. FOR NOW, THIS IS WHAT I HAVE DONE.

References

Francoeur, P. G., & Koes, D. R. (2021). SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction. Journal of chemical information and modeling, 61(6), 2530–2536. https://doi.org/10.1021/acs.jcim.1c00331

DhanshreeA commented 4 months ago

Hi @Talikamuhib good job so far! The reason the implementation in the Ersilia Model Hub does not generate predictions directly is because different users of the model might want to keep varying thresholds for binarizing the outcome.

A few comments on your work:

  1. In your repo could you create a figures folder and save the figures there?
  2. Perhaps you can try experimenting with different thresholds of solubility and comment on the results you obtain.

Talikamuhib commented 4 months ago

@DhanshreeA Thank you for the feedback! That "good job" literally 10X'd my motivation and initiated turbo mode to contribute. Thank you for clearing up my confusion about the model's threshold!
I will act on your comments ASAP! I have a load of work to do on it, and I will try more focused graphs and charts for better visualization.

PS. I want to get predictions on something that is totally new to the model! As it is trained on a load of data, I feel it has already seen this data. Can you help me find a dataset of new compounds from 2022-23-24, with SMILES and their labels?

Talikamuhib commented 4 months ago

MY 3/4 Work of Week 2 with @Ersilia-os

Recap:

I have started working on week 2, found some basic results, and managed to fetch the selected model with Docker! I can run predictions from the CLI now! That's so cool! I tested the model on Colab and found very basic results, but I want to try via the CLI and analyse, and also check whether the results change or not!

Run predictions for the 1000 molecules, create the necessary plots and explain the results you are obtaining

The week gets more and more interesting as I dive in. The aim of this week is to find out whether the selected models are accurate and reproducible. SO THE LONG STORY BEGINS with checking Model Bias: is the model generalizing properly or not? After that comes Reproducibility: are the trends and insights of the eos6oli predictions similar to the results in the model's paper, or are they different? AND LAST BUT NOT LEAST, "Performance": how will the model react to unseen data? Let the story begin!

eos6oli

T1 Model bias: I ran the predictions in the first quarter of the week but was not able to give good insights through visualizations. The POINT TO PONDER was that the model predicts LogS, a measure of solubility in the human body! In the paper, it is clearly mentioned that a compound with LogS less than -4 is insoluble, one between -4 and -2 is partially soluble, and one greater than -2 is soluble! The figure classifies the predictions into soluble, partially soluble, and insoluble.
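A minimal sketch of the three-way classification described above, in case it helps with the threshold experiments. The function names, the handling of values exactly at -4 or -2, and the example values are my own assumptions for illustration; they are not taken from the SolTranNet or Ersilia code.

```python
from collections import Counter


def classify_log_s(log_s, insoluble_below=-4.0, soluble_above=-2.0):
    """Map one predicted LogS value to a solubility class.

    Values exactly on a threshold are treated as "partially soluble"
    (an assumption; the thread does not say how ties are handled).
    """
    if log_s < insoluble_below:
        return "insoluble"
    if log_s > soluble_above:
        return "soluble"
    return "partially soluble"


def class_shares(predictions, **thresholds):
    """Return the percentage of predictions that fall into each class."""
    counts = Counter(classify_log_s(value, **thresholds) for value in predictions)
    total = len(predictions)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Made-up LogS values for illustration only (not model output).
example_predictions = [-6.1, -5.5, -4.2, -3.0, -1.5]
shares = class_shares(example_predictions)
stricter_shares = class_shares(example_predictions, insoluble_below=-5.0)
```

Passing a different `insoluble_below` or `soluble_above` shows how sensitive the class shares are to the chosen cut-offs.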

Histogram-soluble-insoluble-slighly soluble

For a better overview of the predicted results, a pie chart describes the overall distribution well.

piechart

Here we can see that most of the data samples are predicted to be insoluble. This can be visualized even better as follows:

Histogram-Density-plot

Here we can see that most of the samples fall around -5.5, which is not too far from -4 (the threshold between soluble and insoluble)! The hump is shifted towards the left. My hypothesis here is that the model is slightly biased. Let's test this hypothesis and improve on it!

SolTranNet

T2 Reproducibility: After reading the paper a couple of times, I recognized some important terms that I had come across in high school! I ran the predictions on the same data used for the Ersilia eos6oli predictions and found only minor differences. Overall, the SolTranNet predictions were mostly similar to the eos6oli predictions on the same data.

In this figure, the soluble and insoluble predictions are similar; only 0.10% differ.

histogram-2-SolTranNet

Similarly:

histogram-3-SolTranNet

Moreover, this figure shows the majority of predictions at around -5.5:

histogram-all-SolTranNet

The overall distribution is also similar.

piechart - SolTranNet

The main difference I found is the time! Predictions from Ersilia eos6oli using Colab took 60.33 seconds, while SolTranNet took just 3 seconds.
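The reproducibility check above can also be put into numbers. Here is a rough sketch that compares two lists of LogS predictions for the same molecules; the function name, the returned keys, and the example values are hypothetical, and the -4 threshold is the one used earlier in this thread.

```python
import math


def compare_predictions(preds_a, preds_b, threshold=-4.0):
    """Summarize how closely two prediction lists agree, molecule by molecule."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(preds_a)
    diffs = [abs(a - b) for a, b in zip(preds_a, preds_b)]

    # Pearson correlation between the two sets of predictions.
    mean_a = sum(preds_a) / n
    mean_b = sum(preds_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(preds_a, preds_b))
    var_a = sum((a - mean_a) ** 2 for a in preds_a)
    var_b = sum((b - mean_b) ** 2 for b in preds_b)
    if var_a > 0 and var_b > 0:
        pearson = cov / math.sqrt(var_a * var_b)
    else:
        pearson = float("nan")  # undefined when one list is constant

    # Share of molecules that land on the same side of the threshold.
    agree = sum((a < threshold) == (b < threshold) for a, b in zip(preds_a, preds_b))
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / n,
        "pearson_r": pearson,
        "class_agreement_pct": 100.0 * agree / n,
    }


# Made-up values standing in for eos6oli and SolTranNet outputs.
eos_example = [-5.5, -4.1, -3.9, -2.0]
stn_example = [-5.4, -4.0, -3.9, -2.1]
report = compare_predictions(eos_example, stn_example)
```

A small maximum difference together with a class agreement close to 100% would support the claim that both implementations give the same predictions.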

Testing of TEST DATASET of SolTranNet

Before searching for a large dataset that consists mostly of new compounds, I ran the predictions on the TEST data of SolTranNet! These are the results, where I compared the real experimental values with the predictions of SolTranNet and Ersilia eos6oli.

| Experimental y | SolTranNet | eos6oli |
| --- | --- | --- |
| y - 1 | 1 - solubility | new |
| | solubility - 2 | roc eos6oli |
| y -4 | solu 4 | eso6oli |
| y -3 | solu 3 | eos6oli66 |
| | Accuracy: 0.894800120228434 | Accuracy: 0.894800120228434 |
| | Precision: 0.9035404624277457 | Precision: 0.9035404624277457 |
| | Recall: 0.9678792569659442 | Recall: 0.9678792569659442 |
| | F1 Score: 0.9346038863976084 | F1 Score: 0.9346038863976084 |
| y-5 | sol 5 | eos 5 |
| | AUROC: 0.8775305174392753 | AUROC: 0.8775305174392753 |
| | R2 Score: 0.24743684531626464 | R2 Score: 0.24743684531626464 |
| | confusion M | confusion eos |

As we can see here, SolTranNet and Ersilia's eos6oli provide similar results: the performance of both on the test data is the same.
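For reference, this is roughly how classification metrics like the ones above can be computed from LogS values. It is only a sketch: I assume the positive class is "insoluble" (LogS < -4), which may not match the convention used in the notebook, and the example numbers are made up.

```python
def binarize(values, threshold=-4.0):
    """Turn LogS values into booleans: True means insoluble (assumed positive class)."""
    return [value < threshold for value in values]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from two boolean lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up experimental and predicted LogS values, for illustration only.
experimental = [-5.0, -4.5, -3.0, -2.0, -4.2, -3.9]
predicted = [-5.2, -3.8, -3.5, -4.1, -4.6, -2.0]
metrics = binary_metrics(binarize(experimental), binarize(predicted))
```

Running the same function once with the SolTranNet predictions and once with the eos6oli predictions gives the two metric columns to compare.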

DhanshreeA commented 3 months ago

Hi @Talikamuhib awesome work so far! I apologize I missed your comment earlier:

PS. I want to get prediction on something which is totally new for model! as it is trained on load of data! i feel it has been already seen this! can you help me out in finding new compounds dataset of 2022-23-24 of smiles and its labels!

Can you confirm if you still need this and I can look into this. :)

Talikamuhib commented 3 months ago

Hi @DhanshreeA, thank you for hyping my motivation 100X! It would be great if you could help me find them! At the moment I am running predictions on the test dataset that was used to test the model. If I get new data, it will be a novel experiment, and it will show whether the model is still applicable to newly generated compounds or not! Moreover, kindly share your comments on the work I have done so that I can work more to make it a worthy contribution to Ersilia.

Talikamuhib commented 3 months ago

Week 3 - Validate a Model in the Wild

Recap:

Interestingly (I WAS SO EXCITED TO WORK), I started working on week 3 during week 2, thinking it was part of week 2. I used the SolTranNet test dataset here, specifically to keep DATA LEAKAGE as low as possible. I ran the test data just to see whether both models behave similarly and to produce some interesting visualizations. Moreover, to build the AUROC curves I used a solubility threshold of -4, as given in the paper! It is not possible to build AUROC curves for a regression task, so I converted it to binary classification and gathered the results.
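To make the regression-to-classification step concrete, here is a small sketch of an AUROC computed from LogS values. It uses the pairwise-ranking definition of AUROC instead of a library call, treats "insoluble" (experimental LogS < -4) as the positive class, and uses the negated prediction as the score. These choices and the example values are my assumptions, not necessarily what the notebook does.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as half a win. The pairwise loop is quadratic, which is fine
    for a sketch but slow for very large datasets.
    """
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up LogS values, for illustration only.
experimental = [-5.0, -4.5, -3.0, -2.0]
predicted = [-5.2, -3.8, -3.5, -4.1]

# Experimental values give the binary labels; a lower predicted LogS means
# "more likely insoluble", so the score is the negated prediction.
labels = [value < -4.0 for value in experimental]
scores = [-value for value in predicted]
example_auroc = auroc(labels, scores)
```

Only the experimental values are thresholded; the predictions stay continuous, which is what lets the ROC curve sweep over all possible cut-offs.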

Find a suitable dataset with sufficient experimental results

Finding a dataset was a huge journey for me! But this made my life as easy as a piece of cake. Moreover, I found the data I am using AS THE WILD DATASET, which is a suitable dataset with sufficient experimental results, and is clean and standardized.

image

After analysing this table, I decided to take the ChEMBL dataset, which has less redundancy and smaller differences in LogS values.

Results

The best metrics for evaluating the predictions are RMSE and Mean Absolute Error, so the results were:

Mean Absolute Error (MAE): 1.3373231591422308
Root Mean Squared Error (RMSE): 1.7280959712008543

And I really did not want to share this, but R2 is negative. R-squared value: -0.8145716984892839
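A negative R2 is not a bug in itself: R2 = 1 - SS_res / SS_tot drops below zero whenever the predictions have a larger squared error than simply predicting the mean of the experimental values. A minimal sketch with made-up numbers (the function name and data are hypothetical):

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 for paired experimental and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty lists of equal length")
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(err * err for err in errors)        # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return {
        "mae": sum(abs(err) for err in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up LogS values: the predictions are in reverse order of the truth,
# so they do worse than a constant "predict the mean" baseline.
true_values = [-3.0, -4.0, -5.0]
reversed_predictions = [-5.0, -4.0, -3.0]
mean_baseline = [-4.0, -4.0, -4.0]

bad = regression_metrics(true_values, reversed_predictions)
baseline = regression_metrics(true_values, mean_baseline)
```

Here the mean baseline scores an R2 of exactly 0, and anything that fits the data worse than that baseline goes negative.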

As the ChEMBL dataset is 3 times bigger than the training data, training on more data could give better results!

DhanshreeA commented 3 months ago

Hi @Talikamuhib good work, however, I think there has been some confusion. I see that you have utilized the testing dataset from SolTranNet for the model validation task (i.e. Week 3 T3). Can you confirm if that is indeed the case?

I also have some comments around clean up:

  1. Can you rename the data files and add a README within the data directory that mentions exactly which file is what? I am looking for something like: reference_library.csv or 1000_mols.csv (for task 1), test_data.csv (for task 2), and external_validation.csv (this is for task 3).
  2. Regarding obtaining a dataset, I would suggest you visit any external databanks such as ChEMBL or PubChem, and look for a few compounds (identified by InChIKey) that are neither in the training set of SolTranNet nor in the test set used for task 2. Make sure these have an experimentally determined value. You don't need to bother with which year they are from, they should just not be repeated in the datasets you have used for tasks 1 and 2. Hope this helps!
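The InChIKey filtering suggested in point 2 could look roughly like this. It is only a sketch: it assumes every record already carries an `inchikey` field (generating the keys, e.g. with RDKit, is not shown), and the function name and the tiny example records are placeholders, not rows from AqSolDB, ChEMBL or PubChem.

```python
def external_validation_candidates(candidates, *used_datasets):
    """Keep candidates whose InChIKey is absent from every dataset already used.

    Each record is a dict with an "inchikey" entry. Repeated keys inside the
    candidate list itself are also dropped, keeping the first occurrence.
    """
    seen = set()
    for dataset in used_datasets:
        seen.update(record["inchikey"] for record in dataset)

    kept = []
    for record in candidates:
        key = record["inchikey"]
        if key not in seen:
            kept.append(record)
            seen.add(key)
    return kept


# Placeholder records for illustration only.
training_set = [{"name": "water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}]
test_set = [{"name": "ethanol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}]
candidate_set = [
    {"name": "ethanol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"name": "benzene", "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"},
    {"name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"name": "benzene", "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"},
]
external = external_validation_candidates(candidate_set, training_set, test_set)
```

Matching on the InChIKey avoids false mismatches that plain SMILES strings can produce, since the same molecule can be written as several different SMILES.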

Remember, you do not need to finish all the tasks. Creating a final application is more important!

Talikamuhib commented 3 months ago

Hi @DhanshreeA, thank you for the feedback.

You are right about my using the test dataset for T3! I used it to prepare my notebook for the wild dataset, so that when I get that data, I can just run the commands and add the visualizations and comments.

As per your suggestions, I have added the README.md.

I found a very amazing page that had the data. According to the article I found, which includes a redundancy matrix, AqSolDB and ChEMBL have just 1.76% redundancy, so ChEMBL is the better one to use. So I used it!

DhanshreeA commented 3 months ago

Looks good @Talikamuhib please create your final application!