Closed Ashita17 closed 3 months ago
Hi @Ashita17
We are on week 3 of the contribution period. Please let us know within the next 2 days if you plan to continue with your contribution; otherwise we will close this issue so we can focus on the applicants who want to make a final application to Ersilia.
I want to continue. I was just having some problems with task 2 of week 2; I will soon post the GitHub repo link for the deepherg model that I am currently working on.
I wrote a motivation statement in the Outreachy contribution itself. However, I can see that people have written it on their issues. Sorry for the misunderstanding. Please review it here as well. @GemmaTuron @DhanshreeA
MOTIVATION STATEMENT
I wanted to join Outreachy to work with a community of supportive mentors and peers, something missing in my college, where technical clubs are more or less a gendered space that believes only boys have good analytical and mathematical ability and girls are not at par with them. I am one of only two girls among the 60 people in my college's data science club. I have a strong mathematical background, as I rigorously prepared for and cracked the JEE exam in India.
I chose Ersilia because of the nature of the project. I was particularly interested in doing an ML project where I could learn to run ML models efficiently. I have done some quality data science projects in my college club, and it's always fun to tackle the new issues that come along, especially when we are trying to improve the dataset (for example, making the frequency of the inputs more uniform) or when we are trying to improve the ML model by changing various parameters, sometimes ending up employing a new method altogether in the end!
Through this contribution phase and hopefully the internship, I want to improve my skills to become a proficient data scientist who is good at handling data as well as developing new models. I particularly like machine learning because it is much more analytical and teaches one patience owing to its rigorous mathematical background. I read “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” and it deepened my curiosity to explore this field further.
After this internship, I plan to work as a quantitative research analyst at a renowned firm like Goldman Sachs or JPMorgan Chase & Co. to gain experience with real-life projects in an environment of highly skilled individuals, and maybe after a few years go for higher studies in machine learning. Most importantly, I want to encourage young girls to put themselves out there in male-dominated fields, most of which require mathematical and analytical ability. I would like to volunteer to teach them maths or the concepts of ML once I have the necessary skills.
Sorry for adding the work late; I misunderstood and thought everything had to be attached in the form of a contribution on the Outreachy website.
I learned how to work with Conda environments and manage packages. I faced various errors while trying to run the Ersilia model, such as difficulty with relative imports and some problems related to the JSON file. However, after trying to fetch the model multiple times, I successfully installed Ersilia and fetched the eos3b5e model as well.
Here I faced some difficulty, as I was using Docker for the first time and had a lot of trouble fetching the model: the same error kept coming up ("container is not ready yet", etc.). However, I later switched to GitHub Codespaces and was able to fetch, serve and run the model eos4wt0.
I have chosen the model eos30f3. I successfully fetched and served the model and ran predictions on the 1000 molecules given in the reference_library.csv file. I am halfway done with week 2 task 1; I still have to evaluate the model and check its reproducibility to complete the week 2 tasks. Please click on this link to see the progress.
Hi @Ashita17
Can you provide a short summary of week 2 tasks when you complete them so we can provide feedback?
Definitely. I have been facing some issues with task 2 of week two, but I'll share the summary very soon. Sorry for falling behind schedule, and thank you for your concern, it means a lot.
There were two sets of models we were supposed to choose from: hERG models and ADME models. I was very excited to experiment with the hERG models. Why? Because they deal with cardiotoxicity, the biology topic that excited me the most when I was in school!
I initially chose the model eos30gr (deepherg), simply because its publication beautifully explains the multi-task deep neural network, and it looked really fun to dig into the Mol2vec and MOE descriptors and learn more and more about both chemistry and machine learning. However, after trying numerous times, it kept giving me the same null result, and I wasn't able to resolve the error even after my peers helped me on Slack, for which I am really grateful 😇.
I finally decided to try another model at random and tested eos30f3 (dmpnn-herg). This model worked fine for all the SMILES and produced outputs, but I noticed something odd when I started evaluating the predictions I ran on the reference library shared with us on Slack: the predictions showed far more blockers than non-blockers in the dataset, which is very unlikely. I couldn't trust its results and started going through all the models again.
I was on the verge of giving up when I finally stumbled upon the BayeshERG model (eos4tcc). The interesting thing about this model is that while the other models just give the hERG blocking activity, it gives two more parameters: epis (epistemic probability) and alea (aleatoric probability). As the names suggest, this model deals not only with predictions but also with the uncertainties that come along with them, which greatly increases its reliability. I have a deep interest in statistics and probability, and surely this was THE model for me!
It's a Bayesian graph neural network model.
The model takes 1000 SMILES as input from the dataset "1000_values_dataset.csv". It generates the InChIKey and gives the following three outputs:
Comment on uncertainty: the total uncertainty, or VARIANCE, is the sum of the quantified epistemic and aleatoric uncertainties.
I successfully fetched the model, served it and it gave me interpretable predictions in one go 🥹!
I went on to develop these results from the model:-
As discussed before, aleatoric uncertainty arises from the inherent randomness of the data, and this component of uncertainty cannot be reduced even by gathering more information. On the other hand, epistemic uncertainty arises from a lack of information and can therefore be reduced by gathering more data. In this section, I would like to analyse the data for aleatoric and epistemic uncertainties to see whether the model is lagging due to a lack of information.
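The split between the two uncertainty components can be sketched in a few lines of pandas. The column names `proba`, `epis` and `alea` follow the model's three outputs described above, but the numeric values below are made-up placeholders for illustration, not real eos4tcc output:

```python
import pandas as pd

# Toy predictions shaped like the eos4tcc output (values are illustrative only)
df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CCN(CC)CC"],
    "proba":  [0.82, 0.35, 0.55],   # predicted hERG blocking probability
    "epis":   [0.010, 0.040, 0.120],  # epistemic (reducible) uncertainty
    "alea":   [0.050, 0.030, 0.060],  # aleatoric (irreducible) uncertainty
})

# Total uncertainty (variance) = epistemic + aleatoric
df["variance"] = df["epis"] + df["alea"]

# If epistemic uncertainty dominates for a molecule, more training data could
# help; dominant aleatoric uncertainty is noise inherent to the data itself.
df["dominant"] = (df["epis"] > df["alea"]).map({True: "epistemic", False: "aleatoric"})

print(df[["smiles", "variance", "dominant"]])
```

Molecules where `dominant` is "epistemic" are exactly the cases where the model is "lagging due to a lack of information".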
Hyunho Kim, Minsu Park, Ingoo Lee, Hojung Nam. "BayeshERG: a robust, reliable and interpretable deep learning model for predicting hERG channel blockers". Briefings in Bioinformatics, Volume 23, Issue 4, July 2022, bbac211. Click here to see the publication paper.
@GemmaTuron @DhanshreeA, I have completed task 1 of week 2 and am currently working on task 2 of week 2. Kindly review my tasks and suggest changes, if any. Posting the link to the repo again here. Thank you :).
@Ashita17 good work so far, you are in the right direction with Task 1. However I would recommend creating a final application starting next week and linking this issue there. Please do not attempt task 3 as we would not be able to provide feedback on that at this point. I can review your Task 2 work on Monday.
For this task we were supposed to do the following:
This was the longest process. I use a MacBook Air M1, and the model specifically required CUDA, which is not supported by the M1 silicon chips, so I decided to work on Colab. There as well, it was difficult to create a Conda environment, and the terminal can only be used by users with pro access. So finally, I decided to use a friend's laptop. Here too I made some syntax errors again and again while following the instructions, but finally I saw this on my screen and I was on cloud nine!
I set the sampling time to 100. The default the model mentions on the site is 30, but since the authors used a sampling time of 100 while testing the model, I chose 100 as well.
I proceeded to make comparisons between the original BayeshERG implementation and the eos4tcc model, and came up with the following analysis.
Click here to see the notebook.
Similarity:
Differences:
Let's look at pie charts for even better visualisation!
Similarity:
Differences:
There are significant differences when it comes to the uncertainties.
Since the model focuses on both predictions and uncertainties, let's talk about both!
THUS, THE MODEL IS REPRODUCIBLE, BUT IT NEEDS A LOT OF IMPROVEMENT, AS APPRECIABLE DIFFERENCES EXIST!
@DhanshreeA @GemmaTuron kindly review my task 2 for week 2. Putting the link here for the notebook, thank you!
For this task we were supposed to do the following:
I found some datasets on ChEMBL that give the IC50 values for different molecules. I downloaded one CSV file and manually deleted some columns, as pandas could not read it due to some delimiters and special characters. Finally, the dataframe looked somewhat like this.
That's right, it has 13,396 values !
After all these steps, the dataset, which initially had 13,396 molecules, ended up with just 2218 molecules (1255 positive molecules, or hERG blockers, and 963 negative molecules, or non-hERG blockers).
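The filtering and labelling step can be sketched with pandas. The column names, the made-up IC50 values, and the 10 µM cut-off for calling a molecule a blocker are all assumptions here (10 µM is a common convention in the hERG literature, but the exact thresholds used in my notebook may differ):

```python
import pandas as pd

# Toy stand-in for the ChEMBL download (values and column names are assumed)
raw = pd.DataFrame({
    "smiles":  ["CCO", "CCCCO", "c1ccccc1O", "CCN"],
    "ic50_nM": [500.0, 25000.0, 9000.0, None],
})

# Drop molecules without a measured IC50
clean = raw.dropna(subset=["ic50_nM"]).copy()

# Label: IC50 < 10 uM (10,000 nM) -> hERG blocker (1), otherwise non-blocker (0)
clean["label"] = (clean["ic50_nM"] < 10_000).astype(int)

print(clean["label"].value_counts())
```

Rows that fail the quality filters are what shrinks the dataset from 13,396 molecules down to the 2218 used for evaluation.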
I fed the evaluation_dataset to the eos4tcc model and started evaluating the results.
I developed the following confusion matrix:-
Metric | Value
---|---
Accuracy | 0.497
Sensitivity | 0.3179
Specificity | 0.732
Precision / Positive Predictive Value | 0.607
Recall | 0.3179
Negative Predictive Value | 0.4516
Balanced Accuracy | 0.525
Matthews Correlation Coefficient | 0.05429
F1 Score | 0.417
CONCLUSION: Overall, we can say that our model has an appreciable bias towards reporting examples as negative while under-reporting the positive examples.
AUROC: Its value is 0.5, which means the model is more or less randomly classifying the examples as positive or negative. This happens because our model is very good at identifying the negative examples, often overestimating their number, while under-reporting the positive examples. This results in an overall random performance.
R2 Score: It's -1, which means the model performs worse than a naive baseline: it would be better to report the average value of the dependent variable (here, the label) than to report predictions from the model.
The area under the precision-recall curve is 0.5, which means the trade-off between precision and recall is equivalent to random chance and the model is ineffective at distinguishing between the negative and positive classes.
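These ranking metrics also come straight from scikit-learn. The toy scores below are made up (not the real evaluation data), but they are chosen so that the metrics behave like the random classifier described above: AUROC of 0.5 and an R2 of -1 for the thresholded predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, r2_score

# Toy labels and predicted blocking probabilities (illustrative only)
y_true = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.8, 0.3])

auroc = roc_auc_score(y_true, scores)                 # 0.5 -> random ranking
auprc = average_precision_score(y_true, scores)       # area under PR curve
r2    = r2_score(y_true, (scores > 0.5).astype(int))  # -1 -> worse than mean baseline

print(auroc, auprc, r2)
```

Note that R2 is computed here on the thresholded 0/1 predictions; R2 is really a regression metric, which is why a random binary classifier can score below zero.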
We can clearly see that the actual dataset contains more positive examples than negative examples, but our model reports more negative examples than positive examples. Clearly, it's biased towards giving negative results.
The actual distribution has 56.6% positive examples and 43.4% negative examples. However, the model predicts 70.4% negative examples and 29.6% positive examples, which again shows the model's bias towards giving negative results.
We reproduced the model in the second week. That dataset had more negative examples than positive examples, and our model gave decent results, with an error of 14.7% in predicting hERG blockers. However, in the dataset we used to evaluate the model, the number of positive examples (1255) is larger than the number of negative examples (963). Since our model is biased towards reporting negative values, it performed poorly and acted almost like a random classifier. Thus, we need to work on the model's ability to predict hERG blockers, especially the strong hERG blockers.
@DhanshreeA @GemmaTuron please find the link to the notebook for week 3 here. I had already started reading about Task 3 and working on it in week 2, and was really excited to try it out as I still had quite some time left. It is understandable if you can't review it due to a shortage of time, but if there is any chance you can, my hard work would pay off and it would mean a lot to me. Thank you :).
@Ashita17 you've done a very good job! Thank you for this work, it's quite helpful for us to go and review this model's implementation. It is likely a retrained model thus explaining the discrepancies in results. Please go ahead and submit your final application!
Just a small comment for the future, Pie charts are not a great idea for visualizing probability distributions. Everything else looks great! :)
Thank you so much for your appreciation ma'am, you made my Holi 10x better !! I'll keep the pie chart point in mind 👍🏼
Week 1 - Get to know the community
Week 2 - Get Familiar with Machine Learning for Chemistry
Week 3 - Validate a Model in the Wild
Week 4 - Prepare your final application