ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Ashita Srivastava #1041

Closed Ashita17 closed 3 months ago

Ashita17 commented 3 months ago

Week 1 - Get to know the community

Week 2 - Get Familiar with Machine Learning for Chemistry

Week 3 - Validate a Model in the Wild

Week 4 - Prepare your final application

GemmaTuron commented 3 months ago

Hi @Ashita17

We are on week 3 of the contribution period. Please let us know if you plan to continue with your contribution by the next 2 days, otherwise we will close this issue so we can focus on the applicants who want to make a final application to Ersilia.

Ashita17 commented 3 months ago

I want to continue. I was just having some problems with task 2 of week 2; I will soon post the GitHub repo link for the DeepHERG model I am currently working on.

Ashita17 commented 3 months ago

I wrote a motivation statement in the Outreachy contribution itself. However, I can see that other applicants have posted theirs on their issues. Sorry for the misunderstanding. Please review it here as well. @GemmaTuron @DhanshreeA

MOTIVATION STATEMENT

I wanted to join Outreachy to work with a community of supportive mentors and peers, which is missing in my college, where technical clubs are more or less a gendered space that believes only boys have good analytical and mathematical ability and girls are not at par with them. I am one of only two girls in my college's data science club of about 60 people. I have a strong mathematical background, as I rigorously prepared for and cracked the JEE exam in India.

I chose Ersilia because of the nature of the project. I was particularly interested in doing an ML project where I could learn to run ML models efficiently. I have done some quality data science projects in my college club, and it is always fun to tackle the new issues that come along, especially when we are trying to improve the dataset (for example, making the frequency of the inputs more uniform) or trying to improve the ML model by changing various parameters, sometimes ending up employing a new method altogether!

Through this contribution phase, and hopefully the internship, I want to improve my skills and become a proficient data scientist who is good at handling data as well as developing new models. I particularly like machine learning because it is highly analytical and teaches one patience, owing to its rigorous mathematical background. I read "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" and it deepened my curiosity to explore this field further.

After this internship, I plan to work as a quantitative research analyst at a renowned firm like Goldman Sachs or JPMorgan Chase & Co. to gain experience with real-life projects in an environment of highly skilled individuals, and perhaps after a few years go for higher studies in machine learning. Most importantly, I want to encourage young girls to put themselves out there in male-dominated fields, most of which require mathematical and analytical ability. I would like to volunteer to teach them maths or the concepts of ML once I have the necessary skills.

Ashita17 commented 3 months ago

Sorry for adding the work late; I misunderstood and thought everything had to be attached as a contribution on the Outreachy website.

Ashita17 commented 3 months ago

Week 1

Installing Ersilia

I learned how to work with Conda environments and manage packages. I faced various errors while trying to run the Ersilia model, such as difficulties with relative imports and some problems related to a JSON file. However, after trying to fetch the model multiple times, I successfully installed Ersilia and fetched the eos3b5e model as well.

Screenshot 2024-03-18 at 10 57 17 PM
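For reference, the same fetch/serve/run cycle can also be driven from Python rather than the CLI. This is a minimal sketch, assuming the ersilia package is installed in the active Conda environment; the file names are hypothetical and the exact API may differ between Ersilia versions:

```python
# Minimal sketch of Ersilia's Python interface; API details may vary by version.
from ersilia import ErsiliaModel

model = ErsiliaModel("eos3b5e")      # the test model fetched above
model.serve()                        # spin up a local server for the model
model.run(input="input.csv",         # hypothetical CSV of SMILES strings
          output="predictions.csv")  # predictions are written here
model.close()                        # shut the server down
```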

Testing Ersilia with docker

Here I faced some difficulty, as I was using Docker for the first time, and fetching the model kept failing with errors such as "container is not ready yet". I later switched to GitHub Codespaces and was able to fetch, serve, and run the model eos4wt0.

Screenshot 2024-03-18 at 11 04 32 PM
Ashita17 commented 3 months ago

Week 2

I have chosen the model eos30f3. I successfully fetched and served the model and ran predictions on the 1,000 molecules given in the reference_library.csv file. I am halfway done with week 2 task 1; I still have to evaluate the model and check its reproducibility to complete the week 2 tasks. Please click on this link to see the progress.

GemmaTuron commented 3 months ago

Hi @Ashita17

Can you provide a short summary of week 2 tasks when you complete them so we can provide feedback?

Ashita17 commented 3 months ago

Definitely. I have been facing some issues with task 2 of week 2, but I'll share the summary very soon. Sorry for falling behind schedule, and thank you for your concern; it means a lot.

Ashita17 commented 3 months ago

Week 2 Update

Choosing the Model

There were two sets of models we were supposed to choose from: hERG models and ADME models. I was very excited to experiment with the hERG models. Why? Because they deal with cardiotoxicity, the biology topic that excited me the most when I was in school!

I initially chose the model eos30gr (deepherg), simply because its publication beautifully explains the multi-task deep neural network, and it looked really fun to dig into the Mol2vec and MOE descriptors and learn more about both chemistry and machine learning. However, after trying numerous times, it kept returning the same null output, and I wasn't able to resolve the error even with help from my peers on Slack, for which I am really grateful 😇.

Screenshot 2024-03-11 at 12 19 48 AM

I finally decided to try another model at random and tested eos30f3 (dmpnn-herg). This model worked fine for all the SMILES and gave outputs, but I noticed something odd when I started to evaluate the results I ran on the reference library shared with us on Slack.

Screenshot 2024-03-21 at 9 14 42 PM

It showed that the number of blockers in the dataset is much larger than the number of non-blockers, which is very unlikely. I couldn't trust its results and started to go through all the models again.

BayeshERG model (eos4tcc)

I was on the verge of giving up when I finally stumbled upon the BayeshERG model (eos4tcc). The interesting thing about this model is that while the other models only give the hERG blocking activity, it gives two more outputs: epis (epistemic uncertainty) and alea (aleatoric uncertainty). As the names suggest, this model deals not only with predictions but also with the uncertainties that come along with them, which greatly increases its reliability. I have a deep interest in statistics and probability, so clearly, this was THE model for me!

It is a Bayesian graph neural network model.

Screenshot 2024-03-22 at 5 45 29 AM

Understanding of the model 💭

The model takes 1,000 SMILES strings as input from the dataset "1000_values_dataset.csv". It generates an InChIKey for each molecule and returns the following three outputs:

  1. score: the predicted probability that the molecule inhibits the hERG channel at 10 µM. BayeshERG is a graph neural network (GNN)-based Bayesian deep learning model; it calculates the hERG blocking probability using the softmax function. If the score is ≥ 0.5, the molecule is classified as a hERG blocker, and if the score is < 0.5 it is considered a non-hERG blocker. This score lets us perform classification, labelling blockers as 1 and non-blockers as 0 (see the sketch after the formula below).
  2. alea: the "aleatoric" uncertainty, i.e. the model uncertainty arising from the data's intrinsic randomness, caused by noise inherent in the dataset.
  3. epis: the "epistemic" uncertainty, i.e. the model uncertainty arising from lack of data.

Comment on uncertainty: the total uncertainty, or VARIANCE, is the sum of the quantified epistemic and aleatoric uncertainties:

$$\sigma^2_{\text{total}} = \sigma^2_{\text{aleatoric}} + \sigma^2_{\text{epistemic}}$$
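To make the classification rule and the variance formula concrete, here is a minimal pandas sketch. The file name output.csv and the column names score, alea, and epis are assumptions based on the description above:

```python
import pandas as pd

# Hypothetical prediction file; columns assumed: key, input, score, alea, epis.
df = pd.read_csv("output.csv")

# Classification rule: score >= 0.5 -> hERG blocker (1), else non-blocker (0).
df["label"] = (df["score"] >= 0.5).astype(int)

# Total predictive uncertainty is the sum of the two components.
df["variance"] = df["alea"] + df["epis"]

print(df["label"].value_counts())
```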

Task 1 - Reproducibility

I successfully fetched the model, served it and it gave me interpretable predictions in one go 🥹!

Screenshot 2024-03-22 at 5 53 15 AM

I went on to derive the following results from the model:

Screenshot 2024-03-22 at 5 55 26 AM
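The histogram and pie chart in this section can be produced along these lines. This is a sketch, assuming the same hypothetical output.csv with a score column; the actual code is in the linked notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("output.csv")  # hypothetical prediction file with a "score" column

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of scores in 0.1-wide bins.
ax1.hist(df["score"], bins=[i / 10 for i in range(11)], edgecolor="black")
ax1.set_xlabel("hERG blocking activity score")
ax1.set_ylabel("Number of molecules")

# Pie chart: blockers (score >= 0.5) vs non-blockers.
labels = df["score"].ge(0.5).map({True: "blocker", False: "non-blocker"})
counts = labels.value_counts()
ax2.pie(counts, labels=counts.index, autopct="%1.1f%%")

plt.tight_layout()
plt.show()
```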

Inferences from the histogram

  1. Most molecules (about 25%) have a hERG blocking activity score in the range 0.1-0.2.
  2. Molecules with very strong hERG blocking activity (score > 0.95) make up only about 0.5%.
  3. Unexpected result: I did not expect there to be molecules in the score ranges 0.6-0.7 and 0.8-0.9, but surprisingly there are!
Screenshot 2024-03-22 at 5 56 55 AM

Inferences from the pie chart

  1. We can see that more than 50 percent of the molecules have a hERG blocking activity score of less than 0.4, which means they are non-blockers.
  2. This is in good agreement with typical real-life datasets, where non-hERG blockers outnumber hERG blockers.
  3. Let's go ahead and make this visualisation even better!
Screenshot 2024-03-22 at 5 57 58 AM

Let's dive into uncertainties! 💡

As discussed before, aleatoric uncertainty arises from the inherent randomness of the data, and this component cannot be reduced even by gathering more information. On the other hand, epistemic uncertainty arises from lack of information and can therefore be reduced by gathering more data. In this section, I analyse the aleatoric and epistemic uncertainties to see whether the model is lagging due to lack of information.

Screenshot 2024-03-22 at 5 58 39 AM
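Such binned uncertainty curves can be computed along these lines. This is a sketch under the same assumptions as before (hypothetical output.csv with score, alea, and epis columns):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("output.csv")            # hypothetical prediction file
df["variance"] = df["alea"] + df["epis"]  # total uncertainty, as defined above

# Bin molecules by score and average each uncertainty component per bin.
bins = pd.cut(df["score"], bins=[i / 10 for i in range(11)])
means = df.groupby(bins, observed=True)[["alea", "epis", "variance"]].mean()
means.index = means.index.astype(str)     # readable tick labels

means.plot(marker="o")
plt.xlabel("hERG blocking activity score bin")
plt.ylabel("Mean uncertainty")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```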

Inferences from the line graphs

  1. From the graph above, we can see that the uncertainty is highest for hERG blocking activity scores between 0.4 and 0.6, which means the predictions are least trusted for molecules in this range.
  2. Scope for improvement: although the variance/uncertainty is quite high in the 0.4-0.6 range, most of it comes from the aleatoric component, the non-reducible part of the uncertainty as discussed before. Only a very small portion comes from the epistemic (reducible) part, and this portion is insignificant compared with the total variance.
  3. Thus, the model performs very well, and most of the uncertainty it faces is due to the inherent randomness of the data, which cannot be ruled out.
  4. If improvements are to be made, they should target molecules in the score range 0.4-0.8, where the epistemic component forms the largest share of the total variance; this can be done by collecting more data in this range. Since the bulk of the uncertainty is aleatoric, let's verify that component using scatter plots!

Scatter Plot - Verifying our observations for aleatoric uncertainty!

Screenshot 2024-03-22 at 5 59 51 AM

Inferences from the scatter plot

  1. Clearly, we have a lot of data points in the 0.0-0.4 score range, and there the aleatoric uncertainty is very low.
  2. However, as we move to the 0.4-0.8 score range, the number of data points decreases and the aleatoric uncertainty increases.
  3. For the 0.8-1.0 range, very few molecules are present, so solid conclusions can't be drawn!

Reference to the original authors

Hyunho Kim, Minsu Park, Ingoo Lee, Hojung Nam. "BayeshERG: a robust, reliable and interpretable deep learning model for predicting hERG channel blockers." Briefings in Bioinformatics, Volume 23, Issue 4, July 2022, bbac211. Click here to see the publication.

Ashita17 commented 3 months ago

@GemmaTuron @DhanshreeA, I have completed task 1 of week 2 and am currently working on task 2 of week 2. Kindly review my tasks and suggest changes, if any. Posting the link to the repo again here. Thank you :).

DhanshreeA commented 3 months ago

@Ashita17 good work so far, you are in the right direction with Task 1. However I would recommend creating a final application starting next week and linking this issue there. Please do not attempt task 3 as we would not be able to provide feedback on that at this point. I can review your Task 2 work on Monday.

Ashita17 commented 3 months ago

Week 2 Update

Task 2 - Reproducibility

For this task, we were supposed to install the original BayeshERG code, run it on the same molecules, and compare its results with the Ersilia implementation:

Setting up BayeshERG on my system

This was the longest process. I use a MacBook Air M1, and the model specifically requires CUDA, which is not supported by the Mac M1 silicon chips, so I decided to work on Colab. There as well, it was difficult to create a Conda environment, and the terminal is available only to users with Pro access. So finally, I decided to use a friend's laptop. There, too, I made some syntax errors again and again while following the instructions, but finally I saw this on my screen and I was on cloud nine!

WhatsApp Image 2024-03-23 at 14 14 49

I set the sampling time to 100. The default the model mentions on its site is 30, but since the authors used a sampling time of 100 while testing the model, I chose 100 as well.

I proceeded to compare the original BayeshERG implementation with the Ersilia model eos4tcc and came up with the following analysis.

Comparison between predictions of BayeshERG and eos4tcc Models

Click here to see the notebook.
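The side-by-side comparison can be set up along these lines. This is a sketch; the file names are hypothetical and the actual code is in the linked notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file names: outputs of the original code and of eos4tcc
# on the same 1,000 molecules.
bayes = pd.read_csv("bayesherg_predictions.csv")
eos = pd.read_csv("eos4tcc_predictions.csv")

# Grouped histogram of scores from the two implementations.
bins = [i / 10 for i in range(11)]
plt.hist([bayes["score"], eos["score"]], bins=bins,
         label=["BayeshERG (original)", "eos4tcc (Ersilia)"])
plt.xlabel("hERG blocking activity score")
plt.ylabel("Number of molecules")
plt.legend()
plt.show()
```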

Bar Chart 📊

bar_graph

Inferences

Similarity:

  1. Relative number of hERG and non-hERG blockers: in both sets of predictions, the number of non-hERG blockers (score < 0.5) is greater than the number of hERG blockers (score ≥ 0.5).

Differences:

  1. BIAS: the BayeshERG model predicts an appreciable number of strong hERG blockers (score > 0.95), but the eos4tcc model reports none. We can say that eos4tcc is biased towards non-hERG predictions and hardly ever reports strong hERG blockers. It also tends to report very weak hERG blocking activity, placing significantly more molecules in the lower score range (0-0.2) than BayeshERG does.

Let's look at pie charts for even better visualisation!

Pie Charts 🟡

pie_charts

Inferences

Similarity:

  1. Both models report approximately 65% non-blockers and 35% blockers.

Differences:

  1. BIAS: there are hardly any strong blockers for the eos4tcc model (red zone), but they do exist in the BayeshERG predictions. Thus, eos4tcc is biased towards reporting molecules as non-hERG blockers and hardly reports strong hERG blockers.
  2. Percentage difference: in terms of overall predictions, eos4tcc detects 14.7% fewer hERG blockers than the original publication model, BayeshERG.

Comparing the uncertainties 🤔

uncertainity

Inferences

There are significant differences when it comes to the uncertainties.

  1. Overall variance: the overall variance roughly follows a normal (bell-shaped) distribution for the BayeshERG model, whereas for the eos4tcc model it increases at first, reaches a peak, and then falls. However, the maximum variance is similar, remaining around 0.25 for both models.
  2. Epistemic uncertainty: it remains almost constant for both models, but it is lower for eos4tcc. This means the model is under-reporting the epistemic uncertainty.
  3. Aleatoric uncertainty: it varies just like the variance: roughly normal for the BayeshERG model, and rising to a peak and then falling for the eos4tcc model.

Final Comment on Reproducibility ✔️

Since the model focuses on both predictions and uncertainties, let's talk about both!

  1. Predictions: the eos4tcc and BayeshERG models give similar results, but the discrepancy in reported hERG blockers is 14.7%, which is an appreciable value. eos4tcc is hardly able to predict any strong hERG blockers, whereas BayeshERG does, which is a strong feature of BayeshERG as highlighted in the publication.
  2. Uncertainties: the overall uncertainty/variance values show significant differences in the pattern they follow, although the maximum variance reported by both models remains about 0.25. The epistemic uncertainty doesn't vary much, and the aleatoric uncertainty follows a pattern very similar to the variance.

THUS THE MODEL IS REPRODUCIBLE BUT NEEDS A LOT OF IMPROVEMENT AS APPRECIABLE DIFFERENCES EXIST!

Ashita17 commented 3 months ago

@DhanshreeA @GemmaTuron kindly review my task 2 for week 2. Putting the link to the notebook here, thank you!

Ashita17 commented 3 months ago

Week 3 Update

Task 3 - Performance

For this task, we were supposed to find an external dataset and evaluate the model's performance against it:

Search for the dataset

I found some datasets on ChEMBL that give IC50 values for different molecules. I downloaded one CSV file and manually deleted some columns, as pandas could not read it due to some delimiters and special characters. Finally, the dataframe looked somewhat like this.

Screenshot 2024-03-24 at 9 44 57 PM

That's right, it has 13,396 values!

Processing the dataframe

Removing the common values

After all these steps, the dataset, which initially had 13,396 molecules, ended up with just 2,218 molecules (1,255 positive molecules, or hERG blockers, and 963 negative molecules, or non-hERG blockers). A sketch of this kind of processing follows.
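The sketch below illustrates the general shape of the pipeline. The file names, column names, and the 10 µM activity cutoff are assumptions for illustration; the exact steps and threshold are in the notebook:

```python
import pandas as pd

# Hypothetical file/column names; the raw ChEMBL export was cleaned by hand first.
df = pd.read_csv("chembl_herg_ic50.csv")

# Assumed activity cutoff: IC50 <= 10,000 nM (10 uM) counts as a hERG blocker.
# The actual cutoff used in the notebook may differ.
df["label"] = (df["Standard Value"] <= 10_000).astype(int)

# Drop duplicates, and drop molecules shared with the reference library the
# model was already evaluated on, so the test set stays independent.
reference = pd.read_csv("reference_library.csv")
df = df.drop_duplicates(subset="Smiles")
df = df[~df["Smiles"].isin(reference["smiles"])]

df[["Smiles", "label"]].to_csv("evaluation_dataset.csv", index=False)
print(df["label"].value_counts())  # blockers vs non-blockers after filtering
```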

I fed the evaluation_dataset to the eos4tcc model and started evaluating the results.

I generated the following confusion matrix:

Confusion matrix

plot6

Final Results

| Metric | Value |
| --- | --- |
| Accuracy | 0.497 |
| Sensitivity | 0.3179 |
| Specificity | 0.732 |
| Precision / Positive Predictive Value | 0.607 |
| Recall | 0.3179 |
| Negative Predictive Value | 0.4516 |
| Balanced Accuracy | 0.525 |
| Matthews Correlation Coefficient | 0.05429 |
| F1 Score | 0.417 |
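These metrics can be computed with scikit-learn along these lines. This is a sketch, assuming a hypothetical file evaluation_results.csv that joins the true label with the model's score for each molecule:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Hypothetical file joining the true labels with the eos4tcc scores.
df = pd.read_csv("evaluation_results.csv")
y_true = df["label"]
y_pred = (df["score"] >= 0.5).astype(int)

# Unpack the 2x2 confusion matrix for the metrics sklearn does not provide directly.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:          ", accuracy_score(y_true, y_pred))
print("Sensitivity/Recall:", recall_score(y_true, y_pred))
print("Specificity:       ", tn / (tn + fp))
print("Precision (PPV):   ", precision_score(y_true, y_pred))
print("NPV:               ", tn / (tn + fn))
print("Balanced accuracy: ", balanced_accuracy_score(y_true, y_pred))
print("MCC:               ", matthews_corrcoef(y_true, y_pred))
print("F1 score:          ", f1_score(y_true, y_pred))
```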

Comments on Results

  1. Accuracy and balanced accuracy: these values are nearly 0.5, which means the classification is almost random and our model performs poorly.
  2. Sensitivity: the value is very small (0.3179 < 0.5). This means that of all the positive examples in the dataset, our classifier correctly identifies only about 31%, showing its inability to identify positive molecules.
  3. Specificity: the value is 0.732 > 0.5, which means our classifier is good at correctly classifying negative examples; this is expected, as it is biased towards labelling molecules as negative.
  4. Precision: the value is 0.607 > 0.5, meaning that if our model classifies an example as positive, there is a 60.7% chance it is actually positive. This is expected: the model is biased towards negative results, so when it does report a molecule as positive, there is a good chance it really is positive.
  5. Recall: its value is 0.3179 < 0.5. This is also expected, as recall tells us what proportion of all positive examples the model can identify. Our model tends to under-report positives and thus detects only 31.79% of them.
  6. Negative predictive value: the proportion of examples classified as negative that are actually negative. Its value is 0.4516 < 0.5, because our model has a strong tendency to report a molecule as negative even when it is positive. So, if a molecule is reported as negative, there is just a 45.16% chance that it actually is.
  7. Matthews correlation coefficient: its value is almost 0, which again shows that our model behaves like a random classifier.
  8. F1 score: it is low (just 0.417), which indicates poor precision and poor recall.

CONCLUSION: Overall, our model has an appreciable bias towards reporting examples as negative while under-reporting the positive examples.

Receiver Operating Characteristic Curve

plot2

Inferences

AUROC: its value is 0.5, which means the model more or less randomly classifies examples as positive or negative. This happens because our model is very good at identifying negative examples, often overestimating their number, while under-reporting the positive examples. The result is overall random performance.

R² score: it is -1, which means the model performs worse than a naive baseline; it would be better to report the average value of the dependent variable (here, the label) than to use the model's predictions.

Precision Recall Curve

plot1

Inferences

The area under the precision-recall curve is 0.5, which means the trade-off between precision and recall is equivalent to random chance and the model is ineffective at distinguishing the negative and positive classes.
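Both curves and their areas can be obtained with scikit-learn. This is a sketch under the same assumptions as the metrics snippet above (hypothetical evaluation_results.csv with label and score columns):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score, roc_curve

df = pd.read_csv("evaluation_results.csv")  # hypothetical: "label", "score" columns
y_true, y_score = df["label"], df["score"]

fpr, tpr, _ = roc_curve(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# ROC curve with the random-classifier diagonal for reference.
ax1.plot(fpr, tpr, label=f"AUROC = {roc_auc_score(y_true, y_score):.2f}")
ax1.plot([0, 1], [0, 1], linestyle="--")
ax1.set_xlabel("False positive rate")
ax1.set_ylabel("True positive rate")
ax1.legend()

# Precision-recall curve with its area.
ax2.plot(recall, precision, label=f"AUPRC = {auc(recall, precision):.2f}")
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.legend()

plt.tight_layout()
plt.show()
```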

Comparison between the actual distribution and the predicted distribution

Bar Graph 📊

plot5

Inferences

We can clearly see that the actual dataset contains more positive examples than negative examples, but our model reports more negative examples than positive ones. Clearly, it is biased towards giving negative results.

Pie chart 🟡

pie_charts_actual_predicted

Inferences

The actual distribution has 56.6% positive examples and 43.4% negative examples. However, the model predicts 70.4% negative and 29.6% positive, which again shows the model's bias towards negative results.

Final Note ✔️

We reproduced the model in the second week. That dataset had more negative examples than positive examples, and our model gave decent results but showed a 14.7% discrepancy in predicting hERG blockers. However, in the dataset we used to evaluate the model, the number of positive examples (1,255) is larger than the number of negative examples (963). Since our model is biased towards reporting negative values, it performed poorly and acted almost like a random classifier. Thus, we need to work on the model's ability to predict hERG blockers, especially the strong hERG blockers.

Ashita17 commented 3 months ago

@DhanshreeA @GemmaTuron please find the link to the notebook for week 3 here. I had already started reading about task 3 and working on it in week 2, and I was really excited to try it out as I still had quite some time left. It is understandable if you can't review it due to shortage of time, but if there's any chance you can, my hard work would pay off and it would mean a lot to me. Thank you :).

DhanshreeA commented 3 months ago

@Ashita17 you've done a very good job! Thank you for this work; it's quite helpful for us when we go and review this model's implementation. It is likely a retrained model, which would explain the discrepancies in results. Please go ahead and submit your final application!

Just a small comment for the future: pie charts are not a great idea for visualizing probability distributions. Everything else looks great! :)

Ashita17 commented 3 months ago

Thank you so much for your appreciation ma'am, you made my Holi 10x better !! I'll keep the pie chart point in mind 👍🏼