ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Blood Brain Barrier Penetration #426

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

BBB (TDC dataset)

Publication

Hello @sayantani11 !

As part of your Outreachy contribution, we have assigned you the dataset "BBB" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

sayantani11 commented 2 years ago

With this dataset we are trying to predict whether a drug can penetrate the blood-brain barrier, the protective layer around the brain, so that it can reach its actual site of action. BBB is a dataset that has 1975 drugs.

sayantani11 commented 2 years ago

Hi @GemmaTuron, here's the colab. I would really like some feedback on the work up to this point. Thank you!

sayantani11 commented 2 years ago

Summary of Work till now

BBB is a binary classification problem: we are given each drug as a SMILES string and have to predict whether the drug is able to penetrate the protective barrier in the brain that blocks foreign compounds. In other words, we are classifying drugs into those that cross the barrier ("Crosses") and those that do not ("Doesn't Cross").

Loading of Data

After I split the given data, the sizes of the subsets are:
Train = 1421
Validation = 203
Test = 406


Analyzing of Data

Then I counted the active and inactive molecules in each subset. The results are:

Train Dataset
Active molecules: 1096
Inactive molecules: 325

Validation Dataset
Active molecules: 152
Inactive molecules: 51

Test Dataset
Active molecules: 303
Inactive molecules: 103
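As a quick sanity check on these counts, the sketch below computes how much of the data each subset holds and what fraction of each subset is active. It uses only the numbers reported in this comment; the variable names are illustrative and not taken from the colab.

```python
# Counts reported above: (active, inactive) molecules per subset.
splits = {
    "train": (1096, 325),
    "valid": (152, 51),
    "test": (303, 103),
}

total = sum(active + inactive for active, inactive in splits.values())

summary = {}
for name, (active, inactive) in splits.items():
    size = active + inactive
    summary[name] = {
        "size": size,
        "share_of_data": size / total,     # fraction of all molecules in this subset
        "active_fraction": active / size,  # fraction labelled as crossing the barrier
    }

for name, row in summary.items():
    print(f"{name:5s} size={row['size']:4d} "
          f"share={row['share_of_data']:.2f} active={row['active_fraction']:.2f}")
```

The subsets come out at roughly 70%, 10% and 20% of the data, and about three quarters of the molecules in each subset are active, so the classes are imbalanced in a similar way across all three.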

image


With the help of the RDKit package I was able to visualize a few of the active and inactive molecules.

Active Molecules
image

Inactive Molecules
image


Preparing Data for Modelling

I created lists from the column "Y" (Y_train, Y_valid and Y_test) and, similarly, used the column "Drug" to create lists of SMILES (smiles_train, smiles_valid and smiles_test).

sayantani11 commented 2 years ago

Training of the ML model

I then installed Lazy-QSAR from GitHub to train my model. This package is a library developed by Ersilia that provides a faster way of building models. It lets us fit a model on the train dataset. Model training is the most crucial step in this process: it produces a model that can then be sent for testing and deployment, and the better the training, the better the accuracy of the results.

Each subset has three columns: Drug_ID, Drug and Y. I took Y as the target variable and Drug as the input variable. The Lazy-QSAR library takes the SMILES in Drug as input and converts them to Morgan fingerprints, which represent each molecule as a fixed-length vector of bits. This is needed because the model cannot learn directly from SMILES strings; it needs numerical input. I did this using the MorganBinaryClassifier that's available in Lazy-QSAR. The model is fitted using the input variable and the target variable, and the fitted model is then saved in joblib format.
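To illustrate why a fingerprint is needed, here is a toy sketch that turns SMILES strings of different lengths into bit vectors of one fixed length. It is deliberately not the Morgan algorithm and not what Lazy-QSAR does internally: real Morgan fingerprints hash circular atom neighbourhoods of the molecular graph, whereas this stand-in hashes short substrings of the SMILES text. The function name and parameters are hypothetical.

```python
import zlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list[int]:
    """Toy stand-in for a fingerprint: hash SMILES substrings into a bit vector.

    NOT the Morgan algorithm. It only shows the general idea of folding
    hashed fragments into a fixed-length vector of 0s and 1s.
    """
    bits = [0] * n_bits
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            fragment = smiles[start:start + length]
            # crc32 is deterministic across runs, unlike Python's built-in hash()
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")
caffeine = toy_fingerprint("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(len(ethanol), len(caffeine))  # both vectors have the same length
```

Whatever the size of the molecule, the classifier always receives a vector of the same length, which is the property the real fingerprints provide as well.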

sayantani11 commented 2 years ago

Evaluation of the model

After training the model, I obtained predictions for the validation and test datasets. The predictions are probabilities, since all the values are between 0 and 1. The required libraries (pandas, NumPy, scikit-learn and seaborn) were imported at the very start. As we were shown, I used performance measurements such as F1 score, the ROC curve and AUC value, precision and recall.

The prediction outputs for both the validation and test sets are probabilities, but we need to convert them to class labels (0 or 1), so I used an optimum threshold value of 0.654. As we were shown, values below the threshold were mapped to 0 and values equal to or above the threshold were mapped to 1. I defined a function, find, for this.
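The mapping described above can be sketched in a few lines. This is not the find function from the colab, whose code is not shown here; to_labels is a hypothetical name and the example scores are made up.

```python
THRESHOLD = 0.654  # threshold value quoted above


def to_labels(probabilities, threshold=THRESHOLD):
    """Map each probability to 1 if it is >= threshold, otherwise to 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


example_scores = [0.10, 0.653, 0.654, 0.90]  # made-up scores for illustration
print(to_labels(example_scores))  # [0, 0, 1, 1]
```

Note that a score exactly equal to the threshold is mapped to 1, matching the rule stated in the comment.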

Confusion matrix

Validation set
image
30 molecules are classified as "Doesn't Cross" and actually do not cross the barrier (TRUE NEGATIVES).
21 molecules are classified as "Crosses" but actually do not cross the barrier (FALSE POSITIVES).
6 molecules are classified as "Doesn't Cross" but actually cross the barrier (FALSE NEGATIVES).
146 molecules are classified as "Crosses" and actually cross the barrier (TRUE POSITIVES).

Test set
image
50 molecules are classified as "Doesn't Cross" and actually do not cross the barrier (TRUE NEGATIVES).
53 molecules are classified as "Crosses" but actually do not cross the barrier (FALSE POSITIVES).
13 molecules are classified as "Doesn't Cross" but actually cross the barrier (FALSE NEGATIVES).
290 molecules are classified as "Crosses" and actually cross the barrier (TRUE POSITIVES).
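For reference, the four cells of a confusion matrix can be counted directly from the true and predicted labels. The sketch below takes "Crosses" (label 1) as the positive class, so a molecule predicted as "Crosses" that does not actually cross is a false positive. The label lists are synthetic and only arranged to reproduce the validation counts; the function name is illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, with 1 = crosses the barrier."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1  # predicted "Crosses", actually does not cross
        else:
            fn += 1  # predicted "Doesn't Cross", actually crosses
    return tn, fp, fn, tp


# Synthetic labels arranged to reproduce the validation counts
y_true = [0] * 30 + [0] * 21 + [1] * 6 + [1] * 146
y_pred = [0] * 30 + [1] * 21 + [0] * 6 + [1] * 146
print(confusion_counts(y_true, y_pred))  # (30, 21, 6, 146)
```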

AUC/ROC curve and value performance (validation set)

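image
The AUC is 0.839, which is fairly close to 1 (a value of 0.5 would correspond to random guessing). This suggests that, most of the time, the model gives a higher score to molecules that cross the barrier than to those that do not, so it should identify most true positives and true negatives correctly.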
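One way to read the AUC is as the fraction of (active, inactive) pairs in which the active molecule receives the higher score. The sketch below computes it that way on made-up labels and scores; it is a plain pairwise implementation for illustration, not the scikit-learn routine used in the colab, and the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0]            # made-up labels
scores = [0.9, 0.8, 0.4, 0.7, 0.2]  # made-up predicted probabilities
print(roc_auc(y_true, scores))      # 5 of 6 pairs are ranked correctly
```

Because it works on the raw probabilities, the AUC does not depend on the 0.654 threshold, unlike precision, recall and the F1 score.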

Model performance

Validation set
Precision score: 0.874251
Recall score: 0.960526
F1 score: 0.915361

Test set
Precision score: 0.845481
Recall score: 0.957096
F1 score: 0.897833
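These scores follow directly from the confusion matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. A minimal sketch that recomputes them from the counts reported in this thread (the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp)  # of the molecules predicted to cross, how many do
    recall = tp / (tp + fn)     # of the molecules that cross, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


valid_scores = precision_recall_f1(tp=146, fp=21, fn=6)
test_scores = precision_recall_f1(tp=290, fp=53, fn=13)

for name, (p, r, f1) in [("validation", valid_scores), ("test", test_scores)]:
    print(f"{name}: precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
```

Recall is higher than precision on both sets, which reflects that the model misses few molecules that cross the barrier but labels a fair number of non-crossing molecules as "Crosses".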

GemmaTuron commented 2 years ago

Hi @sayantani11 !

Thanks for this good job! I'll go ahead and close the issue as completed so you can focus on your final application to the program!