ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: PGP #428

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

PGP (TDC Dataset)

Publication

Hello @ZakiaYahya!

As part of your Outreachy contribution, we have assigned you the dataset "PGP" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

ZakiaYahya commented 2 years ago

Thanks @GemmaTuron.

ZakiaYahya commented 2 years ago

PGP dataset overview: I'm working with the PGP (P-glycoprotein) dataset. P-glycoprotein (Pgp) is an ABC transporter protein involved in intestinal absorption, drug metabolism, and brain penetration, and its inhibition can seriously alter a drug's bioavailability and safety. In addition, inhibitors of Pgp can be used to overcome multidrug resistance. The PGP dataset contains 1,218 drugs.

Load your Data: I have successfully installed the TDC (Therapeutics Data Commons) package. After that, I imported the PGP dataset, split it into three sets (train, validation and test) and saved each set in CSV format in my Google Drive.
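A minimal sketch of this loading and splitting step, following the standard PyTDC single-prediction API ("Pgp_Broccatelli" is the TDC name for this ADME dataset; the CSV file names are illustrative):

    from tdc.single_pred import ADME

    # download the PGP dataset and split it into train / validation / test
    data = ADME(name="Pgp_Broccatelli")
    split = data.get_split()  # dict of DataFrames with keys 'train', 'valid', 'test'

    # each DataFrame has Drug_ID, Drug (SMILES) and Y (0/1 bioactivity) columns
    split["train"].to_csv("pgp_train.csv", index=False)
    split["valid"].to_csv("pgp_valid.csv", index=False)
    split["test"].to_csv("pgp_test.csv", index=False)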

ZakiaYahya commented 2 years ago

Hello @GemmaTuron, I have a question. Can we use the round() function to map predicted probabilities to either 1 or 0? Is it okay to do that? I'm drawing a confusion matrix, and for this we have to convert the predicted probabilities into labels. Kindly guide me on this.

confusion

ZakiaYahya commented 2 years ago

Data Visualization: Using matplotlib, we plot the train data points (screenshot attached below: image). From the plot we see that our dataset comprises two classes, Active and Inactive, hence it's a binary classification problem.

In order to visualize the molecular structures in the data, we use RDKit to draw Active and Inactive molecules. For visualization, I draw the first Active and the first Inactive molecule from the train set. Active molecule: image

Inactive molecule: image
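A small sketch of this RDKit step, assuming train_df is the train split DataFrame from the TDC sketch above (columns Drug and Y):

    from rdkit import Chem
    from rdkit.Chem import Draw

    active_smiles = train_df[train_df["Y"] == 1]["Drug"].iloc[0]    # first Active molecule
    inactive_smiles = train_df[train_df["Y"] == 0]["Drug"].iloc[0]  # first Inactive molecule

    mols = [Chem.MolFromSmiles(active_smiles), Chem.MolFromSmiles(inactive_smiles)]
    # renders the two structures side by side in a Colab notebook cell
    Draw.MolsToGridImage(mols, molsPerRow=2, legends=["Active", "Inactive"])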

Train your ML Model: To train the ML model, we first prepare the data accordingly. Our data comprises Drug ID, Drug and Y (bioactivity). For training, we take Drug as the SMILES input to our ML model and pass Y as its corresponding bioactivity so the model can learn from it. We are using a Lazy-QSAR model for training, so we don't need to featurize the SMILES ourselves: the lazy-qsar package automatically converts SMILES into Morgan fingerprints, so we pass the SMILES directly to the lazy-qsar model.
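A rough sketch of this training step, again assuming train_df and valid_df are the TDC train and validation splits; the lazy-qsar module path and class name are assumptions based on the package README and may differ between versions:

    # NOTE: import path and class name are assumed from the lazy-qsar README
    from lazyqsar.binary import MorganBinaryClassifier

    smiles_train = train_df["Drug"].tolist()   # SMILES strings
    y_train = train_df["Y"].tolist()           # 0/1 bioactivity labels

    model = MorganBinaryClassifier()   # featurizes SMILES as Morgan fingerprints internally
    model.fit(smiles_train, y_train)

    # predicted probabilities for the validation SMILES
    y_pred_valid = model.predict_proba(valid_df["Drug"].tolist())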

Evaluate your model: In order to evaluate model performance, i.e. how well the model has learned the data, we use the following performance measures (a scikit-learn sketch of computing them is shown after this list):

  1. Precision & Recall
  2. AUROC value
  3. AUC graph (ROC curve)
  4. Confusion matrix, also known as a contingency matrix

Performance measure on Validation Set:

Performance measure on Test Set:

@GemmaTuron Here's the link to the Colab notebook.

ZakiaYahya commented 2 years ago

Hi @GemmaTuron, I have finished training and computing the performance metrics on the PGP dataset using the Lazy-QSAR model. Now I'm working on interpreting the results I got. Thanks.

ZakiaYahya commented 2 years ago

Interpretation of Model Evaluation using Performance Metrics: Before discussing the performance metrics obtained on the validation and test sets, let's first talk about balanced versus imbalanced datasets. Imbalanced data refers to datasets where the target classes have an uneven distribution of observations, i.e. one class label has a very high number of observations and the other a very low number, while in a balanced dataset both classes have a nearly equal number of observations.

We first check our train set: 461 Active and 391 Inactive. The train set is not perfectly balanced, but it is not imbalanced either; it is nearly balanced. So the chance that the model becomes biased towards one class during training is very low, which is what we want in our classification task. Likewise, our validation set comprises 66 Active and 56 Inactive molecules, which is nearly balanced too, and the test set comprises 123 Active and 121 Inactive molecules, which is almost balanced.

Confusion Matrix: For a classification problem, the confusion matrix is a good starting point for checking the performance of the trained model. As we can see from the confusion matrices attached below, the numbers of True Positives (TP) and True Negatives (TN) are much higher than the numbers of False Positives (FP) and False Negatives (FN), which indicates the model is performing well. Let's look at the numbers in each case. Validation case: the number of molecules predicted correctly is TP + TN = 52 + 45 = 97, and the number predicted incorrectly is FP + FN = 11 + 14 = 25. That means 97 out of 122 molecules (about 79.5%) are predicted correctly while 25 out of 122 are predicted incorrectly. Test case: likewise, the number of molecules predicted correctly is TP + TN = 101 + 98 = 199, and the number predicted incorrectly is FP + FN = 23 + 22 = 45. That means 199 out of 244 molecules (about 81.6%) are predicted correctly while 45 out of 244 are predicted incorrectly. image image

Hence, from the confusion matrices, we see that the model performs well.

AUROC curve: ROC curves and the AUC are appropriate when the observations are reasonably balanced between the classes. As our dataset is nearly balanced, this is a good metric for assessing the trained model. The ROC curve shows the relationship between true alarms (the hit rate) and false alarms, while the AUC measures how well the model separates the classes. Validation case: the AUROC value we get is 0.863, which is pretty good. Test case: the AUROC value we get is 0.876, which is also good. From the curves below we can clearly see that the ROC curve bends towards a true positive rate of 1 and the AUC is high, which indicates the model is capable of separating Active from Inactive molecules.

image image
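A sketch of how the ROC curves above can be drawn, again assuming y_true and y_prob hold the labels and predicted probabilities of one split:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auroc = roc_auc_score(y_true, y_prob)

    plt.plot(fpr, tpr, label=f"AUROC = {auroc:.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (hit rate)")
    plt.legend()
    plt.show()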

Precision & Recall: Precision is "how many retrieved items are relevant" while recall is "how many relevent items are retrieved".

Validation case: Precision = 0.8253 & Recall = 0.7878
Test case: Precision = 0.8145 & Recall = 0.8211

From the above values of precision and recall, we see that all values are close to 1, which is good. A precision of 1 (100%) would mean that every positive prediction is correct, with no false positives (FP = 0), while a recall of 1 (100%) would mean that every actual positive is correctly identified, with no false negatives (FN = 0).

GemmaTuron commented 2 years ago


Hi @ZakiaYahya,

I use a for loop that iterates over the list of probabilities to create a list with only 0s and 1s. This also allows you to specify the probability threshold (otherwise you are always using 0.5). Can you try this for loop and show me the code?
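A minimal sketch of the kind of loop described here, with the threshold as a configurable value (variable names are illustrative):

    threshold = 0.5               # can be changed instead of always using 0.5
    y_labels = []
    for prob in y_pred_probs:     # y_pred_probs: list of predicted probabilities
        y_labels.append(1 if prob >= threshold else 0)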

GemmaTuron commented 2 years ago
  • Data Analysis: The PGP (P-glycoprotein) dataset of TDC is a binary dataset, i.e. the associated bioactivity of each molecule is either Active or Inactive. In order to train ML/AI models we split the dataset into three parts: train, validation and test sets. Train set: this set is used to train the ML model, and its corresponding bioactivity labels are known. It comprises 852 of the 1,218 molecules of the TDC PGP dataset: 461 Active and 391 Inactive. Validation set: this set is used to evaluate the trained model; its bioactivity labels are also known, and it is used to fine-tune the model hyperparameters to enhance performance. It comprises 122 of the 1,218 molecules: 66 Active and 56 Inactive. Test set: this is a separate set used to test the model after training is complete. It comprises 244 of the 1,218 molecules: 123 Active and 121 Inactive.


Performance measure on Validation Set:

  • Precision = 0.8253 & Recall = 0.7878
  • AUROC value = 0.863
  • AUC graph image
  • Confusion Matrix image

Performance measure on Test Set:

  • Precision = 0.8145 & Recall = 0.8211
  • AUROC value = 0.8763
  • AUC graph image
  • Confusion Matrix image


Good job! But we are plotting binary data in the first graph, whereas a histogram is meant for continuous data. There are more suitable plots for binary data (bar plots, pie charts...); can you use one of those?

GemmaTuron commented 2 years ago


Very good interpretation @ZakiaYahya !

  • What is usually considered a good AUROC value? With which value would you be happy (0.5, 0.7, 1...)?
  • I'll explain a bit more about precision and recall and then you can finish commenting on those metrics.

ZakiaYahya commented 2 years ago


Yes, sure @GemmaTuron, I'll try the for loop and let you know once it's done.

ZakiaYahya commented 2 years ago



Thanks @GemmaTuron. I'll make the suggested changes (i.e. using the for loop and a bar plot for the binary data) and will let you know. Generally, an AUROC value close to 1 is considered excellent, meaning a TP rate of 1 and an FP rate of 0. But if the AUROC turns out to be exactly 1, the following should come to mind: "AUROC equal to 1 on the test set means either that your classifier managed to learn the task very well, assuming that the test set is varied enough to decently represent the kind of samples your classifier will be used with in the future, or that your test data have the same structure as your train data with no generalization at all, or that your testing data leaked into your training data". So, in my view, an AUROC below 1, such as 0.7, describes the performance of a trained model more realistically than an AUROC of exactly 1.

ZakiaYahya commented 2 years ago

@GemmaTuron I have incorporated the following changes in my notebook as suggested by you.

  1. I have incorporated a for loop (as a list comprehension, instead of using the round() function) for mapping probabilities to labels with a threshold of 0.5. Here's the patch of my code:

    labels_test = [int(prob >= 0.5) for prob in y_pred_test]  # 1 if probability >= 0.5, else 0
    y_labels_test = labels_test

  2. I have plotted a bar plot for the binary data: image (a minimal sketch of this plot is shown below)

or you can access my notebook here
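A minimal sketch of the bar plot described in point 2, assuming train_df is the train split DataFrame with its 0/1 labels in column "Y" (names are illustrative):

    import matplotlib.pyplot as plt

    counts = train_df["Y"].value_counts()   # number of Inactive (0) and Active (1) molecules
    plt.bar(["Inactive (0)", "Active (1)"], [counts.get(0, 0), counts.get(1, 0)])
    plt.ylabel("Number of molecules")
    plt.title("Class distribution of the PGP train set")
    plt.show()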

GemmaTuron commented 2 years ago

Hi @ZakiaYahya !!

Thanks, good job! I think we can close this issue and you can focus the rest of your time on making a good final application. Please also add this as a contribution on the Outreachy website.

ZakiaYahya commented 2 years ago


Thanks @GemmaTuron. Sure, I'll start working on my final application.