ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
219 stars 147 forks source link

🦠 Model Request: Cyp3a4 Substrate #422

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

Cyp3a4 Substrate (TDC dataset)

Publication

Hello @EstherIdabor!

As part of your Outreachy contribution, we have assigned you the dataset "cyp3a4 substrate" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

EstherIdabor commented 2 years ago

Thank you @GemmaTuron , on it google colab

EstherIdabor commented 2 years ago

The steps I am taking/have taken in building a ML model to predict if a drug is a substrate of CYP3A4

1) Extensive research to gain an understanding of CYP3A4 enzyme: Pharmacokinetics(ADME)(how the body interacts with administered drugs) is defined in terms of site of administration, distribution to different part of the body, metabolism(how the drug is metabolized into harmless/inactive metabolites) and excretion. In metabolism the focus is on how specialized enzymatic system breakdown drugs because this influences the duration of the drug action on the body as well as the intensity/concentration of the drug at a given time. CYP3A4 is an enzyme under the CYP450 superfamily of enzymes found in the liver and intestine that oxidizes drugs or toxins so that they can be removed from the body. Building a model that predict if a drug is metabolized/ a substrate of CYP3A4 is building a model that can be used to predict if the duration of drug action and and concentration is affected by the CYP3A4 enzyme.

2 )Data Upload; In this step I installed the TDC(therapeutics Data Commons) package, then I imported ADME and got the specific assay assigned to me, I split the dataset using the get_split( ) method in to train, validate and test which will be used for training, validating and testing the model respectively. I then proceed to save each dataset on my drive.

EstherIdabor commented 2 years ago
  1. Data Analysis i. Understanding my data

After splitting my dataset into three datasets(train, validate, test) and ensuring they are saved to my drive, I then took a critical look at the dataset. By using the len() function, I got the value of the molecules in each dataset. In the train set they are a total number of 469 molecules, in the validation set, they are a total number of 67 molecules, in the test dataset, they are a total number of 134 molecules. I proceeded to get the number of actives in this case drugs that are a substrate of CYP3A4 and those that are not.

GemmaTuron commented 2 years ago

Good start @EstherIdabor !

EstherIdabor commented 2 years ago

ii. visualizing my data

after researching on the plot best suited for visualizing discrete(categorical) data as you asked on the slack channel @GemmaTuron. I discovered discrete data are best visualized with bar chart, pie chart and frequency plot. I went ahead to visualize my train data using matplotlib package to draw a bar chart and pie chart.

from the images one can see that the actives(substrate of cyp3a4) are more than the inactives(not-substrate), though the difference is not so large

download-barchart

pieChart representation download-piechart

EstherIdabor commented 2 years ago

Good start @EstherIdabor !

Thank you

EstherIdabor commented 2 years ago

Using RDKit package I was able to visualize the molecular image of active(substrates) and inactive(non-substrate).

Actives download-rdkit-image

Inactives rdkit-inactive-image

EstherIdabor commented 2 years ago

Preparation for Modelling For convenience in working with the datasets, I converted the smiles(drugs) as well the "Y" (outcome/target variable) to a list

Model Training Before training, I tried to make sense of what the process is all about to have a working knowledge of the steps I will be taking so I understood it in this way; In training a model the algorithm tries to match the smiles(drugs) with the outcome/target variables(Y) and any molecule/drug that looks almost exactly as the smile it associate with its target variable. The smiles(drug) is our training data but its not something that the computer understands because computer as well as machine learning algorithms works with numbers, so we need to convert smiles (list of string) into numbers(0 and 1)) , morgan fingerprints is one of the easiest to use, it convert the smiles to a list of vectors. Now the Lazy-QSAR which is what is to be used in training the model does a lot of abstraction in order to simplify the steps in training the model, it also comes pre-installed with morgan fingerprint for converting our drugs to a vectors i.e a list of 0 and 1 interpretable by the model The training process is called FIT. Steps I took to train the model

GemmaTuron commented 2 years ago

Hi @EstherIdabor ! Good job, can you add some comments about the performance of your models?

EstherIdabor commented 2 years ago

Hi @EstherIdabor ! Good job, can you add some comments about the performance of your models?

Thank you @GemmaTuron I am on it, because i am very new to machine learning, I am trying hard to understand every step I should take

Femme-js commented 2 years ago

Hi @EstherIdabor !

Just to simplify the way in which you can interpret your model, here are some points you can look into:

I hope this can help you.

EstherIdabor commented 2 years ago

Evaluation of My Model

The validation set and test set were set aside and not used during the training process in order evaluate the performance of my model, that is they are unbiased dataset and the model does not have any memory of them, by obtaining predictions using the validation and test set, I was able to apply various performance metrics on it and therefore judged how it performed

Using the predict_proba function, I got the probability of the prediction of the validation and test molecules being 1(substrate) from whence the probability of the prediction not being 1 i.e 0 can also be ascertained.

Evaluation of the model was done using the following metrics;

Consequent from the above the following libraries was used;

AUROC CURVE

The ROC curve plots the the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds, TRP is the proportion of positive data points that are correctly predicted as positive. FPR is the proportion of negative data points that are incorrectly predicted as positives. The higher the AUC of the model(closer to 1 e.g 0.9) the better the performance of the model in distinguishing between the positive and negative classes. the lesser the AUC or farther away from 1, the lesser the performance of the model and poor at discrimination

values between 0.9-1 is considered excellent, between 0.8-0.9 is considered good, between 0.7-0.8 is considered fair and between 0.6-0.5 is considered poor performance.

performance on the validation set val_auroc

AUROC Value for the validation set = 0.6011 the auc value being 0.6011 means the model is performing poorly because it is almost near that of a random classifier and it is poor at dicrimination

performance on the test set test_auroc AUROC Value for the test set = 0.508 the auc for the test is low as well indicating the performance of the model is poor

Updated version By researching I discovered that low auroc value indicating poor performance could be as a result of imbalanced dataset or low training time. my dataset isn't imbalanced so I figured it is not getting enough training time and increased the training time to 1800sec which improved the model and made the auc value go up to 0.77 and 0.608 Validation set auroc_updated(1)

Test set test_auroc_updated

Confusion Matrix

This a table that show four values TP, TN, FP, FN upon which we are able to get precision and recall from

performance on the validation set contigencymatrix_val The validation dataset is a total of 67 drugs(smiles)

TP: 30 molecules classified by the model as being substrate are actually substrate _FP: 20 molecules were classified by the model as being substrate when they are actually not substrate TN:_ 8 molecules were classified by the model as being non-substrate and are actually non-substrate FN**: 9 molecules were classified by the model as being non-substrate when they are actually substrate

Precision: Precision is defined as the ratio of correctly classified positive samples (True Positive) to a total number of classified positive samples(either correctly or incorrectly), hence precision helps us to visualize the reliability of the machine learning model in classifying the model as positive. Precision = TP/TP+FP

Recall The recall is calculated as the ratio between the numbers of positive samples correctly classified as Positive to the total number of Positive samples. The recall measures the model's ability to detect positive samples. The higher the recall the more positive sample detected. Recall = TP/TP+FN

the validation set has a precision of 0.7894 and a recall of 0.769

performance on the test set contigencymatrix_test

The test dataset is a total of 134 drugs(smiles) TP: 60 molecules classified by the model as being substrate are actually substrate _FP: 59 molecules were classified by the model as being substrate when they are actually not substrate TN:_ 8 molecules were classified by the model as being non-substrate and are actually non-substrate FN**: 7 molecules were classified by the model as being non-substrate when they are actually substrate

the test dataset has a precision of 0.504 and a recall of 0.895

The false negative are quite significant which would be a bad sign when dealing with toxicity dataset, one way I think I can control this is to reduce the threshold

EstherIdabor commented 2 years ago

Hi @EstherIdabor !

Just to simplify the way in which you can interpret your model, here are some points you can look into:

  • Time taken by the model to train
  • Area under ROC curve and the range in which it lies, if it's greater than 0.7 your model is fairly good.
  • Checking the number of False Positives, Precision and Recall
  • What threshold you chose to covert predicted probabilities into classes and why?

I hope this can help you.

@Femme-js Thank you!

Zainab-ik commented 2 years ago

@EstherIdabor, The change in AUC is a very significant increase, did that increase occur in the validation set or test set?

EstherIdabor commented 2 years ago

@EstherIdabor, The change in AUC is a very significant increase, did that increase occur in the validation set or test set?

It occurred in the validation set, though I am trying to reason out why it didn't occur like that in the test set

EstherIdabor commented 2 years ago

CONCLUSION

Why is this prediction important? Prediction of CYP3A4 substrate is becoming an integral part of drug research, because its knowledge is important in preventing drug-drug interaction, and necessary for drug dosing. Administration of cyp3a4 substrate drug with a cyp3a4 inhibitor drug decreases an individual dosage requirement of a cyp3a4 drug. Missing this step in drug research can lead to unintended consequence, this is because administering drug without the knowledge of biotransformation profile can result in the serum level of these drugs reaching a toxic state, the toxicity can manifest itself with serious medical consequence. and can cause withdrawal from the market.

That said, In enhancing the performance of my model, I took into consideration the training time, as it is necessary to ensure the model is well trained or fitted to make good predictions, I also recognize that different learning algorithm can affect the performance as well, depending on the dataset. Altering the threshold can also help minimize false predictions. They are a list of exhaustive metrics but I consider the above best suited to judge the performance of my model.

Zainab-ik commented 2 years ago

CONCLUSION

Why is this prediction important? Prediction of CYP3A4 substrate is becoming an integral part of drug research, because its knowledge is important in preventing drug-drug interaction, and drug dosing. Administration of cyp3a4 substrate drug with a cyp3a4 inhibitor drug decreases an individual dosage requirement of a cyp3a4 drug. Missing this step in drug research can lead to unintended consequence, this is because administering drug without the knowledge of biotransformation profile can result in the serum level of these drugs reaching a toxic state, the toxicity can manifest itself with serious medical consequence. and can cause withdrawal from the market.

That said, In enhancing the performance of my model, I took into consideration the training time, as it is necessary to ensure the model is well trained or fitted to make good predictions, I also recognize that different learning algorithm can affect the performance as well, depending on the dataset. Altering the threshold can also help minimize false predictions. They are a list of exhaustive metrics but I consider the above best suited to judge the performance of my model.

Beautiful conclusion @EstherIdabor I'd say. I'd like to add that CYP3A4 is an enzyme involved in drug metabolism but not all classes of drugs. In as much as its knowledge is important to prevent dangerous DDI (drug-drug interaction) and even DFI (drug-food interaction). It's also important to note that it can enhance DDI thereby increasing the therapeutics effects of drugs.

You did a good job.

GemmaTuron commented 2 years ago

Hi @EstherIdabor !

thanks for the good job and for re-thinking and trying to improve the performance. I'll close this as completed and let you focus on your final application!