ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Cyp2d6 substrate #449

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Model Title

Cyp2d6 substrate (TDCommons)

Publication

Hello @ting96haha!

As part of your Outreachy contribution, we have assigned you the dataset "Cyp2d6 substrate" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

ting96haha commented 1 year ago

Noted with thanks!

ting96haha commented 1 year ago

Hi @GemmaTuron

This is my approach to building a binary classification ML model. You may look at my entire source code from my created Google Colab notebook from this link: https://colab.research.google.com/drive/18Jwptnjp76h5QXRH5IOhidtF5Q8o9QuS#scrollTo=fJgFlXZv7BuE

First, in order to build an ML model, I researched the information on the dataset from this link: https://tdcommons.ai/single_pred_tasks/adme/

I discovered that the CYP2D6 Substrate dataset is under the ADME Metabolism category. The information from the website is extracted as follows:

ADME Definition: A small-molecule drug is a chemical; it needs to travel from the site of administration (e.g., oral) to the site of action (e.g., a tissue), and then decompose and exit the body. To do that safely and efficaciously, the chemical must have numerous ideal absorption, distribution, metabolism, and excretion (ADME) properties. This task aims to accurately predict various kinds of ADME properties given a drug candidate's structural information.

Metabolism Definition: Drug metabolism measures how specialized enzymatic systems break down drugs, and it determines the duration and intensity of a drug's action.

CYP2D6 Substrate, Carbon-Mangels et al. Dataset Description: CYP2D6 is primarily expressed in the liver. It is also highly expressed in areas of the central nervous system, including the substantia nigra. TDC used a dataset from [1], which merged information on substrates and nonsubstrates from six publications.

Task Description: Binary Classification. Given a drug SMILES string, predict if it is a substrate to the enzyme. Dataset Statistics: 664 drugs.

ting96haha commented 1 year ago

Therefore, from the extracted information, I understood that my task is to build an ML model which can classify a given drug molecule as an active (substrate) or inactive (non-substrate) compound for the CYP2D6 enzyme.

The first step in building an ML model is to select the appropriate model to achieve our goal. As the outcome of the prediction is either active (1) or inactive (0), which is a binary, discrete (non-continuous) outcome, a binary classifier is the appropriate type of model.

In this assignment, we first train the machine learning model with lazy-qsar, a library customized to build QSAR AI/ML models quickly; it can be obtained from https://github.com/ersilia-os/lazy-qsar. The library can be installed with the command below.

pip install git+https://github.com/ersilia-os/lazy-qsar.git

To load the required dataset, we use the Therapeutics Data Commons (TDC), which provides rich information on biomedical entities. This information is carefully curated, processed, and readily available through the TDC package. The overall description of the dataset collection can be found here: https://tdcommons.ai/overview/

After loading the dataset, I split it into three subsets: train, test and validation, as in the sketch below.
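A minimal sketch of the loading and splitting step (assuming the standard PyTDC interface; the variable names train, valid and test are my own and are reused in the later steps):

```python
# Minimal sketch: load the CYP2D6 Substrate dataset from TDC and split it
# (assumes PyTDC is installed: pip install PyTDC)
from tdc.single_pred import ADME

data = ADME(name="CYP2D6_Substrate_CarbonMangels")
split = data.get_split()  # returns a dict of pandas DataFrames

train = split["train"]    # columns: Drug_ID, Drug (SMILES), Y (0/1 label)
valid = split["valid"]
test = split["test"]
```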

ting96haha commented 1 year ago

The analysis of the dataset is provided as follows:

The number of active and inactive molecules for each dataset category are provided as follows:

Train dataset:

Test dataset:

Validation dataset:

The bar chart below depicts the number of active and inactive molecules in each split: [bar chart of active vs. inactive counts]

Inference: The number of inactive molecules is higher than that of active molecules.
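For reference, the per-split counts can be obtained with a short sketch like the following (assuming the train, valid and test DataFrames from the splitting step):

```python
# Count active (1) and inactive (0) molecules in each split
for name, df in [("train", train), ("valid", valid), ("test", test)]:
    counts = df["Y"].value_counts().to_dict()
    print(name, "-> inactive:", counts.get(0, 0), "active:", counts.get(1, 0))
```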

ting96haha commented 1 year ago

An example of a drawing of an active molecule is shown below: [active molecule drawing]

An example of a drawing of an inactive molecule is shown below: [inactive molecule drawing]
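Drawings like these can be generated directly from SMILES strings with RDKit; a minimal sketch (the SMILES used here is caffeine, purely as an illustration):

```python
# Minimal sketch: render a molecule from a SMILES string with RDKit
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"  # caffeine, used only as an example
mol = Chem.MolFromSmiles(smiles)
Draw.MolToImage(mol, size=(300, 300))  # returns a PIL image (displayed inline in Colab)
```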

ting96haha commented 1 year ago

I did not need to transform the dataset molecules into a format that the computer can understand (numerical vectors or images) through a separate featurization step, since the lazy-qsar package automatically converts SMILES to Morgan fingerprints. Hence, I can directly create the lists smiles_train, y_train, smiles_test, y_test, smiles_valid and y_valid from the split dataset created earlier on.
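A minimal sketch of this data preparation and of the training/prediction step described in the next paragraph (the lazyqsar import path and the exact shape returned by predict_proba are assumptions; variable names are illustrative):

```python
# Minimal sketch: prepare SMILES/label lists and train a Morgan-fingerprint classifier
import lazyqsar as lq  # assumed import name for the lazy-qsar package

smiles_train, y_train = list(train["Drug"]), list(train["Y"])
smiles_valid, y_valid = list(valid["Drug"]), list(valid["Y"])
smiles_test,  y_test  = list(test["Drug"]),  list(test["Y"])

model = lq.MorganBinaryClassifier()
model.fit(smiles_train, y_train)

# predicted probabilities of being an active substrate
proba_test = model.predict_proba(smiles_test)
proba_valid = model.predict_proba(smiles_valid)
# if a 2-column (inactive, active) array is returned, keep only the active column, e.g.:
# proba_test = proba_test[:, 1]
```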

To train the lazy-qsar ML model, I used the smiles_train and y_train lists as the input, fitting a MorganBinaryClassifier(). The predicted Y values for the test dataset are:

[0.35243926 0.23901749 0.34656639 0.25262714 0.33415051 0.33621443 0.30969589 0.30542669 0.31771505 0.42379903 0.35836092 0.28843371 0.27098791 0.32302126 0.27393859 0.35428157 0.44591214 0.28843371 0.27140095 0.22033926 0.28848668 0.28843371 0.26760603 0.24907308 0.32777918 0.33538618 0.26989623 0.28077928 0.27212867 0.42469968 0.29665645 0.22194475 0.38855845 0.27140095 0.45452166 0.25625241 0.36629957 0.36186854 0.25625241 0.26789276 0.36491036 0.30711474 0.26053276 0.38074398 0.28430969 0.21553459 0.39627147 0.2374786 0.32679111 0.28843371 0.27140095 0.4097485 0.30181501 0.27137047 0.33573065 0.35428157 0.28153044 0.22703909 0.32956918 0.28843371 0.45273953 0.32153924 0.25456143 0.42882641 0.33613252 0.35892211 0.2458129 0.31909566 0.26393377 0.25625241 0.28843371 0.3013505 0.34904592 0.2550959 0.22056572 0.28843371 0.34593317 0.33298289 0.28843371 0.35545064 0.23933209 0.27140095 0.21194061 0.17660745 0.46982568 0.24407186 0.26284567 0.32731807 0.31647416 0.44934005 0.3299135 0.26760603 0.2353426 0.2696599 0.34659576 0.26760603 0.22303449 0.29590878 0.47118413 0.31204919 0.3170019 0.28856004 0.27328518 0.36043521 0.31830045 0.35751738 0.31051717 0.3018384 0.29755559 0.2988268 0.24552067 0.38030879 0.26351483 0.29972292 0.15969457 0.28843371 0.28843371 0.26810292 0.33646817 0.29842821 0.27140095 0.3299135 0.24566715 0.24407186 0.29181091 0.24618919 0.24654062 0.23593578 0.2795629 0.44591214 0.24618919 0.28276233 0.28213633]

The predicted Y for validated datasets is provided as: [0.30594579 0.27245714 0.23643377 0.19410331 0.3299135 0.43027566 0.28849381 0.24932787 0.32997408 0.30945623 0.29890095 0.26989623 0.30578026 0.31830045 0.39572557 0.42929372 0.4408318 0.28843371 0.31469625 0.28134248 0.27328518 0.36621654 0.22233912 0.27140095 0.25359119 0.27157186 0.28843371 0.19357039 0.26284567 0.38371003 0.22703909 0.1578837 0.29555831 0.34943601 0.20374544 0.23034345 0.28843371 0.34402839 0.32658106 0.38748127 0.43087248 0.39303001 0.31982055 0.26008324 0.43777783 0.29588982 0.2401693 0.15969457 0.57884821 0.27649787 0.28839392 0.19444475 0.28843371 0.29033112 0.31736481 0.27649787 0.33937654 0.22703909 0.21609646 0.31368014 0.31059963 0.26789276 0.26284567 0.3019446 0.25625241 0.25335486 0.34067047]

The predicted Y value is not discrete: it is the probability of a molecule being active. The closer the predicted value is to one, the higher the probability that the molecule is an active substrate. In this proposed model, the probability threshold for calling a molecule active is set to 0.5, i.e. predictions above 0.5 are classified as active.
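A minimal sketch of this thresholding step (assuming proba_valid and proba_test hold the predicted probabilities from the previous step):

```python
import numpy as np

threshold = 0.5
y_pred_valid = (np.array(proba_valid) > threshold).astype(int).tolist()
y_pred_test = (np.array(proba_test) > threshold).astype(int).tolist()
```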

Hence, the predicted Y for the validated dataset is provided as follows after transformation based on the prediction threshold: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Hence, the predicted Y for the test dataset is provided as follows after transformation based on the prediction threshold: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

At first glance, looking at the test and validation predictions, I can notice that significantly more molecules are predicted inactive than active. The result is interpreted quantitatively through the following methods.

ting96haha commented 1 year ago

Two measures are used to investigate the model performance quantitatively: the ROC curve with its AUROC value, and the confusion matrix.

The ROC curve is shown as follows:

[ROC curve for the validation dataset]

When we need to check or visualize the performance of a classification model, we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve. It is one of the most important evaluation metrics for checking any classification model's performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics curve).

The AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease.

The determined AUROC value is 0.76
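A minimal sketch of how the ROC curve and AUROC can be computed with scikit-learn (assuming y_valid and proba_valid from the earlier steps):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_valid, proba_valid)
auroc = roc_auc_score(y_valid, proba_valid)

plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # random-classifier baseline
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.legend()
plt.show()
```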

GemmaTuron commented 1 year ago

Hi @ting96haha ,

Can you comment on the contingency tables and the values you are getting? What could you do given that you have a good auc? Remember that the threshold for the contingency tables is automatically set up but can be changed if needed.

ting96haha commented 1 year ago

> Hi @ting96haha ,
>
> Can you comment on the contingency tables and the values you are getting? What could you do given that you have a good auc? Remember that the threshold for the contingency tables is automatically set up but can be changed if needed.

Sure, I will do it by today.

ting96haha commented 1 year ago

AUC is described as the area under the ROC curve, and it represents how well the predicted probabilities separate the positive class from the negative class of the dataset.

A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate (TPR) and the False Positive Rate (FPR).

From the binary classifier trained earlier, a predicted probability is obtained for each molecule. The ROC curve is then plotted as the sensitivity (TPR) against 1 - specificity (FPR) at different thresholds.

The True Positive Rate (TPR), or sensitivity, is the proportion of actual positives that the algorithm captures:

Sensitivity = TPR (True Positive Rate) = Recall = TP / (TP + FN)

The False Positive Rate (FPR) is the proportion of actual negatives that are incorrectly classified as positive:

1 - Specificity = FPR (False Positive Rate) = FP / (TN + FP)

A ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

Our work presents an AUC value of 0.76 for the validation dataset, which means that if one active and one inactive molecule are picked at random, the model ranks the active molecule higher about 76% of the time.

ting96haha commented 1 year ago

The following shows a general guideline for interpreting an AUC score:

However, an “excellent” AUC score differs per industry and shall be determined by the researcher.

For instance, the researcher should aim for a higher AUC when the risk of an incorrect prediction is great. If a logistic regression model is used to determine whether a patient will get cancer, and the cost of being inaccurate can be life-threatening, the AUC score must be high and the model should predict correctly essentially all the time. However, for areas that do not require high sensitivity, such as marketing, a lower AUC value might be enough; in that case, a model with an AUC just above 0.5 might already be helpful.

Since our classification is in the medical field, I shall always aim for a higher AUC value, since a wrong classification might be dangerous to the end user. A model with a higher AUC score is more reliable because it takes into account the ranking of the predicted probabilities, and it is more likely to give higher accuracy when predicting future data. If the AUC value is not satisfactory or does not reach the targeted value, I shall choose a different classifier to obtain a better model with a higher AUC for better data classification.

ting96haha commented 1 year ago

A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. A confusion matrix computed for the same test set of a dataset, but using different classifiers, can also help compare their relative strengths and weaknesses and draw an inference about how they can be combined to obtain the optimal performance. A binary class dataset is one that consists of just two distinct categories of data. These two categories can be named "positive" and "negative" for the sake of simplicity, and can be represented in a table with 4 different combinations of predicted and actual values: a true positive (TP, actual positive predicted as positive), a false negative (FN, actual positive predicted as negative), a false positive (FP, actual negative predicted as positive) and a true negative (TN, actual negative predicted as negative).

In our case, I selected 0.5 as the threshold probability to classify the molecule to be active or inactive. However, the value of the threshold can be changed if needed.

[confusion matrix at the 0.5 threshold]

From the confusion matrix, the number of true positive, true negative, false positive and false negative are described as follows.
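A minimal sketch of how such a confusion matrix can be computed (assuming y_valid and the thresholded predictions y_pred_valid from the steps above):

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_valid, y_pred_valid).ravel()
print(f"TP = {tp}, TN = {tn}, FP = {fp}, FN = {fn}")
```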

ting96haha commented 1 year ago

Accuracy is a metric that generally describes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio between the number of correct predictions and the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is calculated as the ratio between the number of positive samples correctly classified and the total number of samples classified as positive (either correctly or incorrectly):

Precision = TP / (TP + FP)

The precision measures the model's accuracy in classifying a sample as positive, i.e. how reliable the model is when it labels a sample positive. Precision is high when the model makes many correct positive classifications (maximizing true positives) and few incorrect positive classifications (minimizing false positives). The goal of precision is to classify positive samples as positive without misclassifying negative samples as positive.

Recall is calculated as the ratio between the number of positive samples correctly classified as positive and the total number of positive samples:

Recall = TP / (TP + FN)

The recall measures the model's ability to detect positive samples: the higher the recall, the more positive samples are detected. Recall cares only about how the positive samples are classified and, unlike precision, is independent of how the negative samples are classified. When the model classifies all the positive samples as positive, the recall will be 100% even if all the negative samples were also incorrectly classified as positive.

The difference between precision and recall is that precision focuses on how reliable the positive classifications are, without caring whether every positive sample has been found, while recall cares about capturing all the positive samples, without caring whether negative samples are also classified as positive.

In a nutshell, as our model has low recall but high precision, it is accurate when it does classify a sample as positive; however, it only identifies a few of the positive samples and wrongly classifies most positive molecules as negative.

If the goal is to detect all the positive samples (without caring whether negative samples would be misclassified as positive), then use recall. Use precision if the problem is sensitive to classifying a sample as positive in general, i.e. to negative samples being falsely classified as positive. The decision to use precision or recall shall be discussed among researchers to identify whether it is more important to find all the positive molecules in the sample (use recall) or to be confident that a sample classified as positive really is positive (use precision). Both metrics can be computed as in the sketch below.
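A minimal sketch of computing accuracy, precision and recall with scikit-learn (y_valid and y_pred_valid are assumed to come from the thresholding step above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy :", accuracy_score(y_valid, y_pred_valid))
print("Precision:", precision_score(y_valid, y_pred_valid))
print("Recall   :", recall_score(y_valid, y_pred_valid))
```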

ting96haha commented 1 year ago

I can set a threshold value to classify all the probabilities greater than the threshold as 1 and those lower as 0. Since the predictions are mostly negative, it might be better to lower the threshold below the default 0.5 and see how the confusion matrix parameters change.

After the threshold is lowered to 0.4, the numbers of true positives, true negatives, false positives and false negatives are as follows.

[confusion matrix at the 0.4 threshold]

False positives increase and false negatives decrease. As a result, this time, precision decreases and recall increases.

Hence, the threshold can be altered to reach our desired precision and recall value.

GemmaTuron commented 1 year ago

Hello @ting96haha , good job on the modelling and explaining the results. Seeing that the model does not do the best job of predicting actives, I would increase the training time. But let's close this as completed for now and go on to finishing your Outreachy contributions and final application, thanks!