ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Solubility #423

Closed · GemmaTuron closed this issue 2 years ago

GemmaTuron commented 2 years ago

Model Title

Solubility (TDC dataset)

Publication

Hello @femme-js!

As part of your Outreachy contribution, we have assigned you the dataset "Solubility" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on your progress. We will value not only building the model but also interpreting its results.

Code

No response

Femme-js commented 2 years ago

Hi @GemmaTuron !

This is the link to my notebook. I will be completing the steps for this task and adding interpretations of each step as comments in the Colab, while keeping you updated on progress via this issue.

Thank You!

Femme-js commented 2 years ago

Hi @GemmaTuron! I saw that Solubility_AqSolDB is a regression task rather than a binary classification task. Can you confirm whether I am referring to the same dataset?

(Screenshot: 2022-10-26 23-39-15)

Femme-js commented 2 years ago

Dataset Overview: Solubility_AqSolDB

Aqueous solubility measures a drug's ability to dissolve in water. Poor water solubility can lead to slow drug absorption and inadequate bioavailability, and can even induce toxicity. More than 40% of new chemical entities are not soluble (Savjani et al. 2012). This dataset is collected from AqSolDB (Sorkun et al. 2019), which contains 9,982 drugs curated from 9 different publicly available datasets.

Suggested data split and evaluation, according to the Therapeutics Data Commons publication: scaffold split and MAE.

Learning task: single_pred.ADME

Femme-js commented 2 years ago

Steps to Build the Model:

  1. Loading the data from Therapeutics Data Commons
  2. Data Analysis and Visualization
  3. Scaling the data and converting the labels for the classification task
  4. Data Preprocessing and Training using LazyQSAR
  5. Model Evaluation on Test Data and Validation Data

Loading The Data

Loaded the Solubility dataset (Solubility_AqSolDB) from Therapeutics Data Commons using the PyTDC module.

Solubility_AqSolDB is one of the assays in the ADME dataset group; it was retrieved by importing the ADME single-prediction task data loader, as sketched below.
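
A minimal loading sketch with PyTDC (assuming the standard `tdc.single_pred.ADME` loader; the column names follow the TDC documentation):

```python
# Load Solubility_AqSolDB via the ADME single-prediction data loader.
from tdc.single_pred import ADME

data = ADME(name="Solubility_AqSolDB")
split = data.get_split()  # dict of DataFrames: 'train', 'valid', 'test'

train_df, valid_df, test_df = split["train"], split["valid"], split["test"]
# Columns: Drug_ID, Drug (SMILES string), Y (log solubility in mol/L)
print(len(train_df), len(valid_df), len(test_df))  # expected: 6988, 998, 1996
```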

What are we trying to predict?

For a small-molecule drug to travel from the site of administration to the site of action safely and efficaciously, it needs to have ideal ADME (absorption, distribution, metabolism, and excretion) properties.

The aqueous solubility dataset consists of 9,982 drugs (Drug ID and SMILES string) with log water solubility values in mol/L.

We are trying to predict the solubility of a drug SMILES string using a binary classifier.

Data Analysis and Visualization

We divide the loaded data into train/validation/test sets, which are already split by the PyTDC module.

Total data samples: Train set: 6,988, Validation set: 998, Test set: 1,996.

Figure: Distribution of train set labels (Y) before normalization.

We normalized the labels column to bring it onto a common scale between 0 and 1.

Figure: Train set label distribution after normalization.

We used the mean of the distribution to separate the labels into binary classes (soluble: 1, insoluble: 0), as sketched below.
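
A minimal sketch of this preprocessing, assuming min-max scaling of `Y` with parameters from the train set and labelling values at or above the mean of the scaled train distribution as soluble (the exact scaling and threshold direction are assumptions based on the description above):

```python
# Min-max parameters and threshold are derived from the train set only.
y_min, y_max = train_df["Y"].min(), train_df["Y"].max()
threshold = ((train_df["Y"] - y_min) / (y_max - y_min)).mean()

def binarize(df):
    """Scale Y to [0, 1] and binarize: 1 = soluble, 0 = insoluble."""
    y_scaled = (df["Y"] - y_min) / (y_max - y_min)
    return (y_scaled >= threshold).astype(int).values

y_train, y_valid, y_test = binarize(train_df), binarize(valid_df), binarize(test_df)
```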

Number of soluble and insoluble samples in each set:

Train set: 3,913 soluble, 3,075 insoluble

Validation set: 444 soluble, 554 insoluble

Test set: 1,170 soluble, 826 insoluble

Figure: Train label counts (bar plot made with Matplotlib).

Molecular structures of a soluble and an insoluble molecule:

Figure: Soluble drug

Figure: Insoluble drug

Data Preprocessing and Model Training

Here, the input to the model is the drug's SMILES string. For a machine learning model, the input data needs to be converted into a numeric/vector format. We use the LazyQSAR AutoML module, which preprocesses the SMILES strings into Morgan fingerprints and fits them with a Morgan binary classifier, as sketched below.
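
A training sketch with LazyQSAR; the import path and the `fit`/`predict_proba` signature are assumed from the ersilia-os/lazy-qsar README and may differ slightly between versions:

```python
# Assumed import path; in some versions it may be
# `from lazyqsar.binary.morgan import MorganBinaryClassifier`.
from lazyqsar import MorganBinaryClassifier

smiles_train = train_df["Drug"].tolist()
smiles_valid = valid_df["Drug"].tolist()

model = MorganBinaryClassifier()  # featurizes SMILES as Morgan fingerprints internally
model.fit(smiles_train, y_train)

# Assumes a scikit-learn-style predict_proba returning one column per class.
valid_proba = model.predict_proba(smiles_valid)[:, 1]  # probability of the "soluble" class
```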

Model Training Time: 30.9 seconds

Model Testing and Validation

The metrics used to measure the performance of the model are the ROC curve and the confusion matrix.

ROC curves are best used when we have balanced classes. In this analysis, we had closely balanced train/val/test sets.

The Morgan binary classifier predicts a probability score for each class. To interpret the predicted probabilities we are free to calibrate the threshold; in this case, we used the default threshold of 0.5.

When making a prediction for a binary classification problem, there are two types of errors that we could make.

False positive: predicting an event when there was no event.

False negative: predicting no event when in fact there was an event.

By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen.
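
For completeness, this is how the default 0.5 threshold is applied to the validation-set probabilities from the training sketch above:

```python
import numpy as np

# Probabilities >= 0.5 are labelled soluble (1), the rest insoluble (0).
valid_pred = (np.asarray(valid_proba) >= 0.5).astype(int)
```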

ROC Curve

ROC curves plot the false positive rate (x-axis) versus the true positive rate (y-axis).

The area under the ROC curve, known as AUC, ranges from 0 to 1. As a rule of thumb, an area > 0.7 indicates a fair model, and > 0.9 indicates an excellent model.

*AUC-ROC for the validation set: 0.879*

Figure: ROC curve on the validation set.
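
The curve and the AUC can be computed with scikit-learn, using the validation labels and probabilities from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, _ = roc_curve(y_valid, valid_proba)
auc = roc_auc_score(y_valid, valid_proba)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```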

Confusion Matrix

It is useful for measuring recall, precision, specificity, and accuracy from TP (true positives), FP (false positives), FN (false negatives), and TN (true negatives); varying the decision threshold over the predicted probabilities is also what produces the points of the ROC curve.

Figure: Confusion matrix on the validation set.

Precision on the validation set: 0.70. Recall on the validation set: 0.89.

The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is intuitively the ability of the classifier to find all the positive samples.
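
These quantities can be computed with scikit-learn from the thresholded validation predictions above:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

cm = confusion_matrix(y_valid, valid_pred)  # rows: true class, columns: predicted class
print(cm)
print(f"Precision: {precision_score(y_valid, valid_pred):.2f}")
print(f"Recall:    {recall_score(y_valid, valid_pred):.2f}")
```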

Evaluation on Test Set

Area under the ROC curve for the test set: 0.9

Figure: ROC curve on the test set.

Figure: Confusion matrix on the test set.

Precision on the test set: 0.84. Recall on the test set: 0.86.

Interpretation

The high AUC-ROC value on the test set (0.9) indicates that the model is excellent.

Femme-js commented 2 years ago

@GemmaTuron

Here is the link to the Colab.

Looking forward to your feedback.

EstherIdabor commented 2 years ago

Hi @Femme-js, the visualisation of your data looks pretty nice. I'm guessing you used a histogram because it's continuous data, but I'm trying to understand how you were able to build a binary classification ML model from continuous data.

Femme-js commented 2 years ago

Hi @EstherIdabor! I first checked the distribution of the data to see whether it is normally distributed or not. There were 3-4 outliers that were skewing the dataset, so I normalized the labels to the range 0 to 1 and took the mean as the central tendency to divide the SMILES into soluble and insoluble. Since the solubility values are logs of mol/L, solubility is relative here, so we can set any threshold between the two classes depending on which value we consider soluble enough. In this case, I chose to go with the data's central tendency.

GemmaTuron commented 2 years ago

Hi @Femme-js !

Very good job, you had a complex task. To determine thresholds in experimental chemistry, we can use your approach, or, if we have contact with the person who did the actual experiment, ask for their guidance on what they consider a good threshold. Well done, I'll mark this as completed and you can focus on finishing the application!

Chigoziee commented 2 years ago

Great job @Femme-js. I like the way you converted a regression dataset/problem into a classification task by setting a cutoff value for the label, which you then used to group the drugs into two classes: soluble and insoluble. Framing it as a classification problem seems more suitable than a regression problem, since the whole idea is to identify which drugs dissolve better in water. Nice work @Femme-js!