Closed GemmaTuron closed 2 years ago
Hi @GemmaTuron !
This is the link to my notebook. I will complete the steps for this task and add my interpretation of each step as comments in the Colab, and I will keep you updated on progress via this issue.
Thank You!
Hi @GemmaTuron! I saw that Solubility_AqSolDB is a regression task instead of a binary task. Can you confirm if I am referring to the same dataset?
Dataset Overview: Solubility_AqSolDB
Aqueous solubility measures a drug’s ability to dissolve in water. Poor water solubility could lead to slow drug absorption, inadequate bioavailability, and even induce toxicity. More than 40% of new chemical entities are not soluble (Savjani et al. 2012). This dataset is collected from AqSolDB (Sorkun et al. 2019), which contains 9,982 drugs curated from 9 different publicly available datasets.
Suggested data split and evaluation, according to the Therapeutics Data Commons publication: scaffold split and MAE.
Learning Tasks : single_pred.ADME
Loaded the Solubility dataset (Solubility_AqSolDB) from Therapeutics Data Commons using the PyTDC module.
Solubility_AqSolDB is one of the ADME datasets and was retrieved through the ADME single-prediction task data loader.
What are we trying to predict?
For a small-molecule drug to travel from the site of administration to the site of action safely and efficaciously, it needs to have ideal ADME (absorption, distribution, metabolism, and excretion) properties.
The aqueous solubility dataset consists of 9982 drugs (Drug ID and SMILES) with water solubility values given as the logarithm of mol/L.
We are trying to predict whether a drug, given as a SMILES string, is soluble, using a binary classifier.
We use the train/validation/test sets as they are already split by the PyTDC module.
Total data samples: Train set: 6988, Validation set: 998, Test set: 1996
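For reference, a minimal sketch of how the dataset and its split can be loaded with PyTDC. The helper name `load_solubility_split` is hypothetical, and the call uses PyTDC's default split; whether the notebook used the default or `method="scaffold"` is not stated in this thread.

```python
def load_solubility_split():
    """Return the train, validation and test DataFrames for Solubility_AqSolDB."""
    # Imported lazily so that defining this helper does not require PyTDC.
    from tdc.single_pred import ADME

    # Downloads the dataset on first use.
    data = ADME(name="Solubility_AqSolDB")

    # get_split() returns a dict with "train", "valid" and "test" DataFrames.
    # Assumption: default split settings; pass method="scaffold" for the
    # scaffold split suggested by Therapeutics Data Commons.
    split = data.get_split()
    return split["train"], split["valid"], split["test"]


# Example usage (needs PyTDC installed and network access):
# train, valid, test = load_solubility_split()
# print(len(train), len(valid), len(test))
```

Each DataFrame holds the drug ID, the SMILES string and the label in the `Drug_ID`, `Drug` and `Y` columns.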
Distribution of train set labels Y before normalization.
We have normalized the labels column to bring it on a common scale between 0 and 1.
Train Set Label Distribution after Normalization.
We have used the mean of the distribution to separate labels into binary classes (Soluble:1 and Insoluble:0)
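The normalization and mean cutoff described above can be sketched as follows. This is only an illustration in plain Python: the function names and the sample values are hypothetical (not taken from the dataset), and treating values exactly at the mean as soluble is an assumption.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, cannot min-max scale")
    return [(v - lo) / (hi - lo) for v in values]


def binarize_at_mean(values):
    """Label each value 1 (soluble) if its scaled value is >= the scaled mean, else 0."""
    scaled = min_max_scale(values)
    threshold = sum(scaled) / len(scaled)
    # Assumption: values exactly at the mean count as soluble.
    labels = [1 if s >= threshold else 0 for s in scaled]
    return labels, threshold


# Hypothetical log-solubility values in log(mol/L), for illustration only.
log_s = [-8.0, -4.0, -3.0, -2.0, 0.0]
labels, threshold = binarize_at_mean(log_s)
print(labels, threshold)
```

Because min-max scaling preserves the order of the values, comparing the raw values with their raw mean gives the same labels.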
Number of Soluble and Insoluble samples in each set:
Train Set Soluble: 3913 Insoluble: 3075
Validation Set Soluble: 444 Insoluble: 554
Test Set Soluble: 1170 Insoluble: 826
Train Labels Count using Matplotlib
Molecular structure of a soluble and insoluble molecule.
Soluble Drug
Insoluble Drug
Here, the input data for the model is the drug's SMILES string. For a machine learning model, the input data needs to be converted into a numeric/vector format. We are using the LazyQSAR AutoML module, which preprocesses the SMILES strings into Morgan fingerprints and fits the Morgan Binary Classifier on them.
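To illustrate what turning a SMILES string into a Morgan fingerprint looks like, here is a minimal sketch using RDKit directly. It is not LazyQSAR's internal code: the helper name `smiles_to_morgan_bits` is hypothetical, and the radius of 2 and length of 2048 bits are assumed values, not settings confirmed in this thread.

```python
def smiles_to_morgan_bits(smiles, radius=2, n_bits=2048):
    """Convert a SMILES string into a Morgan fingerprint as a list of 0/1 bits."""
    # Imported lazily so that defining this helper does not require RDKit.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when it cannot parse the SMILES string.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # Assumption: radius and n_bits are illustrative defaults.
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return list(fingerprint)


# Example usage (needs RDKit installed):
# bits = smiles_to_morgan_bits("CCO")  # ethanol
# print(len(bits), sum(bits))
```

Each molecule becomes a fixed-length binary vector, which is the kind of numeric input a classifier can be fitted on.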
Model Training Time: 30.9 seconds
The metrics used to measure the performance of the model are the ROC curve and the confusion matrix.
ROC curves are best used when we have balanced classes. In this analysis, we had closely balanced train/val/test sets.
The Morgan Binary Classifier predicts a probability score for each class. To turn the predicted probabilities into class labels we are free to choose the threshold. In this case, we have used the default threshold of 0.5.
When making a prediction for a binary classification problem, there are two types of errors that we could make.
False Positive: predicting an event when there was no event.
False Negative: predicting no event when in fact there was an event.
By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen.
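The trade-off between the two error types can be sketched like this. The function name `count_errors` and the labels and probabilities are hypothetical and only serve to show how moving the threshold shifts errors from one type to the other.

```python
def count_errors(y_true, y_prob, threshold=0.5):
    """Return (false positives, false negatives) for a given probability threshold."""
    # A sample is predicted positive when its probability is >= threshold.
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    return fp, fn


# Hypothetical true labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    fp, fn = count_errors(y_true, y_prob, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

A lower threshold produces more false positives and fewer false negatives, and a higher threshold does the opposite.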
ROC Curve
ROC curves plot the false positive rate (x-axis) versus the true positive rate (y-axis).
The area under the ROC curve, known as AUC, ranges from 0 to 1. According to the general interpretation, an area > 0.7 indicates a fair model, and > 0.9 indicates an excellent model.
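One way to see what the AUC measures: it equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python; the function name `roc_auc` and the sample data are hypothetical, and this is not necessarily how the notebook computed its values.

```python
def roc_auc(y_true, y_score):
    """Compute the AUC by comparing every positive score with every negative score."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted scores, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]
print(roc_auc(y_true, y_score))
```

A model that ranks every positive above every negative scores 1.0, and a model that gives all samples the same score lands at 0.5.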
_AUC_ROC for Validation set : 0.879_
ROC with Validation Set
Confusion Matrix
It is useful for measuring Recall, Precision, Specificity, and Accuracy, using TP (True Positives), FP (False Positives), FN (False Negatives), and TN (True Negatives). The confusion matrix at each threshold also gives one point (false positive rate, true positive rate) on the ROC curve.
Confusion Matrix with Validation Set
Precision of Validation Set: 0.70 Recall of Validation Set: 0.89
The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is intuitively the ability of the classifier to find all the positive samples.
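As a small worked example, these metrics follow directly from the four confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and do not correspond to the validation or test results reported here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from confusion-matrix counts.

    Raises ZeroDivisionError if a denominator is zero, for example when
    nothing was predicted positive.
    """
    return {
        "precision": tp / (tp + fp),      # of predicted positives, how many are right
        "recall": tp / (tp + fn),         # of actual positives, how many were found
        "specificity": tn / (tn + fp),    # of actual negatives, how many were found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```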
Evaluation on Test Set
Area Under ROC for Test Set : 0.9
Auc_Roc for Test Set
Confusion Matrix with Test Set
Precision of Test Set: 0.84 Recall of Test Set: 0.86
The high AUC-ROC value on the test set (0.9) indicates that the model is excellent by the general interpretation above.
Hi @Femme-js, the visualisation of your data looks pretty nice. I'm guessing you used a histogram because the data is continuous, but I'm trying to understand how you were able to build a binary classification ML model from continuous data.
Hi @EstherIdabor! I first checked the distribution of the data and whether it was normally distributed. There were 3-4 outliers that made the dataset skewed, so I normalized the values to the range 0 to 1 and took the mean as the measure of central tendency to divide the SMILES into soluble and insoluble. As the solubility values were the logs of mol/L, solubility is relative here, so we can pick any threshold to separate the two classes, depending on the value at which we consider a drug soluble enough. In this case, I chose to go with the central tendency of the data.
Hi @Femme-js !
Very good job, you had a complex task. To determine thresholds in experimental chemistry, we can use your approach, or, if we have contact with the person who did the actual experiment, ask for their guidance on what they consider a good threshold. Well done, I'll mark this as completed and you can focus on finishing the application!
Great job @Femme-js! I like the way you converted a regression dataset/problem into a classification task by setting a cutoff value for the label, which you then used to group the drugs into two classes: soluble and insoluble. Framing it as a classification problem seems better suited than a regression problem, since the whole idea is to identify which drugs dissolve better in water. Nice work @Femme-js
Model Title
Solubility (TDC dataset)
Publication
Hello @femme-js!
As part of your Outreachy contribution, we have assigned you the dataset "Solubility" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.
Code
No response