ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Tox21 NR-AR-LBD #433

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Model Title

Tox21 NR-AR-LBD

Publication

Hello @Yayeks!

As part of your Outreachy contribution, we have assigned you the dataset "Tox21 NR-AR-LBD" from the Therapeutics Data Commons to build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on your progress. We will value not only building the model but also interpreting its results.

Code

No response

Yayeks commented 1 year ago

Hello @GemmaTuron, which metric should be given more relevance, precision or recall? For the ROC curve, which is better: roc_curve or RocCurveDisplay?

Please check out my progress (https://colab.research.google.com/drive/1nO8MsCvqB8zzdnunWs6At-IfpvqS1Mud?usp=sharing).

GemmaTuron commented 1 year ago

Hi @Yayeks !

Good start, some suggestions:

About your questions:

Yayeks commented 1 year ago

Okay, thank you for your suggestions @GemmaTuron, I will implement them.

Yayeks commented 1 year ago

Hello @GemmaTuron, I have taken note of the suggestions you mentioned earlier and put some of them to use.

So the idea is to build a binary classification ML model using MorganBinaryClassifier on the dataset "Tox21 NR-AR-LBD". After analyzing the data, it was discovered that we have an imbalanced dataset, so we are dealing with an imbalanced classification problem. Imbalanced problems are most often difficult to model. More information is in the attached Colab notebook.

Yayeks commented 1 year ago

@GemmaTuron, the above link shows my progress so far; your suggestions/corrections are highly welcome. Thank you.

GemmaTuron commented 1 year ago

Hi @Yayeks !

Good job! To conclude the task, could you make a summary of your findings here for everyone?

Yayeks commented 1 year ago

Problem

In this issue, we focus on building a predictive classification ML model using the Therapeutics Data Commons (TDC) datasets. I was assigned the Tox21 dataset, which contains qualitative toxicity measurements for 7,831 compounds.

Simply put, we are going to predict toxicity outcomes of drug treatments in humans. Specifically, the model predicts whether a drug will activate the nuclear receptor androgen receptor ligand-binding domain (NR-AR-LBD), which can enter the nucleus of the cell and bind DNA, activating or inhibiting genes in ways that can lead to toxicity.

We take the following steps in order to achieve our aim:

  1. Load our data
  2. Analyse our data
  3. Visualization
  4. Preparation of Data
  5. Train the model
  6. Model Prediction
  7. Evaluation of Model

Load your data

Our dataset Tox21 NR-AR-LBD was obtained from the TDC website. We start by installing TDC and all the packages we might need. Since the Tox21 dataset consists of 12 different targets, we first load the whole Tox21 dataset and then select our specific assay (NR-AR-LBD). Our dataset consists of 6,758 drugs.

We then split our dataset using the TDC get_split method, which produced the train, validation, and test sets.
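A minimal sketch of this loading step, assuming the standard PyTDC API; the variable names are mine:

```python
# Load the Tox21 NR-AR-LBD assay from the Therapeutics Data Commons.
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Tox

# Tox21 bundles 12 assay labels; confirm ours is among them.
labels = retrieve_label_name_list("Tox21")        # includes 'NR-AR-LBD'
data = Tox(name="Tox21", label_name="NR-AR-LBD")

# get_split returns a dict of pandas DataFrames: 'train', 'valid', 'test',
# each with columns Drug_ID, Drug (SMILES), and Y (binary label).
split = data.get_split()
train, valid, test = split["train"], split["valid"], split["test"]
print(len(train) + len(valid) + len(test))        # ~6,758 drugs in total
```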

Analyse your data

The next step is to get an understanding of our data. In our Tox21 NR-AR-LBD dataset, we have two features, the Drug (SMILES) and the Drug ID, and one target variable, which specifies whether the drug is inactive/non-toxic (indicated by a 0) or active/toxic (indicated by a 1).

Counting the labels, we can see that the inactives far outnumber the actives: the dataset is unevenly distributed. Such a dataset is called an imbalanced dataset, and it leads to an imbalanced classification problem, which is hard to model ☹ because the class distribution is skewed.
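A quick sketch of this check, assuming the TDC split from above with the binary label in the default 'Y' column:

```python
# Count inactives (0) versus actives (1) in the training set.
counts = train["Y"].value_counts().sort_index()   # index 0 first, then 1
print(counts)
print(f"actives: {100 * counts.get(1, 0) / len(train):.1f}%")
```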

Visualization

It is often said that the human brain processes visual content far faster than text, so a picture can be worth a great many words. We therefore visualize the data to get a better understanding of the numbers, using the matplotlib package to plot the outcomes. [Pie/bar chart of the class distribution.] We can see how minuscule the share of actives is in the larger scheme of things.
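A sketch of the pie chart with matplotlib, reusing the label counts from above:

```python
import matplotlib.pyplot as plt

# Visualize the class imbalance as a pie chart.
plt.pie(counts.values, labels=["inactive (0)", "active (1)"],
        autopct="%1.1f%%", colors=["tab:blue", "tab:red"])
plt.title("Tox21 NR-AR-LBD class distribution")
plt.show()
```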

The RDKit package is used to visualize and get an idea of what an active/inactive molecule looks like; it is a useful tool for drawing chemical molecules. [Grid image of example molecules.]
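A sketch of the drawing step, assuming the SMILES strings live in TDC's 'Drug' column; selecting four actives is illustrative:

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Parse a few active molecules and draw them in a grid.
active_smiles = train[train["Y"] == 1]["Drug"].head(4)
mols = [Chem.MolFromSmiles(smi) for smi in active_smiles]
Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(200, 200))
```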

GemmaTuron commented 1 year ago

Hi @Yayeks

Can you explain why you are using this:

# convert our predicted values for validation to discrete values using the optimum threshold
Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)

As input for the fpr and tpr calculation?

Yayeks commented 1 year ago

Preparation of Data

We split our data into a feature list and a target list. The lazy-qsar package is used because it automatically converts molecules to Morgan fingerprints. The Morgan fingerprint is the most popular molecular fingerprint, and molecular fingerprints are essential cheminformatics tools for virtual screening and mapping chemical space.
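For illustration, this is roughly the featurization that lazy-qsar automates, written out with plain RDKit; radius 2 and 2048 bits are common defaults, not necessarily the package's exact settings:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Encode a SMILES string as a binary Morgan (circular) fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X_train = np.stack([morgan_fp(smi) for smi in train["Drug"]])
y_train = train["Y"].values
```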

Train our ML model

This is the phase of the data science lifecycle where practitioners search for the combination of model weights and biases that minimizes a loss function over the training data. The purpose of model training is to build the best mathematical representation of the relationship between the data features and a target label.

We installed the lazy-qsar package and imported the MorganBinaryClassifier to use for classification. Note: we used a classification technique because the target variable is discrete.

The performance of the model determines the quality of the applications built on top of it. We use the train set to train the model, by instantiating the model and fitting the features and target variable to it. The AutoML search keeps iterating until it finds the best fit; in our case the resulting model is RandomForestClassifier(criterion='entropy', max_features=0.0635427281524705, max_leaf_nodes=15, n_estimators=10, n_jobs=-1).
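A sketch of the training call; the import path and constructor arguments are assumptions, as lazy-qsar's exact API may differ:

```python
# lazy-qsar works directly on SMILES: it featurizes with Morgan fingerprints
# and runs an AutoML search internally (here yielding a random forest).
from lazyqsar import MorganBinaryClassifier

model = MorganBinaryClassifier()
model.fit(train["Drug"].tolist(), train["Y"].tolist())
```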

We saved our model in joblib format because joblib is faster at saving/loading large NumPy arrays, and it allows easy retrieval from Google Drive.
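The save/reload step looks like this with joblib:

```python
from joblib import dump, load

dump(model, "morgan_binary_classifier.joblib")   # persist to disk (or Drive)
model = load("morgan_binary_classifier.joblib")  # restore later
```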

Model Prediction

We then proceeded to model prediction using our trained model. We make predictions for both the test and validation sets using the predict_proba method. The outcome is continuous probability values, which we have to convert to discrete values based on a probability threshold. The default threshold is 0.5, but since this is an imbalanced-class problem, that default may not work properly. We used ROC curves to find the optimal threshold for the classifier, then converted our predicted values to discrete values using the optimum threshold.
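One common way to pick such an optimum threshold is Youden's J statistic, the point on the validation ROC curve that maximizes tpr - fpr; whether the notebook used exactly this criterion is an assumption, as is the sklearn-style (n, 2) shape of the predict_proba output:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Continuous probabilities for the positive (active) class.
Y_predict_val = model.predict_proba(valid["Drug"].tolist())[:, 1]

# Youden's J: the threshold where tpr - fpr is largest.
fpr, tpr, thresholds = roc_curve(valid["Y"], Y_predict_val)
thrsh_score = thresholds[np.argmax(tpr - fpr)]

# convert our predicted values for validation to discrete values using the optimum threshold
Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)
```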

Evaluation

Here we test the results of our predictions. We use data the model was not trained on (excluding the train set) to evaluate its performance. We imported the necessary functions for evaluating classification models from scikit-learn, otherwise known as sklearn.

There are four possible outcomes when running a classification model: true positives, true negatives, false positives, and false negatives.

This is best shown using a confusion matrix. Below is the confusion matrix for the validation set. [Confusion matrix: validation set.]

The AUC is 0.7219512195121951. [ROC curve: validation set.]

The F1-score, which is also a measure of a model's accuracy on a dataset, is 0.5454545454545455 for the validation set, while the precision and recall are 0.6923076923076923 and 0.45, respectively.

Below is the confusion matrix for the test set. [Confusion matrix: test set.]

The AUC for the test set is 0.753809231456159. [ROC curve: test set.]

For the test set, the F1-score is 0.6176470588235294, while the precision and recall are 0.7777777777777778 and 0.5121951219512195, respectively.
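A sketch of how these metrics can be computed with scikit-learn, using the discretized validation predictions from above; note that the AUC is computed from the continuous scores, not the thresholded labels:

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

print(confusion_matrix(valid["Y"], Y_disc_predict_val))
print("AUC:      ", roc_auc_score(valid["Y"], Y_predict_val))
print("F1:       ", f1_score(valid["Y"], Y_disc_predict_val))
print("Precision:", precision_score(valid["Y"], Y_disc_predict_val))
print("Recall:   ", recall_score(valid["Y"], Y_disc_predict_val))
```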

Recommendation

I think the model performance can be improved by resampling the data. We can oversample the minority class with replacement; this technique is called oversampling, but it can lead to over-fitting. We can undersample the majority class by removing some of its records; this technique is called undersampling, but it can lead to information loss. So I suggest we combine both undersampling and oversampling techniques. After resampling, we get a more balanced dataset across the majority and minority classes, and when both classes have a similar number of records, we can expect the classifier to give them equal importance and thereby improve the predictions.
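A sketch of the suggested combined resampling, using the imbalanced-learn package (not part of the original notebook) on the fingerprint features; the sampling ratios are illustrative assumptions:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# First oversample actives until they reach 30% of the inactives, then
# undersample inactives until actives are 60% of the inactives.
resampler = Pipeline(steps=[
    ("over", RandomOverSampler(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.6, random_state=42)),
])
X_res, y_res = resampler.fit_resample(X_train, y_train)
```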

Yayeks commented 1 year ago

The link to the work done (Colab notebook) is Yayeks Outreachy Contribution

Yayeks commented 1 year ago

> Hi @Yayeks
>
> Can you explain why you are using this:
>
>     # convert our predicted values for validation to discrete values using the optimum threshold
>     Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)
>
> As input for the fpr and tpr calculation?

That is so because I converted the continuous values to discrete ones (0s and 1s only) and used them as the prediction part of the roc_curve input.

Chigoziee commented 1 year ago

Great job @Yayeks, very good work!!! Why don't you try subsampling your dataset? It will help your results a lot. Your pie chart shows that the active labels amount to only 3.7% of the entire dataset; subsampling will help improve your model's ability to predict active labels.

Yayeks commented 1 year ago

Thanks @Chigoziee, that is a good idea, but wouldn't that result in a loss of information from the inactives, seeing as I would be reducing them a lot?

Ng-ethe commented 1 year ago

@Yayeks, did you fit your data into the MorganBinaryClassifier or the RandomForestClassifier? I see in the quoted text that you used the RandomForestClassifier.

Yayeks commented 1 year ago

@Ng-ethe The data is fitted into the MorganBinaryClassifier, which in turn uses the RandomForestClassifier through AutoML.

Chigoziee commented 1 year ago

It wouldn't necessarily reduce how well your model learns to predict inactives; it will just make it less biased in its predictions, in the sense that it will make better predictions for the active class. Nevertheless, your results are already very good.

Yayeks commented 1 year ago

Okay @Chigoziee I will try that out.

GemmaTuron commented 1 year ago

Hello @Yayeks !

Thanks for your contributions to Ersilia during the Outreachy application period! We hope you have learnt and enjoyed as much as the Ersilia team did. I will close this issue, as it was solely part of the application period.