ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Tox21 NR-AR-LBD #433

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Model Title

Tox21 NR-AR-LBD

Publication

Hello @Yayeks!

As part of your Outreachy contribution, we have assigned you the dataset "Tox21 NR-AR-LBD" from the Therapeutics Data Commons to build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on your progress. We will value not only building the model but also interpreting its results.

Code

No response

Yayeks commented 1 year ago

Hello @GemmaTuron, which metric should be given more relevance, precision or recall? For the ROC curve, which is better: roc_curve or RocCurveDisplay?

Please check out my progress (https://colab.research.google.com/drive/1nO8MsCvqB8zzdnunWs6At-IfpvqS1Mud?usp=sharing).

GemmaTuron commented 1 year ago

Hi @Yayeks !

Good start, some suggestions:

About your questions:

Yayeks commented 1 year ago

Okay, thank you for your suggestions @GemmaTuron, I will implement them.

Yayeks commented 1 year ago

Hello @GemmaTuron, I have taken note of the suggestions you mentioned earlier and put some of them to use.

So the idea is to build a binary classification ML model using MorganBinaryClassifier on the dataset "Tox21 NR-AR-LBD". After analyzing the data, it was discovered that we have an imbalanced dataset, so we are dealing with an imbalanced classification problem. Imbalanced problems are most often difficult to model. More information is in the attached Colab notebook.

Yayeks commented 1 year ago

@GemmaTuron, the above link shows my progress so far; your suggestions/corrections are highly welcome. Thank you.

GemmaTuron commented 1 year ago

Hi @Yayeks !

Good job! To conclude the task, could you make a summary of your findings here for everyone?

Yayeks commented 1 year ago

Problem

In this issue, we focus on building a predictive classification ML model using the Therapeutics Data Commons (TDC) datasets. I was assigned the Tox21 dataset, which contains qualitative toxicity measurements for 7,831 compounds.

Simply put, we are going to predict toxicity outcomes of drug treatments in humans. Specifically, the model predicts whether a drug will activate the nuclear receptor androgen receptor ligand-binding domain (NR-AR-LBD), which can enter the nucleus of the cell and bind DNA, activating or inhibiting genes in ways that can lead to toxicity.

We take the following steps in order to achieve our aim:

  1. Load our data
  2. Analyse our data
  3. Visualization
  4. Preparation of Data
  5. Train the model
  6. Model Prediction
  7. Evaluation of Model

Load your data

Our dataset Tox21 NR-AR-LBD was obtained from the TDC website. We start by installing TDC and all the packages we might need. Since the Tox21 dataset consists of 12 different targets, we first load the whole Tox21 dataset and then select our specific assay (NR-AR-LBD). Our dataset consists of 6,758 drugs.

We then split our dataset using the TDC get_split method, which produced the train, validation, and test sets.
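A minimal sketch of this loading step, assuming the standard PyTDC API; the variable names are mine:

```python
# Load the Tox21 NR-AR-LBD assay from the Therapeutics Data Commons.
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import Tox

# Tox21 bundles 12 assay labels; confirm ours is among them.
labels = retrieve_label_name_list("Tox21")        # includes 'NR-AR-LBD'
data = Tox(name="Tox21", label_name="NR-AR-LBD")

# get_split returns a dict of pandas DataFrames: 'train', 'valid', 'test',
# each with columns Drug_ID, Drug (SMILES), and Y (binary label).
split = data.get_split()
train, valid, test = split["train"], split["valid"], split["test"]
print(len(train) + len(valid) + len(test))        # ~6,758 drugs in total
```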

Analyse your data

The next step is to get an understanding of our data. In our Tox21 NR-AR-LBD dataset, we have two features, the Drug (SMILES) and the Drug ID, and one target variable, which specifies whether the drug is inactive/non-toxic (indicated by a 0) or active/toxic (indicated by a 1).

Counting the labels, we can see that the inactives far outnumber the actives: the dataset is unevenly distributed. Such a dataset is called an imbalanced dataset, and it leads to an imbalanced classification problem, which is hard to model ☹ because the class distribution is skewed.
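A quick sketch of this check, assuming the TDC split from above with the binary label in the default 'Y' column:

```python
# Count inactives (0) versus actives (1) in the training set.
counts = train["Y"].value_counts().sort_index()   # index 0 first, then 1
print(counts)
print(f"actives: {100 * counts.get(1, 0) / len(train):.1f}%")
```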

Visualization

It is often said that the human brain processes visual content far faster than text, so a picture can be worth a great many words. We therefore visualize the data to get a better understanding of the numbers, using the matplotlib package to plot the outcomes. [Pie/bar chart of the class distribution.] We can see how minuscule the share of actives is in the larger scheme of things.
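A sketch of the pie chart with matplotlib, reusing the label counts from above:

```python
import matplotlib.pyplot as plt

# Visualize the class imbalance as a pie chart.
plt.pie(counts.values, labels=["inactive (0)", "active (1)"],
        autopct="%1.1f%%", colors=["tab:blue", "tab:red"])
plt.title("Tox21 NR-AR-LBD class distribution")
plt.show()
```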

The RDKit package is used to visualize and get an idea of what an active/inactive molecule looks like; it is a useful tool for drawing chemical molecules. [Grid image of example molecules.]
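A sketch of the drawing step, assuming the SMILES strings live in TDC's 'Drug' column; selecting four actives is illustrative:

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Parse a few active molecules and draw them in a grid.
active_smiles = train[train["Y"] == 1]["Drug"].head(4)
mols = [Chem.MolFromSmiles(smi) for smi in active_smiles]
Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(200, 200))
```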

GemmaTuron commented 1 year ago

Hi @Yayeks

Can you explain why you are using this:

# convert our predicted values for validation to discrete values using the optimum threshold
Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)

As input for the fpr and tpr calculation?

Yayeks commented 1 year ago

Preparation of Data

We split our data into a feature list and a target list. The lazy-qsar package is used because it automatically converts molecules to Morgan fingerprints. The Morgan fingerprint is the most popular molecular fingerprint, and molecular fingerprints are essential cheminformatics tools for virtual screening and mapping chemical space.
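For illustration, this is roughly the featurization that lazy-qsar automates, written out with plain RDKit; radius 2 and 2048 bits are common defaults, not necessarily the package's exact settings:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Encode a SMILES string as a binary Morgan (circular) fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X_train = np.stack([morgan_fp(smi) for smi in train["Drug"]])
y_train = train["Y"].values
```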

Train our ML model

This is the phase of the data science lifecycle where practitioners search for the combination of model weights and biases that minimizes a loss function over the training data. The purpose of model training is to build the best mathematical representation of the relationship between the data features and a target label.

We installed the lazy-qsar package and imported the MorganBinaryClassifier to use for classification. Note: we used a classification technique because the target variable is discrete.

The performance of the model determines the quality of the applications built on top of it. We use the train set to train the model, by instantiating the model and fitting the features and target variable to it. The AutoML search keeps iterating until it finds the best fit; in our case the resulting model is RandomForestClassifier(criterion='entropy', max_features=0.0635427281524705, max_leaf_nodes=15, n_estimators=10, n_jobs=-1).
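A sketch of the training call; the import path and constructor arguments are assumptions, as lazy-qsar's exact API may differ:

```python
# lazy-qsar works directly on SMILES: it featurizes with Morgan fingerprints
# and runs an AutoML search internally (here yielding a random forest).
from lazyqsar import MorganBinaryClassifier

model = MorganBinaryClassifier()
model.fit(train["Drug"].tolist(), train["Y"].tolist())
```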

We saved our model in joblib format because joblib is faster at saving/loading large NumPy arrays, and it allows easy retrieval from Google Drive.
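The save/reload step looks like this with joblib:

```python
from joblib import dump, load

dump(model, "morgan_binary_classifier.joblib")   # persist to disk (or Drive)
model = load("morgan_binary_classifier.joblib")  # restore later
```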

Model Prediction

We then proceeded to model prediction using our trained model. We make predictions for both the test and validation sets using the predict_proba method. The outcome is continuous probability values, which we have to convert to discrete values based on a probability threshold. The default threshold is 0.5, but since this is an imbalanced-class problem, that default may not work properly. We used ROC curves to find the optimal threshold for the classifier, then converted our predicted values to discrete values using the optimum threshold.
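One common way to pick such an optimum threshold is Youden's J statistic, the point on the validation ROC curve that maximizes tpr - fpr; whether the notebook used exactly this criterion is an assumption, as is the sklearn-style (n, 2) shape of the predict_proba output:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Continuous probabilities for the positive (active) class.
Y_predict_val = model.predict_proba(valid["Drug"].tolist())[:, 1]

# Youden's J: the threshold where tpr - fpr is largest.
fpr, tpr, thresholds = roc_curve(valid["Y"], Y_predict_val)
thrsh_score = thresholds[np.argmax(tpr - fpr)]

# convert our predicted values for validation to discrete values using the optimum threshold
Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)
```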

Evaluation

Here we test the results of our predictions. We use data the model was not trained on (excluding the train set) to evaluate its performance. We imported the necessary functions for evaluating classification models from scikit-learn, otherwise known as sklearn.

There are four possible outcomes when running a classification model: true positives, true negatives, false positives, and false negatives.

This is best shown using a confusion matrix. Below is the confusion matrix for the validation set. [Confusion matrix: validation set.]

The AUC is 0.7219512195121951. [ROC curve: validation set.]

The F1-score, which is also a measure of a model's accuracy on a dataset, is 0.5454545454545455 for the validation set, while the precision and recall are 0.6923076923076923 and 0.45, respectively.

Below is the confusion matrix for the test set. [Confusion matrix: test set.]

The AUC for the test set is 0.753809231456159. [ROC curve: test set.]

For the test set, the F1-score is 0.6176470588235294, while the precision and recall are 0.7777777777777778 and 0.5121951219512195, respectively.
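A sketch of how these metrics can be computed with scikit-learn, using the discretized validation predictions from above; note that the AUC is computed from the continuous scores, not the thresholded labels:

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

print(confusion_matrix(valid["Y"], Y_disc_predict_val))
print("AUC:      ", roc_auc_score(valid["Y"], Y_predict_val))
print("F1:       ", f1_score(valid["Y"], Y_disc_predict_val))
print("Precision:", precision_score(valid["Y"], Y_disc_predict_val))
print("Recall:   ", recall_score(valid["Y"], Y_disc_predict_val))
```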

Recommendation

I think the model performance can be improved by resampling the data. We can oversample the minority class with replacement; this technique is called oversampling, but it can lead to over-fitting. We can undersample the majority class by removing some of its records; this technique is called undersampling, but it can lead to information loss. So I suggest we combine both undersampling and oversampling techniques. After resampling, we get a more balanced dataset across the majority and minority classes, and when both classes have a similar number of records, we can expect the classifier to give them equal importance and thereby improve the predictions.
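A sketch of the suggested combined resampling, using the imbalanced-learn package (not part of the original notebook) on the fingerprint features; the sampling ratios are illustrative assumptions:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# First oversample actives until they reach 30% of the inactives, then
# undersample inactives until actives are 60% of the inactives.
resampler = Pipeline(steps=[
    ("over", RandomOverSampler(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.6, random_state=42)),
])
X_res, y_res = resampler.fit_resample(X_train, y_train)
```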

Yayeks commented 1 year ago

The link to the work done (Colab notebook) is Yayeks Outreachy Contribution

Yayeks commented 1 year ago

> Hi @Yayeks
>
> Can you explain why you are using this:
>
>     # convert our predicted values for validation to discrete values using the optimum threshold
>     Y_disc_predict_val = np.where(Y_predict_val > thrsh_score, 1, 0)
>
> As input for the fpr and tpr calculation?

That is so because I converted the continuous values to discrete ones (0s and 1s only) and used them as the prediction part of the roc_curve input.

Chigoziee commented 1 year ago

Great job @Yayeks, very good work!!! Why don't you try subsampling your dataset? It will help your results a lot. Your pie chart shows that the active labels amount to only 3.7% of the entire dataset; subsampling will help improve your model's ability to predict active labels.

Yayeks commented 1 year ago

Thanks @Chigoziee, that is a good idea, but wouldn't that result in a loss of information from the inactives, seeing as I would be reducing them a lot?

Ng-ethe commented 1 year ago

@Yayeks, did you fit your data into the MorganBinaryClassifier or the RandomForestClassifier? I see in the quoted text that you used the RandomForestClassifier.

Yayeks commented 1 year ago

@Ng-ethe The data is fitted into the MorganBinaryClassifier, which in turn uses the RandomForestClassifier through AutoML.

Chigoziee commented 1 year ago

It wouldn't necessarily reduce how well your model learns to predict inactives; it will just make it less biased in its predictions, in the sense that it will make better predictions for the active class. Nevertheless, your results are already very good.

Yayeks commented 1 year ago

Okay @Chigoziee I will try that out.

GemmaTuron commented 1 year ago

Hello @Yayeks !

Thanks for your contributions to Ersilia during the Outreachy application period! We hope you have learnt and enjoyed as much as the Ersilia team did. I will close this issue, as it was solely part of the application period.