ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Tox21 - NR-ER-LBD #424

Closed GemmaTuron closed 2 years ago

GemmaTuron commented 2 years ago

Model Title

Tox21 - NR-ER-LBD (TDC dataset)

Publication

Hello @Malikbadmus!

As part of your Outreachy contribution, we have assigned you the dataset "Tox21 NR ER LBD" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

Malikbadmus commented 2 years ago

Thank You @GemmaTuron.

Malikbadmus commented 2 years ago

I'll be outlining the steps taken to create a predictive model based on the dataset provided.

The steps can be summarized as follows:

  1. Data Upload
  2. Data Analysis and Visualization
  3. Data Pre-processing
  4. Model Training
  5. Model testing and Validation
  6. Model Evaluation
Malikbadmus commented 2 years ago

Data Upload

Tox21 (Toxicology in the 21st Century) is a collaboration between several US federal agencies to test substances that adversely affect human health; it contains qualitative toxicity measurements for 7,831 compounds. I'll be working on NR-ER-LBD, which belongs to the nuclear receptor signalling panel and is one of the 12 assays available in the Tox21 dataset.

The first thing I did in this step was to install the TDC (Therapeutics Data Commons) package. I then imported Tox21 and retrieved the specific assay I want to work with.

I then split the dataset into three parts: one for training, one for validating, and one for testing the model, after which the three datasets were saved to my Google Drive.
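For reference, a minimal sketch of this step, assuming the standard TDC interface (the split fractions shown are TDC's defaults, and the file paths are illustrative assumptions):

```python
# Install the Therapeutics Data Commons package first (in Colab: !pip install PyTDC)
from tdc.single_pred import Tox

# Retrieve the NR-ER-LBD assay from the Tox21 dataset
data = Tox(name="Tox21", label_name="NR-ER-LBD")

# Split into train / validation / test sets
split = data.get_split()
train, valid, test = split["train"], split["valid"], split["test"]

# Save each split for later use
train.to_csv("tox21_nr_er_lbd_train.csv", index=False)
valid.to_csv("tox21_nr_er_lbd_valid.csv", index=False)
test.to_csv("tox21_nr_er_lbd_test.csv", index=False)
```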

Malikbadmus commented 2 years ago

Data Analysis and Visualization

When building an ML model, the dataset is split into parts: one part is exposed to the model and used to teach ("train") it, while the remaining parts are kept unseen and used to validate and test the model's performance.

The "Tox21 NR ER LBD" datasets have three variables.

  1. Drug-ID
  2. Drug
  3. Y(Bioactivity)

Datasets from TDC(Therapeutics Data Commons) have already undergone the Data splitting process. When "Tox21 NR ER LBD" was downloaded from their database they were composed of the following.

Matplotlib is a Python package that helps us understand our data through visualization. To see the dataset as more than abstract figures, I imported Matplotlib and created a pie chart showing the frequency of the different bioactivity states.
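A minimal sketch of how such a pie chart could be produced, reusing the `train` DataFrame from the earlier sketch (column names follow the TDC convention; everything else is an assumption):

```python
import matplotlib.pyplot as plt

# Count how many molecules fall into each bioactivity class (0.0 = inactive, 1.0 = active)
counts = train["Y"].value_counts()
labels = ["Inactive (0)" if c == 0 else "Active (1)" for c in counts.index]

# Draw a pie chart of the class distribution in the training set
plt.figure(figsize=(5, 5))
plt.pie(counts, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Bioactivity distribution in the Tox21 NR-ER-LBD training set")
plt.show()
```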

From the pie chart we can see the distribution of molecules between the active and inactive states: in the "Tox21 NR ER LBD" training set, inactive molecules are by far the most common, while active molecules are comparatively rare.

(Image: pie chart of the bioactivity distribution in the training set)

Just as Matplotlib lets us project our data in visual form, the RDKit package gives us a glimpse of what the molecules' chemical structures look like.

I used RDKit to draw two sets of molecules: in the first set I drew 3 toxic molecules that are harmful to humans, and in the second set I drew 3 inactive, non-toxic molecules.
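A minimal sketch of drawing such molecules with RDKit (the way the rows are selected is illustrative, not the exact code used):

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Take the SMILES of 3 active (toxic) and 3 inactive (non-toxic) molecules from the training set
toxic_smiles = train[train["Y"] == 1]["Drug"].head(3).tolist()
nontoxic_smiles = train[train["Y"] == 0]["Drug"].head(3).tolist()

# Convert the SMILES strings to RDKit molecule objects and draw them in a grid
toxic_mols = [Chem.MolFromSmiles(s) for s in toxic_smiles]
nontoxic_mols = [Chem.MolFromSmiles(s) for s in nontoxic_smiles]

Draw.MolsToGridImage(toxic_mols, molsPerRow=3, subImgSize=(250, 250))
Draw.MolsToGridImage(nontoxic_mols, molsPerRow=3, subImgSize=(250, 250))
```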

Toxic molecules (image of the three drawn structures)

Non-toxic molecules (image of the three drawn structures)

GemmaTuron commented 2 years ago

Great @Malikbadmus,

Please share the link to the notebook when you are ready!

Malikbadmus commented 2 years ago

> Great @Malikbadmus,
>
> Please share the link to the notebook when you are ready!

Many thanks @GemmaTuron. You can find the link below.

https://colab.research.google.com/drive/1ZAKGb2SYzwIsILXDZVFmRCgzNGu3eZSS?usp=sharing

Malikbadmus commented 2 years ago

Data Preprocessing

The lazy-qsar package, as its name implies, simplifies most of the steps we would otherwise have gone through in our modelling journey.

The goal is to predict whether a molecule is toxic or harmless to humans.

The column named "Y" in our dataset is the target variable (i.e., the variable we want to predict). Since it takes only two possible values, 0.0 and 1.0, the resulting prediction problem is a binary classification problem. The remaining feature of the dataset (Drug, the SMILES string) serves as the input variable for the model.
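A minimal sketch of separating the input and target variables and confirming that the target is binary (variable names follow the earlier sketches and are otherwise assumptions):

```python
# Separate the input variable (SMILES strings) and the target variable
X_train = train["Drug"].tolist()
y_train = train["Y"].tolist()

# Confirm that the target takes only two values (0.0 = inactive, 1.0 = active)
print(train["Y"].value_counts())
```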


Model Training

A model only works with numeric data, so to use the input data (Drug), which is a SMILES string describing a molecule, we need to convert it into a numeric format. Since we are working with the lazy-qsar package, it automatically converts our SMILES strings into Morgan fingerprints.
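For illustration, this is roughly what that featurization looks like with RDKit; lazy-qsar handles this step internally, so the exact parameters it uses (radius, number of bits) are assumptions here:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# Convert one SMILES string into a Morgan fingerprint (a fixed-length binary vector)
mol = Chem.MolFromSmiles("CCO")  # ethanol, as a toy example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)

# The fingerprint is now a numeric vector a model can work with
print(np.array(fp).shape)  # (2048,)
```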

We start by cloning the lazy-qsar repository and installing the necessary packages. The process of training a model is called fitting, and we do this by passing (fitting) the input variable (the drug molecules) and the target variable (the variable we want to predict) into the model using the MorganBinaryClassifier class.

The fit method involves two elements:

  1. A learning algorithm.
  2. The model state.

The learning algorithm processes the input data along with the target variable and sets the model state. This fitted state is what will later be used to predict on the validation and test data.

Finally, we import the joblib library and save our fitted model in joblib format.
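A minimal sketch of the training step, under the assumption that lazy-qsar exposes a MorganBinaryClassifier with a scikit-learn-style fit interface (the import path and constructor arguments may differ from the package's actual layout):

```python
import joblib
# The import path below is an assumption; check the lazy-qsar repository for the exact module layout
from lazyqsar.binary.binary import MorganBinaryClassifier

# Fit the classifier on the SMILES strings and their binary labels;
# lazy-qsar converts the SMILES into Morgan fingerprints internally
model = MorganBinaryClassifier()
model.fit(X_train, y_train)

# Save the fitted model so it can be reloaded later without retraining
joblib.dump(model, "nr_er_lbd_morgan.joblib")
```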

Model Testing and Validation

The validation and test datasets were set aside to obtain predictions from our learned model; they also serve as unbiased data, since they were not used in the fitting process and therefore have not been "memorized" by the model.

From this new data, we separate the input data (molecules) from the target we want to predict (bioactivity), pass the molecules into the model to obtain predictions, and then compare them with the true values.

The process of comparing and scoring these predictions is called model evaluation, and there are several tools, both visual and numeric, available to grade the performance of a model.
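A minimal sketch of this step, reusing the names from the earlier sketches (loading the saved model and predicting on the validation set; variable names are assumptions):

```python
import joblib

# Reload the fitted model and separate the validation inputs and labels
model = joblib.load("nr_er_lbd_morgan.joblib")
X_valid = valid["Drug"].tolist()
y_valid = valid["Y"].tolist()

# Obtain the model's predictions on data it has never seen
y_pred = model.predict(X_valid)

# Compare a few predictions with the true values
print(list(zip(y_valid[:10], y_pred[:10])))
```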

Malikbadmus commented 2 years ago

Model Evaluation

The Python libraries used to evaluate our model are listed below.

  1. scikit-learn
  2. Matplotlib
  3. pandas
  4. NumPy

The model was evaluated using the following performance metrics:

  1. Confusion matrix (contingency matrix)
  2. Precision and recall
  3. ROC curve
  4. AUROC value

Validation set performance

We use the predict() function to get the predicted class of each molecule, and the binary result (0 or 1) can be used to plot the confusion matrix and compute the precision and recall values. To get the AUROC value and plot the ROC curve, we use the predict_proba() function, which gives us the probability of each prediction being toxic or harmless.
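A minimal sketch of these calculations with scikit-learn, reusing the names from the sketches above (variable names are assumptions):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

# Class predictions (0/1) for the confusion matrix, precision and recall
y_pred = model.predict(X_valid)
print(confusion_matrix(y_valid, y_pred))
print("Precision:", precision_score(y_valid, y_pred))
print("Recall:", recall_score(y_valid, y_pred))

# Predicted probabilities of the positive (toxic) class for the AUROC value,
# assuming predict_proba follows the scikit-learn convention of one column per class
y_proba = model.predict_proba(X_valid)[:, 1]
print("AUROC:", roc_auc_score(y_valid, y_proba))
```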

With the prediction probability threshold set at 0.5, our model's predicted values are as follows:

(Image: confusion matrix for the validation set)

From the above confusion matrix, which compares our predictions to the actual values of the validation set, we can read the performance of our model as:

  1. TN (true negatives) = 659 molecules were classified by the model as non-toxic and are actually non-toxic.
  2. FP (false positives) = 0 molecules were classified by the model as toxic when they are actually harmless, i.e. the predictions contain no Type I errors.
  3. FN (false negatives) = 36 Type II errors, i.e. 36 toxic molecules were wrongly classified as harmless.
  4. TP (true positives) = only one molecule was classified as toxic and is actually toxic.

Total observations = TN + FP + FN + TP = 659 + 0 + 36 + 1 = 696.

You will recall that the validation set contains exactly 696 molecules.

Precision Value

Of the molecules the model predicted as active (toxic), how many are actually active? The answer to this question is the precision, and it is calculated as Precision = TP / (TP + FP).

Recall Value

Recall measures how well the model finds the molecules that are actually active (toxic), and it is calculated as Recall = TP / (TP + FN).

AUC ROC Curve

The ROC (Receiver Operating Characteristic) curve tells us how well our model distinguishes between chemical substances that are toxic and those that are harmless to humans.

To draw it, we use the predict_proba function to get the probability values of our predictions.

At a threshold of 0.5, any probability above the threshold is predicted as toxic and anything below it as non-toxic; the ROC curve is obtained by sweeping this threshold over its whole range.

The ROC curve plots the proportion of truly toxic substances that are predicted as toxic (the true positive rate, TPR = TP / (TP + FN)) against the proportion of truly harmless substances that are predicted as toxic (the false positive rate, FPR = FP / (FP + TN)).
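A minimal sketch of plotting the ROC curve with scikit-learn and Matplotlib (variable names follow the earlier sketches and are otherwise assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Compute the true and false positive rates across all probability thresholds
fpr, tpr, thresholds = roc_curve(y_valid, y_proba)

# Plot the ROC curve together with the diagonal of a random classifier
plt.plot(fpr, tpr, label=f"AUROC = {roc_auc_score(y_valid, y_proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```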

(Image: ROC curve for the validation set)

AUROC Value = 0.630

The AUROC (Area Under the ROC Curve) measures how well the model separates the two classes. A value close to 0.5 indicates the model is performing poorly and its predictions are almost random; a value closer to 1 indicates the model does a good job of separating the classes.

Our AUROC value of 0.630 is poor, and this could be because we have a highly imbalanced validation set.

Test set performance

The same step highlighted in the Validation sets was followed to obtain the same performance metric for the Test sets.

(Image: confusion matrix for the test set)

  1. TN (true negatives) = 1331 molecules classified by the model as harmless are actually harmless.
  2. FP (false positives) = 0 molecules were classified by the model as toxic when they are actually harmless, i.e. the predictions contain no Type I errors.
  3. FN (false negatives) = 56 Type II errors, i.e. 56 toxic molecules were wrongly classified as non-toxic.
  4. TP (true positives) = 4 molecules were classified as toxic by the model and are actually toxic.

Total observations = TN + FP + FN + TP = 1331 + 0 + 56 + 4 = 1391.

You will recall that the test set contains exactly 1391 molecules.

Precision Value

Recall Value

AUC ROC Curve

These follow the same reasoning as for the validation set.

(Image: ROC curve for the test set)

AUROC Value = 0.751

The AUROC value obtained on the test set is noticeably better than the one obtained on the validation set.

To understand the ROC curve better, consider the extreme thresholds. If we set the threshold to 0, every prediction is classified as toxic, so both the true positive rate and the false positive rate become 1; this corresponds to the upper-right corner of the ROC curve.

Similarly, if we set the threshold to 1, every prediction is classified as harmless, so both the true positive rate and the false positive rate become 0; this corresponds to the lower-left corner of the curve. The AUROC is the area under the curve traced out as the threshold sweeps between these two extremes.

Malikbadmus commented 2 years ago

@GemmaTuron , the Updated link.

https://colab.research.google.com/drive/16fRFdsajWY0xuzsU2ZuEk_JEueGfhqid?usp=sharing

GemmaTuron commented 2 years ago

Hi @Malikbadmus

Good start! Can you tell me a bit more about what you think of your model's performance? If I were a researcher doing experiments in the lab and you handed me these results, would you warn me about anything?

Also, a tip: think about how long you are letting the model train for.

Malikbadmus commented 2 years ago

@GemmaTuron , I got the analysis of my model evaluation wrong; it has been fixed now, and the precision and recall values have been recalculated correctly.

Malikbadmus commented 2 years ago

> Hi @Malikbadmus
>
> Good start! Can you tell me a bit more about what you think of your model's performance? If I were a researcher doing experiments in the lab and you handed me these results, would you warn me about anything?
>
> Also, a tip: think about how long you are letting the model train for.

Looking at the performance evaluation of the model, researchers can rest assured that when the model labels a drug as toxic, that drug is almost certainly toxic; however, the model is also very poor at finding the drugs that are toxic.

There is a high probability that a toxic drug predicted with this model will end up being classified as harmless, and that is a dangerous outcome: we do not want drugs labeled as harmless, but which are actually toxic, to be used by humans.

Also, could you please elaborate: does the time taken to train a model affect its predictive ability?

Malikbadmus commented 2 years ago

@GemmaTuron The AUROC value of 0.63 is low. I suspect this is because I trained my model on an imbalanced dataset (6,605 to 350), so its predictions lean heavily towards non-toxic; I also suspect this could be an overfitting problem.

The positive from this is that the model's precision in predicting toxic drugs correctly is 100%. Or does that simply mean that training models on imbalanced data tends to inflate the precision score like this?

I should probably also try other evaluation metrics like the F1 score, since our data is imbalanced and we need more attention on the drugs that are toxic (see the sketch below).
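A minimal sketch of computing the F1 score with scikit-learn, reusing the names from the earlier sketches (variable names are assumptions):

```python
from sklearn.metrics import f1_score

# The F1 score is the harmonic mean of precision and recall,
# which is more informative than accuracy on imbalanced data
print("F1:", f1_score(y_valid, y_pred))
```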

Malikbadmus commented 2 years ago

@GemmaTuron , I retrained my model on the same dataset and the results I'm getting are different. Can you please tell me why?

https://colab.research.google.com/drive/16fRFdsajWY0xuzsU2ZuEk_JEueGfhqid?usp=sharing#scrollTo=EQPNRJ6YBuKK

thormiwa commented 2 years ago

@Malikbadmus It is like that because your model makes slightly different predictions each time it is trained, even when it is trained on the same dataset each time.

Malikbadmus commented 2 years ago

@thormiwa, why does it do that, though?

The learning algorithm is the same, the dataset is the same, and no noise or unwanted data was introduced into the dataset the second time; I just want to understand the underlying reason.

Malikbadmus commented 2 years ago

@GemmaTuron, I used the predict function to bypass the for-loop I would otherwise have needed for the confusion matrix and the precision and recall values.

I then used predict_proba to set up my AUROC, i.e. I predicted my test data twice.

Would there be any drawback to this? Is it good practice? I analyzed the values obtained in both cases and they corresponded to each other.

thormiwa commented 2 years ago

> @thormiwa, why does it do that, though?
>
> The learning algorithm is the same, the dataset is the same, and no noise or unwanted data was introduced into the dataset the second time; I just want to understand the underlying reason.

@Malikbadmus The training involves randomness; this is called stochasticity in machine learning. When the learning algorithm is run again on the same data, it learns a slightly different model. In turn, the model may make slightly different predictions and sometimes even give a different AUROC value. I noticed the same thing when I trained my model as well.

ZakiaYahya commented 2 years ago

@Malikbadmus it's due to the fact that every time we train our model, it initializes with random weights, which is why the same model trained on the same data can produce different results.

GemmaTuron commented 2 years ago

Hi @Malikbadmus !

Very good job and tests trying to understand what happens. Indeed, the random weights are the cause of the slightly different results.

GemmaTuron commented 2 years ago

> @GemmaTuron, I used the predict function to bypass the for-loop I would otherwise have needed for the confusion matrix and the precision and recall values.
>
> I then used predict_proba to set up my AUROC, i.e. I predicted my test data twice.
>
> Would there be any drawback to this? Is it good practice? I analyzed the values obtained in both cases and they corresponded to each other.

Predicting twice with the same model and the same data shouldn't be an issue, but good practice is to save your initial predictions and stick to them for the analysis; for more complex setups, for example if you are chaining different models, this can become important.

GemmaTuron commented 2 years ago

@Malikbadmus I think you did a very good job already, and this issue is ready to be closed whenever you want. Focus the remaining time on writing your final application, and include this part of your contribution to Outreachy!

Malikbadmus commented 2 years ago

Oh @GemmaTuron , @ZakiaYahya and @thormiwa, many thanks for this! I'm reading up on random weight initialization.

I'm really learning a lot from this. Thank you guys!!

Malikbadmus commented 2 years ago

> @GemmaTuron, I used the predict function to bypass the for-loop I would otherwise have needed for the confusion matrix and the precision and recall values. I then used predict_proba to set up my AUROC, i.e. I predicted my test data twice. Would there be any drawback to this? Is it good practice? I analyzed the values obtained in both cases and they corresponded to each other.
>
> Predicting twice with the same model and the same data shouldn't be an issue, but good practice is to save your initial predictions and stick to them for the analysis; for more complex setups, for example if you are chaining different models, this can become important.

Oh, I get it. Thanks!

Malikbadmus commented 2 years ago

> @Malikbadmus I think you did a very good job already, and this issue is ready to be closed whenever you want. Focus the remaining time on writing your final application, and include this part of your contribution to Outreachy!

Many thanks @GemmaTuron!! This has truly been a very rich learning experience for me. Let me just add a final note on this, and then we can close this issue.

Malikbadmus commented 2 years ago

Conclusion

Training a good AI/ML model involves trying to achieve a balance between precision and recall; a good model should be able to classify your data accurately.

When you have an imbalanced dataset like the Tox21 NR-ER-LBD (TDC) dataset I worked on, one of the methods you can consider is lowering your decision threshold.

When I reduced my threshold to 0.3, I got a recall value of close to 46%, up from the 7% I got earlier, but it was a trade-off with precision, which went down from 100% to 67%.
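For illustration, a minimal sketch of applying a custom probability threshold instead of the default 0.5 (variable names follow the earlier sketches and are otherwise assumptions):

```python
from sklearn.metrics import precision_score, recall_score

# Classify as toxic whenever the predicted probability of toxicity exceeds 0.3
threshold = 0.3
y_pred_03 = (y_proba >= threshold).astype(int)

print("Precision:", precision_score(y_valid, y_pred_03))
print("Recall:", recall_score(y_valid, y_pred_03))
```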

And even though ML models cannot make perfect predictions, a model that identifies more toxic substances will pay for it in precision. We have to choose which is more acceptable: classifying a toxic drug as harmless, or classifying a harmless drug as toxic.

Another way to deal with an imbalanced dataset is to increase the number of examples of the underrepresented class in the training set (oversampling), so that the model has more minority-class examples to learn from; a minimal sketch of this idea is shown below, and my Colab link follows it.
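The sketch uses simple random oversampling with scikit-learn's resample utility (variable names are assumptions; dedicated tools such as imbalanced-learn offer more sophisticated options):

```python
import pandas as pd
from sklearn.utils import resample

# Split the training set into majority (inactive) and minority (active) classes
majority = train[train["Y"] == 0]
minority = train[train["Y"] == 1]

# Randomly duplicate minority examples until both classes are the same size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["Y"].value_counts())
```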

https://colab.research.google.com/drive/16fRFdsajWY0xuzsU2ZuEk_JEueGfhqid?usp=sharing#scrollTo=-_nPpAPNIKFL

paulinebanye commented 2 years ago

Very well written @Malikbadmus 👍 . Impressive analysis.

Malikbadmus commented 2 years ago

Many thanks @pauline-banye

Zainab-ik commented 2 years ago

> Conclusion
>
> Training a good AI/ML model involves trying to achieve a balance between precision and recall; a good model should be able to classify your data accurately.
>
> When you have an imbalanced dataset like the Tox21 NR-ER-LBD (TDC) dataset I worked on, one of the methods you can consider is lowering your decision threshold.
>
> When I reduced my threshold to 0.3, I got a recall value of close to 46%, up from the 7% I got earlier, but it was a trade-off with precision, which went down from 100% to 67%.
>
> And even though ML models cannot make perfect predictions, a model that identifies more toxic substances will pay for it in precision. We have to choose which is more acceptable: classifying a toxic drug as harmless, or classifying a harmless drug as toxic.
>
> Another way to deal with an imbalanced dataset is to increase the number of examples of the underrepresented class in the training set (oversampling), so that the model has more minority-class examples to learn from. My Colab link is attached below.
>
> https://colab.research.google.com/drive/16fRFdsajWY0xuzsU2ZuEk_JEueGfhqid?usp=sharing#scrollTo=-_nPpAPNIKFL

Thank you @Malikbadmus for the suggestions. Because of the imbalanced dataset, I also reduced the threshold of my model, to 0.2, and got better performance from it. I also tried undersampling as you mentioned, and the AUC value increased as well, but I kept the initial analysis of my model for submission.

Malikbadmus commented 2 years ago

I'm glad to be of help @Zainab-ik . @GemmaTuron , the issue can be closed now.

GemmaTuron commented 2 years ago

Good job @Malikbadmus