ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Tox21 NR-ER #442

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Model Title

Tox21 NR-ER (TDC Dataset)

Publication

Hello @carcablop!

As part of your Outreachy contribution, we have assigned you the dataset "Tox 21 NR-ER" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

carcablop commented 1 year ago

Hello @GemmaTuron, I am sharing the link to the Colab notebook with the development of my model: https://colab.research.google.com/drive/1pQWtLVjICg_2oDde74N5WvmfYCKvq7lT#scrollTo=QD8XeEAaYvOf

Below I explain, step by step, how I built and trained a binary classification model for the dataset I was assigned, followed by an analysis of the results and a conclusion.

STEP 1: GETTING MY TRAINING DATA

For this contribution, a binary ML classification model will be trained and built to predict the toxicity of chemical compounds or drugs in humans. The training data is the "Tox21 NR-ER" dataset from the Therapeutics Data Commons.

As described in the TDC database, Tox21 is a dataset containing toxicity measurements for compounds across 12 different assays. The assay I am going to work on is NR-ER, which stands for Nuclear Receptor - Estrogen Receptor (alpha).

Load my data:

The first step in loading the data is to install PyTDC to download the Tox21 NR-ER dataset. Since Tox21 contains several assays, the first thing I do is list all of them; I can see on screen that the NR-ER assay is at position 4 of the list, so I can then load the full dataset for that specific assay. To train an ML model we split the dataset into three subsets: train (to train the model), validation (to evaluate the model's performance during development), and test (to evaluate the performance of the final model). Once the three subsets are obtained, I use the head() function to inspect the kind of data each one contains. For example, the train set contains the columns Drug_ID, Drug, and Y.

The data we are going to use are the Drug column (the molecules as SMILES strings) and the "Y" column as the output (bioactivity); the output values are 0 and 1. Since we are building a binary classification model, the classes are 0 and 1, where 1 is active (toxic) and 0 is inactive (non-toxic).

Finally, each of these data sets is saved in a .csv file.
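For reference, a minimal sketch of what this loading step can look like with PyTDC (the .csv file names are my own choice):

```python
# Data-loading sketch (assumes PyTDC is installed: pip install PyTDC)
from tdc.single_pred import Tox
from tdc.utils import retrieve_label_name_list

# List all Tox21 assays and pick the NR-ER label
label_names = retrieve_label_name_list("Tox21")
print(label_names)  # NR-ER appears in this list

data = Tox(name="Tox21", label_name="NR-ER")
split = data.get_split()  # default split: train / valid / test

train_df, valid_df, test_df = split["train"], split["valid"], split["test"]
print(train_df.head())  # columns: Drug_ID, Drug (SMILES), Y (0/1)

# Save each subset to a .csv file (file names are illustrative)
train_df.to_csv("tox21_nr_er_train.csv", index=False)
valid_df.to_csv("tox21_nr_er_valid.csv", index=False)
test_df.to_csv("tox21_nr_er_test.csv", index=False)
```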

STEP 2: UNDERSTAND MY DATA: ANALYSIS AND GRAPHICAL REPRESENTATION

What are we trying to predict?

For the Tox21 NR-ER dataset, what we are trying to predict is the toxicity of the molecules, so I need to identify which molecules in each of my datasets are toxic. The toxicity information is stored in the "Y" column: a value of 0 means the molecule is not toxic to humans and a value of 1 means it is toxic. The predictions are made on the molecules themselves, which are stored in the column labeled "Drug".

Is this a classification or regression problem?

It is a classification problem, since the value we want to predict is binary: 0 (inactive) or 1 (active).

How many data points do we have in total and in each set (train/validation/test)? How many actives and inactives do we have?

Once the dataset has been split, I count the number of molecules in the train, validation, and test sets, and break each count down into actives and inactives.

For a clearer view of the data and more concise code, I build a new pandas DataFrame, so I can show these counts as a table and compare the sets:

(screenshot: table of counts per dataset)

Then I represent the data in a bar chart: (bar chart: actives vs. inactives per set)
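The table and chart above can be produced with something like the following sketch, assuming the .csv files and column names from the previous step:

```python
# Counting and plotting sketch (column names follow the TDC output)
import pandas as pd
import matplotlib.pyplot as plt

sets = {
    "train": pd.read_csv("tox21_nr_er_train.csv"),
    "valid": pd.read_csv("tox21_nr_er_valid.csv"),
    "test": pd.read_csv("tox21_nr_er_test.csv"),
}

# Summary table: total molecules, actives (Y=1) and inactives (Y=0) per subset
summary = pd.DataFrame({
    name: {
        "total": len(df),
        "active": int((df["Y"] == 1).sum()),
        "inactive": int((df["Y"] == 0).sum()),
    }
    for name, df in sets.items()
}).T
print(summary)

# Bar chart comparing actives vs. inactives in each subset
summary[["active", "inactive"]].plot(kind="bar")
plt.ylabel("Number of molecules")
plt.title("Tox21 NR-ER: class balance per subset")
plt.show()
```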

Here I can see that there is an imbalance between active and inactive molecules: the number of actives is very small compared to the inactives. In other words, we have an unbalanced dataset. It is therefore important that the model learns the active class well, since actives are scarce and may be hard for the model to learn. The compounds that could have a high level of toxicity are precisely the ones we should prioritize when validating the model.

carcablop commented 1 year ago

Draw the molecule

When analysing the results, the chemical structure of the molecules will also be taken into account, since it can give scientists an indication of which kinds of chemical structures tend to be toxic and which do not.

(screenshot)

Inactive molecule from the train set: (image)

Active molecule from the train set: (image)
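A small sketch, assuming RDKit is available in the Colab environment, of how one inactive and one active molecule from the train set could be drawn (picking the first molecule of each class is just illustrative):

```python
# Molecule-drawing sketch with RDKit
from rdkit import Chem
from rdkit.Chem import Draw
import pandas as pd

train_df = pd.read_csv("tox21_nr_er_train.csv")

# Pick one inactive (Y=0) and one active (Y=1) molecule
inactive_smiles = train_df[train_df["Y"] == 0]["Drug"].iloc[0]
active_smiles = train_df[train_df["Y"] == 1]["Drug"].iloc[0]

for label, smiles in [("inactive", inactive_smiles), ("active", active_smiles)]:
    mol = Chem.MolFromSmiles(smiles)
    img = Draw.MolToImage(mol, size=(300, 300))
    img.save(f"{label}_molecule.png")  # in a notebook, `img` can also be displayed directly
```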

carcablop commented 1 year ago

STEP 3: Train your ML Model:

Before this step, the SMILES of each dataset were extracted by accessing the column labeled "Drug" of each DataFrame, and the outputs Y (0 or 1) were extracted from the "Y" column. These values are stored in lists and used in the following steps.

For the creation and training of the model, we install the Lazy-QSAR package from Ersilia, which provides a Morgan fingerprint featurizer that converts the SMILES strings into numerical vectors (so they can be read by the computer). Once installed, the MorganBinaryClassifier class is imported; we instantiate the class to create the model and call the fit function for training, passing as input the SMILES (the "Drug" column of the train set) together with their outputs (the "Y" column of the train set). The estimated time for model training was 15 minutes. The model trained within that time and, finally, I saved it in joblib format.
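A minimal sketch of this training step; the import path is assumed from the lazy-qsar README, so please refer to the Colab notebook above for the exact calls:

```python
# Training sketch (assumes: pip install lazy-qsar; import path assumed from the README)
import pandas as pd
from joblib import dump
from lazyqsar.binary import MorganBinaryClassifier

train_df = pd.read_csv("tox21_nr_er_train.csv")
smiles_train = train_df["Drug"].tolist()
y_train = train_df["Y"].tolist()

# Morgan fingerprints are computed internally from the SMILES strings
model = MorganBinaryClassifier()
model.fit(smiles_train, y_train)

# Persist the trained model in joblib format (file name is illustrative)
dump(model, "tox21_nr_er_morgan.joblib")
```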

STEP 4: MODEL VALIDATION

Once the model is trained, it can be used to predict on the validation dataset. Here I use the predict_proba() function, so as to obtain more informative outputs when the model is evaluated. We pass the molecules of the validation set as input and obtain as output, for each molecule, the probability of being inactive and the probability of being active; that is, I do not directly obtain 0s and 1s.

(screenshot: predicted probabilities)

Once I get the predictions for my validation dataset, I can evaluate my model and see how good it is at classification.
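A sketch of this prediction step, assuming a scikit-learn-style predict_proba that returns one column per class, and reusing the file names from the previous sketches:

```python
# Validation-prediction sketch
import numpy as np
import pandas as pd
from joblib import load

# Load the model saved in the training sketch (file name assumed from that sketch)
model = load("tox21_nr_er_morgan.joblib")

valid_df = pd.read_csv("tox21_nr_er_valid.csv")
smiles_valid = valid_df["Drug"].tolist()
y_valid = valid_df["Y"].tolist()

# Assumed scikit-learn convention: columns are [P(inactive), P(active)]
proba = np.array(model.predict_proba(smiles_valid))
p_active = proba[:, 1]  # probability of the active (toxic) class
```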

carcablop commented 1 year ago

STEP 5: MODEL PERFORMANCE:

To evaluate the performance of the model and see whether it is good enough, we use the following performance metrics, all of which are available in the Python scikit-learn package:

  1. ROC curves
  2. Contingency tables: confusion matrix
  3. Precision and recall scores.

ROC curve: we use it to see how well the model is working. The ROC curve is an evaluation metric for binary classification models; it plots two parameters, the TPR (true positive rate) against the FPR (false positive rate), at different classification thresholds.

AUC is the area under the curve; it measures the ability of the classifier to distinguish between positives and negatives. AUC ranges from 0 to 1, where 0 indicates that the predictions are 100% incorrect and 1 indicates that they are 100% correct.

We use sklearn to easily calculate the ROC AUC: (screenshot)

Now I draw the ROC curve:

(ROC curve plot)
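The AUC value and the curve above can be computed with something like the following sketch, reusing y_valid and p_active from the validation step:

```python
# ROC AUC and ROC curve sketch for the validation set
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y_valid, p_active)
print(f"ROC AUC (validation): {auc:.3f}")

fpr, tpr, thresholds = roc_curve(y_valid, p_active)
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```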

The AUC came out at 0.711, which is well above the 0.5 of a random classifier (an AUC of 1 would indicate 100% correct predictions). Also, taking into account that the validation dataset is very unbalanced, this AUC value could indicate that the model is classifying reasonably well. However, for cases like this, in which inactive molecules greatly outnumber active ones, it is important to consider other metrics as well; the confusion matrix, in particular, can give us more data with which to evaluate the performance.

carcablop commented 1 year ago

2. CONFUSION MATRIX: Another way to measure the performance of the model is by using a confusion matrix.

The matrix compares the actual target values with those predicted by the machine learning model. This lets us see what kinds of hits and misses the model is making; in other words, it gives us a picture of how well the model is working and what kind of errors it makes. The confusion matrix is a 2x2 matrix with 4 values: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

To evaluate my model on the Tox21 NR-ER validation set, we will use the scikit-learn Python library to obtain the confusion matrix.

To the confusion matrix we pass as inputs the true labels of the validation set and the predicted labels. Since predict_proba returns probabilities rather than 0/1 labels, the probabilities first have to be converted into binary predictions (using the 0.5 threshold).

Below you can see the error raised when the probabilities are passed directly, and why it was necessary to implement that conversion.

(screenshot: error message)

Once the error is resolved: (screenshot)

It can be seen that we now have the list of zeros and ones with which we obtain the following confusion matrix:

(confusion matrix output)
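A sketch of how this matrix can be obtained with scikit-learn, reusing y_valid and p_active from the previous steps:

```python
# Confusion-matrix sketch for the validation set
import numpy as np
from sklearn.metrics import confusion_matrix

# Convert probabilities into binary predictions with the default 0.5 threshold
y_pred = (np.array(p_active) >= 0.5).astype(int)

# scikit-learn convention: rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```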

As a result of the confusion matrix we obtain the four values mentioned above: the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

For this dataset, what we are interested in are the toxic molecules, and in particular we want as few false negatives as possible: predicting a molecule as non-toxic when it is in fact toxic can be very dangerous, because we could (for example) be bringing to the pharmaceutical market drugs that threaten human health. This is where the confusion matrix is very important, since it warns us about these false negatives. Given the seriousness of the problem, I think it is essential to know how many positive cases are being correctly predicted, so that nobody is poisoned by a drug, and in this way verify the reliability of the model.

Therefore, we are going to calculate the Precision and Recall metrics:

Precision: it tells us how reliable the positive predictions are, i.e. how many of the predicted positives are actually positive. The formula is: TP / (TP + FP) = 23 / (23 + 12) ≈ 0.65 -> 65% precision.

That is, about 65% of the molecules predicted as active are actually active. We could say this precision value is not so bad, given that we have unbalanced data. It is still important to obtain the recall metric, since we need to see how many of the relevant (active) cases are actually being found.

Recall: it is the true positive rate, the proportion of positive cases that are correctly identified by the model. It measures the model's ability to detect the active, i.e. toxic, molecules.

The formula is: TP / (TP + FN) = 23 / (23 + 65) ≈ 0.26 -> 26% recall.

This means the model detects only 26% of the toxic molecules. With such a low value I can say that our model is not very "sensitive" at detecting active molecules; many actives escape it, which can be seen in the number of false negatives, 65.

Here you can see the results on the screen: (screenshot)
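The same values can be obtained directly with scikit-learn; a sketch reusing y_valid and y_pred from the confusion-matrix step:

```python
# Precision / recall sketch for the validation set
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_valid, y_pred)  # TP / (TP + FP)
recall = recall_score(y_valid, y_pred)        # TP / (TP + FN)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
```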

As I said before, what interests us here are the active molecules, so we have calculated and analysed the metrics for the positive class. As you can see, our model is very good at predicting negative (non-toxic) molecules, simply because there are many more negative cases than positive ones, but that is not the case we care about. It could even be misleading, making us assume the model is very good at predicting toxicity in humans when it really isn't, because what we need is to predict the toxic molecules.

carcablop commented 1 year ago

IMPROVING THE METRIC: As shown above, the metric we are most interested in improving is recall, since we need to reduce the number of false negatives: we do not want toxic (positive) molecules to be classified as inactive, as this could be dangerous. This is why recall is such a useful metric for this dataset.

One way to improve this metric is to reduce the number of false negatives. To do so, we can adjust the classification threshold (0.5) to a lower value, so that the model returns more "1"s and fewer "0"s, that is, more actives. It can be modified as sketched below.
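A sketch of the threshold adjustment, reusing p_active and y_valid from the validation step:

```python
# Threshold-adjustment sketch
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

threshold = 0.3  # default was 0.5; lowering it turns more borderline predictions into "active"
y_pred_low = (np.array(p_active) >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_valid, y_pred_low).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
print(f"Precision: {precision_score(y_valid, y_pred_low):.2f}  "
      f"Recall: {recall_score(y_valid, y_pred_low):.2f}")
```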

By lowering the threshold to 0.3 I obtained the following results:

(screenshot: metrics at threshold 0.3)

As can be seen from the comparison, I have decreased my false-negative count and was able to improve my recall metric to 39.7%. If I keep lowering the threshold to 0.2, recall keeps improving, but the number of false positives increases a lot, which could be very costly: I could be treating molecules as active unnecessarily, and my precision metric also gets worse. So depending on how the threshold is set, I may or may not improve the metric that matters for the objective of my model.

carcablop commented 1 year ago

Now I calculate the metrics for the test dataset:
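The same pipeline used for validation can be reused here; a minimal sketch, assuming the trained model and the test .csv file from the earlier sketches:

```python
# Test-set evaluation sketch
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix, recall_score

test_df = pd.read_csv("tox21_nr_er_test.csv")
smiles_test = test_df["Drug"].tolist()
y_test = test_df["Y"].tolist()

# Probability of the active class for every test molecule
p_active_test = np.array(model.predict_proba(smiles_test))[:, 1]
print(f"ROC AUC (test): {roc_auc_score(y_test, p_active_test):.3f}")

# Confusion matrix and recall at the default 0.5 threshold
y_pred_test = (p_active_test >= 0.5).astype(int)
print(confusion_matrix(y_test, y_pred_test))
print(f"Recall (test): {recall_score(y_test, y_pred_test):.2f}")
```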

ROC AUC: we use sklearn to easily calculate the ROC AUC for the test set:

(screenshot: test-set ROC AUC)

Here it can be seen that the AUROC is lower than the one obtained on the validation set; this is because the test set contains more molecules and is even more skewed towards inactive molecules than the validation set. Now I draw the ROC curve:

(test-set ROC curve plot)

CONFUSION MATRIX FOR TEST SET:

(test-set confusion matrix)

Compared to my validation set, the test set has more false negatives, so my recall metric is very low compared to the validation one. This is due to the large imbalance between active and inactive molecules.

GemmaTuron commented 1 year ago

Hi @carcablop !

Very good job on building the models and reasoning about the outputs, thanks for your work. I'll mark this as completed and you can go on to finishing the Outreachy contribution and preparing the final application.