GemmaTuron closed this issue 1 year ago.
Hello @GemmaTuron, I am sharing the link to the Colab notebook with the development of my model: https://colab.research.google.com/drive/1pQWtLVjICg_2oDde74N5WvmfYCKvq7lT#scrollTo=QD8XeEAaYvOf
Next, I will explain step by step how I built and trained a binary classification model for the dataset that was assigned to me, followed by an analysis of the results and a conclusion.
STEP 1: GETTING MY TRAINING DATA
For this contribution, a binary ML classification model will be trained and built to predict and evaluate the toxicity of chemical compounds or drugs in humans. For this training, I use the "Tox21 NR-ER" dataset from the Therapeutics Data Commons (TDC).
As described in the TDC database, Tox21 is a dataset containing toxicity measurements of drugs analyzed in 12 different assays. The assay I am going to work on is NR-ER, which stands for Nuclear Receptor - Estrogen Receptor alpha.
Load my data:
The first step is to install PyTDC in order to download the Tox21 NR-ER dataset. Since Tox21 contains several assays, I first list all of them; the NR-ER assay appears in position 4 of the list, so I can then load the complete dataset for that specific assay. To train an ML model we divide the dataset into three subsets: train (to train the model), validation (to evaluate the performance of the model during development), and test (to evaluate the performance of the final model). Once the three subsets are obtained, the head() function lets me verify the type of data each one contains. For example, I can see that the "train" set contains the columns Drug_ID, Drug, and Y.
The data we are going to use are the drugs, represented as molecules (SMILES strings), and "Y" as the output (bioactivity); the output values are 1 and 0. Since we are working with a binary classification model, the classes are 0 and 1, where 1 is active (toxic) and 0 is inactive (non-toxic).
Finally, each of these subsets is saved to a .csv file.
STEP 2: UNDERSTAND MY DATA: ANALYSIS AND GRAPHICAL REPRESENTATION
What are we trying to predict?:
For the given Tox21 NR-ER dataset, what we are trying to predict is the toxicity of the molecules, so I need to identify the toxic molecules in each of my datasets. The toxicity information is stored in the column labeled "Y": a value of 0 means the molecule is not toxic to humans, and 1 means it is toxic. The predictions are made on the molecules themselves, which are stored in the column labeled "Drug".
Is this a classification or regression problem?
It is a classification problem, since the variable we want to predict is binary: 0 (inactive) or 1 (active).
How many data points do we have in total and in each set (train/validation/test)? How many actives and inactives do we have?
Once the dataset has been split, I count the number of molecules in each of the train, validation, and test sets, and classify them into actives and inactives.
For a clearer view of the data and more concise code, I build a new data structure with a pandas DataFrame; this way I can display the counts in table form and compare them:
Then, I represent the data in a bar chart:
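The counting and plotting steps can be sketched with pandas and matplotlib. The three DataFrames below are small stand-ins for the real splits, which in practice would be read back from the saved CSV files:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Small stand-ins for the real splits; in practice, e.g. pd.read_csv('train.csv')
splits = {
    'train': pd.DataFrame({'Drug': ['CCO', 'c1ccccc1', 'CCN', 'CCC'], 'Y': [0, 0, 0, 1]}),
    'valid': pd.DataFrame({'Drug': ['CO', 'CN'], 'Y': [0, 1]}),
    'test':  pd.DataFrame({'Drug': ['CC', 'CCCC'], 'Y': [1, 0]}),
}

# One row per split, one column per class, so the table is easy to compare
summary = pd.DataFrame({
    name: df['Y'].value_counts().reindex([0, 1], fill_value=0)
    for name, df in splits.items()
}).T
summary.columns = ['inactive', 'active']
print(summary)

# Bar chart showing the class imbalance per split
summary.plot(kind='bar')
plt.ylabel('number of molecules')
plt.savefig('class_balance.png')
```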
Here I can see that there is an imbalance between the active and inactive molecules: the number of actives is very small compared to the inactives. In other words, we have an imbalanced dataset. It is therefore important that the model is trained well on the active molecules, since there are very few of them and it could be difficult for the model to learn them. The compounds that could have a high level of toxicity are precisely the ones that should be prioritized when validating the model.
Draw the molecule
In the analysis of the results, the chemical structure of the molecules will also be taken into account, since it can indicate to scientists which types of chemical structures tend to be toxic and which do not.
Inactive molecule for the train set:
Active molecule of the train dataset:
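Drawing a molecule from its SMILES can be done with RDKit. The SMILES below is a hypothetical stand-in for illustration, not a molecule taken from the actual train set (in practice it would come from the "Drug" column):

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Hypothetical SMILES standing in for one molecule of the train set
smiles = 'CC(=O)Oc1ccccc1C(=O)O'     # aspirin, for illustration only
mol = Chem.MolFromSmiles(smiles)     # parse the SMILES into an RDKit molecule
Draw.MolToFile(mol, 'molecule.png')  # save a 2D depiction of the molecule
```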
STEP 3: Train your ML Model:
Before this step, the SMILES of each dataset were extracted by accessing the "Drug" column of each DataFrame, and the outputs Y (0 or 1) by accessing the "Y" column. These data are stored in lists and used in the following steps.
For the creation and training of the model, we install the lazy-qsar package from Ersilia, which includes a Morgan fingerprint featurizer that converts SMILES strings into vectors (vectors are what a computer can read). Once installed, I import the MorganBinaryClassifier class, instantiate it to create the model, and use the fit function for training, passing as inputs the SMILES (accessing the "Drug" column of the train set) and their outputs (the "Y" column of the train set). The estimated time for model training was 15 min, and the model was trained in the indicated time. Finally, I save the model in joblib format.
STEP 4: MODEL VALIDATION
Once the model is trained, it can be used to predict on the validation dataset. Here I use the predict_proba() function to get more informative outputs for the evaluation: passing the molecules of the validation set as input, it returns for each molecule the probability of being inactive and the probability of being active, rather than a direct 0 or 1.
Once I get the predictions for my data set validation, I can evaluate my model, and see how good it is for classification.
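The shape of the predict_proba() output described above can be illustrated with any scikit-learn-style binary classifier; here a logistic regression on toy numeric data stands in for the trained MorganBinaryClassifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-feature dataset; the real model takes SMILES, but the output
# of predict_proba has the same shape for any binary classifier.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, 2)
p_inactive = proba[:, 0]      # column 0: probability of class 0 (non-toxic)
p_active = proba[:, 1]        # column 1: probability of class 1 (toxic)

# The two columns sum to 1 for every sample
print(proba.sum(axis=1))
```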
STEP 5: MODEL PERFORMANCE:
To evaluate the performance of the model and see if it is good enough, we use the following performance metrics, all of which are available in the Python scikit-learn (sklearn) package:
1. ROC CURVE: We will use it to see how well the model is working. The ROC curve is an evaluation metric for binary classification models; it plots two parameters, the TPR (true positive rate) against the FPR (false positive rate), at different classification thresholds.
AUC: the area under the curve, a measure of the ability of a classifier to distinguish between positives and negatives. AUC ranges from 0 to 1, where 0 indicates that the predictions are 100% incorrect and 1 indicates that they are 100% correct.
We use sklearn to easily calculate the ROC AUC
Now I draw the ROC curve:
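With scikit-learn, the AUC and the curve take only a few lines. The labels and probabilities below are hypothetical, standing in for the validation labels and the model's predicted probabilities of the active class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Hypothetical true labels and predicted probabilities of the active class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
p_active = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.6])

auc = roc_auc_score(y_true, p_active)            # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, p_active)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], '--', label='random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.savefig('roc_curve.png')
```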
The AUC came out at 0.711. Since 0.5 corresponds to a random classifier and 1 indicates that the predictions are 100% correct, this value suggests the model is making a useful classification. However, taking into account that the validation dataset is very imbalanced, and that in this type of case the inactive molecules greatly outnumber the active ones, it is important to consider other metrics as well; the confusion matrix in particular can give us more information to evaluate performance.
2. CONFUSION MATRIX: Another way to measure the performance of the model is using a confusion matrix.
The matrix compares the actual target values with those predicted by the machine learning model. This lets us see what kinds of hits and misses the model is making; in other words, it gives us a view of how well the model is working and what type of errors it makes. The confusion matrix is a 2x2 matrix with four values:
To evaluate my model on the Tox21 NR-ER dataset, we will use the scikit-learn Python library to obtain the confusion matrix.
The confusion matrix takes as inputs the true labels and the predicted labels. Since predict_proba() returns probabilities rather than 0/1 labels, the probabilities must first be converted into class predictions. Below you can see the error raised before that conversion, and why it was necessary to implement it.
Once the error is resolved:
It can be seen that we now have the lists of zeros and ones from which we obtain the following confusion matrix:
As a result of the confusion matrix we obtain four values:
True Positive(TP): 23, means that 23 active (toxic) molecules were correctly classified as true active by the model.
True Negatives(TN): 519, means that 519 inactive (non-toxic) molecules were correctly classified as true negatives by the model.
False positives (FP): 12, means that 12 molecules were incorrectly predicted by the model as toxic when they were actually non-toxic.
False negatives (FN): 65, means that 65 molecules were incorrectly predicted by the model as non-toxic when they are actually toxic.
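The thresholding and the confusion matrix computation can be sketched as follows, with toy labels and probabilities standing in for the validation labels and the predict_proba() outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predicted probabilities of the active class
y_true = np.array([0, 1, 0, 1, 1, 0])
p_active = np.array([0.2, 0.7, 0.6, 0.3, 0.9, 0.1])

# Binarise the probabilities at the default 0.5 threshold:
# this is the conversion needed before calling confusion_matrix
y_pred = (p_active >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
```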
In this dataset, what we are interested in knowing is which molecules are toxic, so we especially want to avoid false negatives: predicting molecules as non-toxic when they are in fact toxic can be very dangerous, because we could be bringing to the pharmaceutical market (for example) drugs that threaten human health. This is where the confusion matrix is very important, since it warns us about these false negatives. Given the seriousness of the problem, I think it is essential to know how many positive cases are being correctly predicted, so that no one is poisoned by a drug, and thus to verify the reliability of the model.
Therefore, we are going to calculate the Precision and Recall metrics:
Precision: It indicates whether the model is reliable, measuring how many of the predicted positives are actually positive. The formula is: TP / (TP + FP) = 23 / (23 + 12) = 0.65 -> 65% precision.
That is, 65% of the molecules predicted as active were correctly predicted. This precision value is not bad, considering that the data are imbalanced. It is also important to obtain the recall metric, since we need to see how many of the relevant cases are recovered.
Recall: the true positive rate, i.e. the proportion of positive cases that were correctly identified by the model. It measures the model's ability to detect the active (toxic) molecules.
The formula is: TP / (TP + FN) = 23 / (23 + 65) = 0.26 -> 26% recall.
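The two calculations above, using the TP/FP/FN counts from the confusion matrix:

```python
# Precision and recall from the confusion-matrix counts reported above
tp, fp, fn = 23, 12, 65

precision = tp / (tp + fp)  # 23 / 35  ≈ 0.657
recall = tp / (tp + fn)     # 23 / 88  ≈ 0.261

print(f'precision = {precision:.3f}')
print(f'recall = {recall:.3f}')
```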
This means that only 26% of the toxic molecules are detected. With such a low value, I can say that our model is not very "sensitive" to active molecules: many of them escape it, as shown by the 65 false negatives.
Here you can see the results on the screen:
As I said before, what interests us here are the active molecules, so we have calculated and analyzed the metrics for the positive class. As you can see, our model is very good at predicting negative (non-toxic) molecules, simply because there are many more negative cases than positive ones. But that is not the case we care about, and it could cause confusion, leading us to assume the model is very good at predicting toxicity in humans when it really is not, because what matters is predicting the toxic molecules.
IMPROVING THE METRIC: As seen above, the metric we are interested in improving is recall, since we need to reduce the number of false negatives: we do not want toxic molecules classified as inactive, as this could be dangerous. This is why recall is a useful metric for this dataset.
One way to improve this metric is to reduce the number of false negatives. To do this, we can lower the classification threshold from 0.5, so that the model returns more 1s and fewer 0s, that is, more molecules are classified as active.
By lowering the threshold to 0.3, I obtained the following results:
As can be seen in the comparison, I decreased my false negative count and improved my recall to 39.7%. If I keep lowering the threshold to 0.2, recall keeps improving, but the false positive rate increases a lot, which could be very expensive: we could be treating molecules as active unnecessarily, and the precision metric also gets worse. So, depending on how the threshold is set, I may or may not improve a given metric, according to the objective of my model.
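The effect of lowering the threshold can be seen in a small sketch, using hypothetical probabilities of the active class:

```python
import numpy as np

# Hypothetical probabilities of the active (toxic) class for five molecules
p_active = np.array([0.25, 0.45, 0.55, 0.15, 0.35])

pred_default = (p_active >= 0.5).astype(int)  # default threshold 0.5
pred_lowered = (p_active >= 0.3).astype(int)  # lowered threshold 0.3

print(pred_default)  # [0 0 1 0 0]
print(pred_lowered)  # [0 1 1 0 1] -> more molecules classified as active
```

Lowering the threshold turns some borderline molecules into positives, which raises recall (fewer false negatives) at the cost of more false positives.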
Now I calculate the metrics for the test dataset:
ROC AUC: We use sklearn to easily calculate the ROC AUC:
Here it can be seen that the AUROC is lower than that of the validation dataset, because the test set contains more molecules and a larger excess of inactive molecules over active ones than the validation set. Now I draw the ROC curve:
CONFUSION MATRIX FOR TEST SET:
Compared to my validation dataset, here in the test set I have more false negatives, so my recall is much lower. This is due to the large imbalance between active and inactive molecules.
Hi @carcablop !
Very good job on building the models and reasoning about the outputs, thanks for your work. I'll mark this as completed and you can go on to finish the Outreachy contribution and prepare the final application.
Model Title
Tox21 NR-ER (TDC Dataset)
Publication
Hello @carcablop!
As part of your Outreachy contribution, we have assigned you the dataset "Tox 21 NR-ER" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.
Code
No response