🦠 Model Request: Tox21 NR-PPAR-gamma

GemmaTuron commented 2 years ago

Model Title

Tox21 NR-PPAR-gamma (TDCommons)

Publication

Hello @Dorothy2020!

As part of your Outreachy contribution, we have assigned you the dataset "Tox21 NR-PPAR-gamma" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.

Code

No response

Dorothy2020 commented 2 years ago

Thanks @GemmaTuron working on it

Dorothy2020 commented 2 years ago

Hello @GemmaTuron

Ideation

The idea is to build a binary classification ML model using MorganBinaryClassifier on the dataset "Tox21 NR-PPAR-gamma". After careful analysis, I discovered that the dataset is imbalanced. I am therefore dealing with an imbalanced classification problem, which is usually harder to model.

Load The Data

First I connected Colab to Google Drive, then installed and imported the packages the model needs to run. I called the get_split function to split the data obtained from TDC and converted each split to a DataFrame, a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. I then saved the datasets to Google Drive using the pandas package. pandas is an open-source Python package that provides data structures such as the DataFrame for working with tabular data.

Analyse Your Data

I used the three datasets to check the number of molecules in each:

[ ] Number of molecules in the train dataset: 4515
[ ] Number of molecules in the test dataset: 1290
[ ] Number of molecules in the valid dataset: 645
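
As a quick sanity check on the counts above, here is a minimal sketch that computes what fraction of the full dataset each split holds. It only uses the three numbers reported in this comment; the variable names are my own.

```python
# Split sizes as reported in this comment.
split_sizes = {"train": 4515, "test": 1290, "valid": 645}

total = sum(split_sizes.values())

# Fraction of the full dataset that falls into each split.
split_fractions = {name: size / total for name, size in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_sizes[name]} molecules ({fraction:.0%} of {total})")
```

The three counts add up to 6450 molecules, which corresponds to a 70/20/10 train/test/valid split.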

Data Visualization

Matplotlib is a data visualization and graphical plotting library for Python, which I used to plot the outcome distribution as both a pie chart and a bar graph. Both plots show that the training set contains far more inactive molecules than active ones, which is what makes this an imbalanced dataset.

I also used the RDKit package to visualize the inactive and active molecules.

Model Evaluation

The performance of classification models can be evaluated with ROC curves and related metrics:

[ ] ROC-AUC
[ ] Contingency tables
[ ] Precision and Recall Scores

The ROC-AUC metric is based on the receiver operating characteristic (ROC) curve, which plots the True Positive Rate (TPR, the fraction of actual positives that are correctly identified) against the False Positive Rate (FPR, the fraction of actual negatives that are wrongly flagged as positive).
The area under this curve is 0.5 for a model that ranks molecules at random and 1.0 for a model that ranks every active above every inactive. Consequently, a good model will have a large ROC-AUC.
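
To make the ROC-AUC idea concrete, here is a minimal pure-Python sketch. It uses the fact that ROC-AUC equals the probability that a randomly chosen active gets a higher score than a randomly chosen inactive (ties count as half). The function name and the toy scores are made up for illustration and are not taken from the notebook; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random active outranks a random inactive."""
    actives = [s for y, s in zip(y_true, y_score) if y == 1]
    inactives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not actives or not inactives:
        raise ValueError("ROC-AUC needs at least one active and one inactive")

    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0   # active correctly ranked above inactive
            elif a == i:
                wins += 0.5   # tie counts as half
    return wins / (len(actives) * len(inactives))


# Toy labels: 4 inactives followed by 2 actives (hypothetical data).
y_true = [0, 0, 0, 0, 1, 1]

perfect_scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]   # every active above every inactive
constant_scores = [0.0] * 6                       # same score for every molecule
mixed_scores = [0.1, 0.2, 0.3, 0.85, 0.8, 0.9]    # one inactive outranks one active

print(roc_auc(y_true, perfect_scores))   # 1.0
print(roc_auc(y_true, constant_scores))  # 0.5
print(roc_auc(y_true, mixed_scores))     # 0.875
```

Note that a model giving every molecule the same score lands at 0.5, no better than random ranking, even though its accuracy on an imbalanced dataset can look high.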

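Precision and recall

Precision and recall are computed from the confusion matrix and tell the analyst what kind of errors the model makes. Precision is the fraction of molecules predicted as toxic that are actually toxic; recall is the fraction of actually toxic molecules that the model finds.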

Confusion Matrix

- 623 molecules were classified as non-toxic and are actually non-toxic (True Negatives)
- 0 molecules were classified as toxic and are actually toxic (True Positives)
- 22 molecules were classified as non-toxic yet are actually toxic (False Negatives)
- 0 molecules were classified as toxic yet are actually non-toxic (False Positives)
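
To interpret these counts, here is a minimal sketch that turns the four cells of the confusion matrix into accuracy, precision and recall. It assumes the counts above read as 623 true negatives, 22 false negatives, 0 true positives and 0 false positives; the function name is my own.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision is undefined when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    # Recall is undefined when there are no actual positives.
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Counts from the confusion matrix above (assumed reading, see the note before this block).
metrics = classification_metrics(tp=0, fp=0, tn=623, fn=22)

print(f"accuracy:  {metrics['accuracy']:.3f}")
print(f"precision: {metrics['precision']}")
print(f"recall:    {metrics['recall']}")
```

With these counts the accuracy is about 0.966, but recall is 0 and precision is undefined because no molecule was predicted as toxic. In other words, the high accuracy only reflects the class imbalance: the model labels everything as non-toxic and misses all 22 toxic molecules.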

Conclusion

To conclude, I have found that these are the metrics for evaluating a binary classification model.

Recommendation

Model performance can be improved by increasing the training time. The training time is supposed to follow the value passed in time_sec_budget: if we pass 600 seconds the model should train for 10 minutes, and if we pass 1800 seconds it should train for 30 minutes. I tried this, but it is not working. Another way to improve model performance is to resample the data. This can be done in two parts: first a training set, and second a validation (hold-out) set.
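
As an illustration of the resampling idea, here is a minimal sketch of random oversampling, which is one of several possible resampling strategies and not necessarily the one that will work best here. It duplicates randomly chosen minority-class rows until both classes have the same size. The column names "Drug" and "Y" and the toy data are assumptions for illustration. Resampling should only be applied to the training set, so that the validation (hold-out) set keeps the real class distribution.

```python
import random


def oversample_minority(rows, label_key="Y", seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    actives = [r for r in rows if r[label_key] == 1]
    inactives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([actives, inactives], key=len)
    if not minority:
        raise ValueError("cannot oversample: one class has no examples")

    # Draw minority rows with replacement to fill the gap between the classes.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced training set: 3 actives and 97 inactives.
toy_train = [{"Drug": f"mol_{i}", "Y": 1 if i < 3 else 0} for i in range(100)]

balanced_train = oversample_minority(toy_train)
n_active = sum(row["Y"] for row in balanced_train)
print(len(balanced_train), n_active, len(balanced_train) - n_active)
```

After oversampling, the toy training set has 97 actives and 97 inactives, while the original list is left unchanged.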

Dorothy2020 commented 2 years ago

Hello, @GemmaTuron I am done with week 3. Kindly check if everything is working fine. Thank you

EstherIdabor commented 2 years ago

Hi @Dorothy2020, your model doesn't look like it's performing well. A hint on how to make it perform better: increase the training time and try resampling your data. Also, when reporting on the performance of your model, you should include the AUROC value and comment on it.

Dorothy2020 commented 2 years ago

@EstherIdabor thank you for this

GemmaTuron commented 2 years ago

Hi @Dorothy2020 ,

I am missing a bit of interpretation of the results you are getting, and of what you could do to improve the model. Please comment on this issue thread, link it to your Outreachy contribution profile, and go on to prepare the final application.

Dorothy2020 commented 2 years ago

Hello, @GemmaTuron I have updated everything and now it looks better

Dorothy2020 commented 2 years ago

Hello @GemmaTuron, I have updated the results and recommendations, and it now looks better. Here is the link: Dorothy2020 Outreachy Contribution. Thank you.

Dorothy2020 commented 2 years ago

Hello @GemmaTuron Kindly assist me in closing the issue. Thank you

Dorothy2020 commented 2 years ago

Hello kindly check my updated GitHub issue. Thank you

ersilia-os / ersilia