Closed GemmaTuron closed 2 years ago
Thanks @GemmaTuron working on it
Hello @GemmaTuron
Ideation
The idea is to build a binary classification ML model using MorganBinaryClassifier on the dataset "Tox21 NR-PPAR-gamma". After careful analysis, I discovered that the dataset is imbalanced. I am therefore dealing with an imbalanced classification problem, which is usually difficult to model.
First, I connected Colab to Google Drive.
Load The Data
I installed and imported the packages needed for the model to run successfully. I called the get_split function to split the data obtained from TDC, then converted each split to a DataFrame (a data structure that organizes data into a two-dimensional table of rows and columns, much like a spreadsheet). Finally, I saved the datasets to Google Drive using pandas, an open-source Python package for working with tabular data.
Analyse Your Data
I used the three datasets (train, validation and test) to check the number of molecules in each.
Data Visualization
Matplotlib is a data visualization and plotting library for Python, which I used to plot the outcomes as both a pie chart and a bar graph. Both plots show that the dataset is imbalanced: there are many more inactive molecules than active ones.
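As a rough illustration of the class-balance check described above, the counts behind such a pie chart or bar graph can be computed as follows. This is only a sketch: the helper name `class_balance` and the toy labels are hypothetical and are not the Tox21 data; it assumes labels are encoded as 1 for active and 0 for inactive.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, fraction)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Hypothetical labels (NOT the real dataset): 1 = active, 0 = inactive.
toy_labels = [0] * 95 + [1] * 5

balance = class_balance(toy_labels)
for label, (n, fraction) in balance.items():
    print(f"label {label}: {n} molecules ({fraction:.1%})")
```

Running the same check on each split separately shows whether the imbalance is similar across train, validation and test.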
I also used the RDKit package to visualize the inactive and active molecules, as shown below.
Model Evaluation
The performance of a classification model can be evaluated with ROC curves and related metrics:
[ ] ROC-AUC
[ ] Contingency tables
[ ] Precision and Recall Scores
The ROC-AUC metric is the area under the receiver operating characteristic (ROC) curve. The curve plots the True Positive Rate (TPR, the fraction of actual positives that are correctly predicted) against the False Positive Rate (FPR, the fraction of actual negatives that are wrongly predicted as positive).
Consequently, a good model will have a ROC-AUC close to 1 (a value of 0.5 corresponds to random guessing), as shown below
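To make the metric more concrete, here is a minimal sketch that computes ROC-AUC directly from its pairwise interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. The function name `roc_auc` and the toy scores are illustrative assumptions, not the results of this model; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    Equals the probability that a random positive is ranked above a
    random negative (ties count as 0.5). Quadratic in the number of
    samples, so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities (NOT real model output).
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(toy_true, toy_score)
print(f"ROC-AUC = {auc:.2f}")
```

Because the metric depends only on how the scores rank the molecules, it is not inflated by the large number of inactive molecules in the way plain accuracy can be.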
Precision and recall
Precision and recall are computed from the confusion matrix and tell the analyst how much, and what kind of, error the model makes.
Confusion Matrix
- 623 molecules classified by the model as non-toxic are actually non-toxic (True Negatives)
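For reference, the four cells of a confusion matrix, and the precision and recall derived from them, can be sketched as below. The function names and the toy predictions are hypothetical and do not reproduce the numbers reported in this issue; the sketch assumes 1 means toxic (positive) and 0 means non-toxic (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # toxic, predicted toxic
        elif truth == 0 and pred == 1:
            fp += 1  # non-toxic, predicted toxic
        elif truth == 0 and pred == 0:
            tn += 1  # non-toxic, predicted non-toxic
        else:
            fn += 1  # toxic, predicted non-toxic
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground truth and predictions (NOT real model output).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(toy_true, toy_pred)
precision, recall = precision_recall(tp, fp, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On an imbalanced dataset the true-negative count can be large while recall on the toxic class stays low, which is why it helps to report the full matrix and not just accuracy.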
Conclusion
To conclude, I have found that ROC-AUC, the confusion matrix, and precision and recall are the metrics for evaluating a binary classification model.
Recommendation
Model performance can be improved by increasing the training time. The training time is supposed to follow the value passed to time_sec_budget: if we pass 600 seconds, the model should train for 10 minutes, and if we pass 1800 seconds, it should train for 30 minutes. I have tried this, but it is not working. Another way to improve model performance is to resample the data, working with two parts: first a training set, and second a validation (hold-out) set.
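One possible form of the resampling mentioned above is random oversampling of the minority class, applied to the training set only so that the validation (hold-out) set keeps its original class distribution. The sketch below is an assumption about how this could be done, not the approach used in the Colab template: the function name `oversample_minority` is hypothetical, and the column names `Drug` and `Y` are assumed to match the split data.

```python
import random


def oversample_minority(rows, label_key="Y", seed=0):
    """Duplicate minority-class rows at random until all classes are the same size.

    `rows` is a list of dicts; `label_key` names the class label field.
    Returns a new shuffled list; the input list is not modified.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement to reach the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training rows (NOT the real dataset): 9 inactive, 1 active.
toy_train = [{"Drug": f"mol{i}", "Y": 0} for i in range(9)]
toy_train.append({"Drug": "mol9", "Y": 1})

balanced_train = oversample_minority(toy_train)
n_active = sum(row["Y"] for row in balanced_train)
print(f"{len(balanced_train)} rows, {n_active} active")
```

Oversampling only repeats existing active molecules, so it cannot add new chemical information; the effect on ROC-AUC, precision and recall should still be measured on the untouched hold-out set.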
Hello, @GemmaTuron I am done with week 3. Kindly check if everything is working fine. Thank you
Hi @Dorothy2020, your model doesn't look like it's performing well. A hint on how to make it perform better: increase the training time and try resampling your data. Also, when reporting on the performance of your model, you should include the AUROC value and comment on it.
@EstherIdabor thank you for this
Hi @Dorothy2020 ,
I am missing a bit of interpretation of the results you are getting, and of what you could do to improve the model. Please comment on this issue thread, link it to your Outreachy contribution profile, and go on to prepare the final application.
Hello, @GemmaTuron I have updated everything and now it looks better
Hello @GemmaTuron, I have updated the results and recommendations, and it now looks better. Here is the link: Dorothy2020 Outreachy Contribution. Thank you.
Hello @GemmaTuron Kindly assist me in closing the issue. Thank you
Hello kindly check my updated GitHub issue. Thank you
Model Title
Tox21 NR-PPAR-gamma (TDCommons)
Publication
Hello @Dorothy2020!
As part of your Outreachy contribution, we have assigned you the dataset "Tox21 NR-PPAR-gamma" from the Therapeutics Data Commons to try and build a binary classification ML model. Please copy the provided Google Colab template and use this issue to provide updates on the progress. We'll value not only being able to build the model but also interpreting its results.
Code
No response