abduskhazi / PL-Binding-Affinity-Prediction-using-ML

This repository is maintained for the documentation and code of the MSc project at the Bioinformatics Lab, Uni-Freiburg.

Investigation of 'families' of descriptors #2

Open abduskhazi opened 2 years ago

abduskhazi commented 2 years ago

Investigate the effect of adding or removing 'families' of descriptors. More specifically: AUTOCORR2D_*, Chi*, EState_VSA*, PEOE_VSA*, SMR_VSA*, SlogP_VSA*, VSA_EState*, fr_*.
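For reference, each family corresponds to a common prefix of the descriptor names, so membership can be determined with a simple prefix check. Below is a minimal sketch of grouping feature names by family; it assumes the descriptor names are available as a list of strings (for example, the `features` list returned by the project's data bakery) and is not the repository's actual code.

```python
# Minimal sketch (not the repository's code): group descriptor names into
# the families listed above by matching their name prefixes.
from collections import defaultdict

FAMILY_PREFIXES = [
    "AUTOCORR2D_", "Chi", "EState_VSA", "PEOE_VSA",
    "SMR_VSA", "SlogP_VSA", "VSA_EState", "fr_",
]

def group_by_family(feature_names):
    """Map each family prefix to the descriptor names that start with it."""
    families = defaultdict(list)
    for name in feature_names:
        for prefix in FAMILY_PREFIXES:
            if name.startswith(prefix):
                families[prefix].append(name)
                break
        else:
            families["other"].append(name)
    return families
```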

abduskhazi commented 2 years ago

priority 2

abduskhazi commented 2 years ago

Hi all, I have completed the implementation for removing families of features. The following are my observations.

Random forest observations (Execution ID = 3887610308), with 100 trees and unlimited depth. Each column gives the scores after excluding the named descriptor family:

| Scoring | No exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
|---|---|---|---|---|---|---|---|---|---|
| Training R2 | 0.972 | 0.970 | 0.972 | 0.972 | 0.972 | 0.972 | 0.972 | 0.971 | 0.971 |
| Validation R2 | 0.803 | 0.792 | 0.805 | 0.802 | 0.802 | 0.804 | 0.802 | 0.802 | 0.802 |
| OOB score | 0.799 | 0.788 | 0.798 | 0.798 | 0.798 | 0.799 | 0.798 | 0.797 | 0.797 |

The above results were obtained by re-training the model for each exclusion. (The model seems to overfit; discussion needed.)
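For context, here is a minimal sketch of the random-forest configuration described above (100 trees, unlimited depth, OOB scoring) using scikit-learn. The train/validation split, random seed and remaining hyperparameters are assumptions, not values taken from the actual run; X and y are assumed to come from the data bakery call shown further below.

```python
# Sketch of the random forest described above; split, seed and other
# hyperparameters are assumptions, not the values used in the reported run.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(
    n_estimators=100,   # 100 trees
    max_depth=None,     # unlimited depth
    oob_score=True,     # report the out-of-bag R2 estimate
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_train, y_train)

print("Training R2:  ", rf.score(X_train, y_train))
print("Validation R2:", rf.score(X_val, y_val))
print("OOB score:    ", rf.oob_score_)
```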

Simple Linear Regression (Execution ID = 3767311922)

| Scoring | No exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
|---|---|---|---|---|---|---|---|---|---|
| Training R2 | 0.456 | 0.403 | 0.454 | 0.454 | 0.452 | 0.454 | 0.452 | 0.454 | 0.429 |
| Validation R2 | 0.423 | 0.375 | 0.424 | 0.421 | 0.418 | 0.423 | 0.418 | 0.420 | 0.392 |

To exclude feature families, the data is obtained using the following code:

```python
import data_bakery as bakery
X, y, features = bakery.bake_train_Xy_exclude_features_families(["AUTOCORR2D_", ...])
```

Regards, Abdus Salam Khazi

simonbray commented 2 years ago

I am curious whether you tried removing multiple families.

abduskhazi commented 2 years ago

Hi Simon, there are 2^8 combinations and I have not written a script for this yet. I will test it and report back this week.
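For what it is worth, the enumeration could be scripted roughly as follows. This is only a sketch: it reuses the bakery call shown above, and `fit_and_score` is a hypothetical helper (not existing code) that trains a model and returns its validation R2.

```python
# Sketch: evaluate every subset of the eight descriptor families.
# fit_and_score(X, y) is a hypothetical helper returning the validation R2.
from itertools import combinations

import data_bakery as bakery

FAMILIES = ["AUTOCORR2D_", "Chi", "EState_VSA", "PEOE_VSA",
            "SMR_VSA", "SlogP_VSA", "VSA_EState", "fr_"]

results = {}
for r in range(len(FAMILIES) + 1):          # r = number of families excluded
    for excluded in combinations(FAMILIES, r):
        X, y, features = bakery.bake_train_Xy_exclude_features_families(list(excluded))
        results[excluded] = fit_and_score(X, y)

# Best-scoring exclusion sets first.
for excluded, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{score:.3f}  excluded: {excluded or '(none)'}")
```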

Regards, Abdus Salam Khazi

simonbray commented 2 years ago

I did a couple of small experiments (both with execution ID 3887610308):

Based on this, it seems safe to assume these families are certainly not having a positive impact, and may in fact have a slightly negative one. You could perhaps argue that AUTOCORR2D_* is helping a bit, but it seems pretty marginal.

I would not spend time testing all of the combinations (unless you are really enthusiastic); I would invest the time elsewhere.

abduskhazi commented 2 years ago

Hi Simon, Thanks for the input. I will try it out and report the results.

Regards, Abdus Salam Khazi

abduskhazi commented 2 years ago

Random forest investigation:

I tried removing specific features instead of whole families of features. My observations:

I hope this makes our model more robust, so that it can predict the correct affinity even if some features were not accurately measured or reported.
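A minimal sketch of such a single-feature ablation is below (dropping one column at a time and re-scoring). It assumes X is a pandas DataFrame of descriptors and reuses a hypothetical `fit_and_score(X, y)` helper that returns the validation R2; neither is taken from the repository.

```python
# Sketch: drop one descriptor at a time and compare the validation R2 with
# the baseline. fit_and_score(X, y) is a hypothetical helper.
baseline = fit_and_score(X, y)

impact = {}
for column in X.columns:
    impact[column] = baseline - fit_and_score(X.drop(columns=[column]), y)

# A large positive delta means the model loses the most when that feature
# is removed; features with a delta near zero can be dropped at little cost.
for column, delta in sorted(impact.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{column:<20} {delta:+.4f}")
```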

abduskhazi commented 2 years ago

Pearson correlation heatmap. Range = [-1, 1]

[Attached: Pearson correlation heatmaps for the AUTOCORR2D_, Chi, fr_, VSA_EState, SlogP_VSA, SMR_VSA, PEOE_VSA, and EState_VSA descriptor families.]
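For reference, a per-family Pearson correlation heatmap like the ones attached could be produced roughly as follows; the sketch assumes X is a pandas DataFrame of descriptors and uses PEOE_VSA as the example family.

```python
# Sketch: Pearson correlation heatmap for one descriptor family.
# Assumes X is a pandas DataFrame whose columns are the descriptor names.
import matplotlib.pyplot as plt

family_cols = [c for c in X.columns if c.startswith("PEOE_VSA")]
corr = X[family_cols].corr(method="pearson")   # Pearson values lie in [-1, 1]

plt.figure(figsize=(8, 6))
plt.imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar(label="Pearson correlation")
plt.xticks(range(len(family_cols)), family_cols, rotation=90)
plt.yticks(range(len(family_cols)), family_cols)
plt.tight_layout()
plt.show()
```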