abduskhazi opened this issue 2 years ago · priority 2
Hi All, I have completed the implementation of removing families of features, the following are my observation -
Random forest observations (Execution ID = 3887610308) With 100 trees and Infinite depth
Scoring | No Exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
---|---|---|---|---|---|---|---|---|---|
Training R2 | 0.972 | 0.970 | 0.972 | 0.972 | 0.972 | 0.972 | 0.972 | 0.971 | 0.971 |
Validation R2 | 0.803 | 0.792 | 0.805 | 0.802 | 0.802 | 0.804 | 0.802 | 0.802 | 0.802 |
OOB Score | 0.799 | 0.788 | 0.798 | 0.798 | 0.798 | 0.799 | 0.798 | 0.797 | 0.797 |
The above was obtained by re-training the model for each exclusion. (The model seems to overfit; discussion needed.)
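The training R2, validation R2, and OOB score rows above can be reproduced with scikit-learn's `RandomForestRegressor`. A minimal sketch, using synthetic data as a stand-in for the baked features (the real `X`/`y` come from `data_bakery`):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the baked training data
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# 100 trees, unrestricted depth, out-of-bag scoring enabled
rf = RandomForestRegressor(n_estimators=100, max_depth=None,
                           oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"Training R2:   {rf.score(X_train, y_train):.3f}")
print(f"Validation R2: {rf.score(X_val, y_val):.3f}")
print(f"OOB score:     {rf.oob_score_:.3f}")
```

A large gap between training and validation R2, as in the table, is the overfitting symptom mentioned above; the OOB score tracking the validation R2 suggests the validation estimate itself is reliable.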
Simple Linear Regression (Execution ID = 3767311922)
Scoring | No Exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
---|---|---|---|---|---|---|---|---|---|
Training R2 | 0.456 | 0.403 | 0.454 | 0.454 | 0.452 | 0.454 | 0.452 | 0.454 | 0.429 |
Validation R2 | 0.423 | 0.375 | 0.424 | 0.421 | 0.418 | 0.423 | 0.418 | 0.420 | 0.392 |
To exclude feature families, the data is obtained using the following code:

```python
import data_bakery as bakery

X, y, features = bakery.bake_train_Xy_exclude_features_families(["AUTOCORR2D_", ...])
```
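Internally, an exclusion helper like this presumably just drops every column whose name starts with one of the given family prefixes. A hypothetical sketch (the function name and DataFrame columns here are illustrative, not the actual `data_bakery` implementation):

```python
import pandas as pd

def exclude_feature_families(df: pd.DataFrame, prefixes: list) -> pd.DataFrame:
    """Return a copy of df without columns matching any family prefix."""
    keep = [c for c in df.columns if not c.startswith(tuple(prefixes))]
    return df[keep]

# Illustrative feature table
df = pd.DataFrame({"AUTOCORR2D_1": [1], "AUTOCORR2D_2": [2], "Chi0": [3], "MolWt": [4]})
print(list(exclude_feature_families(df, ["AUTOCORR2D_"]).columns))
# → ['Chi0', 'MolWt']
```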
Regards, Abdus Salam Khazi
I am curious whether you tried removing multiple families?
Hi Simon, there are 2^8 combinations and I have not written a script for this yet. I will test it and report back this week.
Regards, Abdus Salam Khazi
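Enumerating the 2^8 = 256 family subsets is straightforward with `itertools`. A sketch (the family list is taken from the tables above; feeding each subset to `bake_train_Xy_exclude_features_families` would be the placeholder training step):

```python
from itertools import combinations

FAMILIES = ["AUTOCORR2D_", "Chi", "EState_VSA", "PEOE_VSA",
            "SMR_VSA", "SlogP_VSA", "VSA_EState", "fr_"]

def all_subsets(families):
    """Yield every subset of the family list, including the empty set."""
    for k in range(len(families) + 1):
        yield from combinations(families, k)

subsets = list(all_subsets(FAMILIES))
print(len(subsets))  # → 256
```

Each subset would then be passed as the exclusion list for one training run, which is why the full sweep costs 256 model fits.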
I did a couple of small experiments (both with execution ID 3887610308):
Based on this, it seems safe to assume these families are certainly not having a positive impact; if anything, the impact is slightly negative. You could argue that AUTOCORR2D_* helps a bit, but the effect looks marginal.
I would not spend time testing all permutations (unless you are really enthusiastic); I would invest the time elsewhere.
Hi Simon, Thanks for the input. I will try it out and report the results.
Regards, Abdus Salam Khazi
Random forest investigation:

I tried removing specific features instead of families of features. My observations:

- `ligand.chi1v` was very important according to Gini importance. When I removed this feature, features like `ligand.chi4v` or `ligand.chi2v` became important instead, so I concluded that these features are correlated.
- I set `max_features=0.2`, so each split considers only `0.2 * num_features` features, which are randomly selected. I hope this makes our model more robust at predicting the affinity even if some features were not accurately measured/reported.
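The correlated-feature effect described above is easy to demonstrate: when two features are near-duplicates, Gini importance is split between them, and removing one shifts its importance onto the twin. A sketch with synthetic data (the features here stand in for `chi1v`/`chi2v`, they are not the real descriptors):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.01, size=1000)  # near-copy of a (like chi1v vs chi2v)
c = rng.normal(size=1000)                  # independent feature
y = 3 * a + c

# max_features=0.2 forces splits to consider a random 20% of features,
# so both a and b get used and share the importance
X = np.column_stack([a, b, c])
rf = RandomForestRegressor(n_estimators=50, max_features=0.2,
                           random_state=0).fit(X, y)
print(rf.feature_importances_)  # importance shared between columns 0 and 1

# Drop a: its twin b absorbs the importance
X2 = np.column_stack([b, c])
rf2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X2, y)
print(rf2.feature_importances_)
```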
[Figure: Pearson correlation heatmap. Pearson coefficients range over [-1, 1].]
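The heatmap itself can be computed with `DataFrame.corr`, and the same matrix can be scanned for highly correlated feature pairs that are candidates for removal. A sketch with illustrative column names (not the real descriptor table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"chi1v": rng.normal(size=100)})
df["chi2v"] = df["chi1v"] * 0.9 + rng.normal(scale=0.1, size=100)  # correlated pair
df["mol_wt"] = rng.normal(size=100)                                # independent

corr = df.corr(method="pearson")  # Pearson coefficients lie in [-1, 1]
print(corr.round(2))

# Flag highly correlated feature pairs as removal candidates
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and abs(corr.loc[i, j]) > 0.9]
print(pairs)  # → [('chi1v', 'chi2v')]
```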
Investigate the effect of adding or removing 'families' of descriptors. More specifically: AUTOCORR2D_*, Chi*, EState_VSA*, PEOE_VSA*, SMR_VSA*, SlogP_VSA*, VSA_EState*, fr_*.