abduskhazi opened this issue 2 years ago · priority 2
Hi All, I have completed the implementation of removing families of features, the following are my observation -
Random forest observations (Execution ID = 3887610308) With 100 trees and Infinite depth
Scoring | No Exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
---|---|---|---|---|---|---|---|---|---|
Training R2 | 0.972 | 0.970 | 0.972 | 0.972 | 0.972 | 0.972 | 0.972 | 0.971 | 0.971 |
Validation R2 | 0.803 | 0.792 | 0.805 | 0.802 | 0.802 | 0.804 | 0.802 | 0.802 | 0.802 |
OOB Score | 0.799 | 0.788 | 0.798 | 0.798 | 0.798 | 0.799 | 0.798 | 0.797 | 0.797 |
The above was obtained by re-training the model for each exclusion. (The model seems to overfit; discussion needed.)
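The training R2, validation R2, and OOB score rows above can be reproduced with scikit-learn's `RandomForestRegressor`. A minimal sketch, using synthetic data as a stand-in for the baked features (the real `X`/`y` come from `data_bakery`):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the baked training data
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# 100 trees, unrestricted depth, out-of-bag scoring enabled
rf = RandomForestRegressor(n_estimators=100, max_depth=None,
                           oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print(f"Training R2:   {rf.score(X_train, y_train):.3f}")
print(f"Validation R2: {rf.score(X_val, y_val):.3f}")
print(f"OOB score:     {rf.oob_score_:.3f}")
```

A large gap between training and validation R2, as in the table, is the overfitting symptom mentioned above; the OOB score tracking the validation R2 suggests the validation estimate itself is reliable.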
Simple Linear Regression (Execution ID = 3767311922)
Scoring | No Exclusion | AUTOCORR2D_* | Chi* | EState_VSA* | PEOE_VSA* | SMR_VSA* | SlogP_VSA* | VSA_EState* | fr_* |
---|---|---|---|---|---|---|---|---|---|
Training R2 | 0.456 | 0.403 | 0.454 | 0.454 | 0.452 | 0.454 | 0.452 | 0.454 | 0.429 |
Validation R2 | 0.423 | 0.375 | 0.424 | 0.421 | 0.418 | 0.423 | 0.418 | 0.420 | 0.392 |
To exclude feature families, the data is obtained using the following code:

```python
import data_bakery as bakery

X, y, features = bakery.bake_train_Xy_exclude_features_families(["AUTOCORR2D_", ...])
```
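Internally, an exclusion helper like this presumably just drops every column whose name starts with one of the given family prefixes. A hypothetical sketch (the function name and DataFrame columns here are illustrative, not the actual `data_bakery` implementation):

```python
import pandas as pd

def exclude_feature_families(df: pd.DataFrame, prefixes: list) -> pd.DataFrame:
    """Return a copy of df without columns matching any family prefix."""
    keep = [c for c in df.columns if not c.startswith(tuple(prefixes))]
    return df[keep]

# Illustrative feature table
df = pd.DataFrame({"AUTOCORR2D_1": [1], "AUTOCORR2D_2": [2], "Chi0": [3], "MolWt": [4]})
print(list(exclude_feature_families(df, ["AUTOCORR2D_"]).columns))
# → ['Chi0', 'MolWt']
```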
Regards, Abdus Salam Khazi
I am curious whether you tried removing multiple families?
Hi Simon, there are 2^8 combinations and I have not written a script for this yet. I will test it and report back this week.
Regards, Abdus Salam Khazi
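Enumerating the 2^8 = 256 family subsets is straightforward with `itertools`. A sketch (the family list is taken from the tables above; feeding each subset to `bake_train_Xy_exclude_features_families` would be the placeholder training step):

```python
from itertools import combinations

FAMILIES = ["AUTOCORR2D_", "Chi", "EState_VSA", "PEOE_VSA",
            "SMR_VSA", "SlogP_VSA", "VSA_EState", "fr_"]

def all_subsets(families):
    """Yield every subset of the family list, including the empty set."""
    for k in range(len(families) + 1):
        yield from combinations(families, k)

subsets = list(all_subsets(FAMILIES))
print(len(subsets))  # → 256
```

Each subset would then be passed as the exclusion list for one training run, which is why the full sweep costs 256 model fits.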
I did a couple of small experiments (both with execution ID 3887610308):
Based on this, it seems safe to assume these families are certainly not having a positive impact; if anything, the impact is slightly negative. You could argue that AUTOCORR2D_* helps a bit, but the effect looks marginal.
I would not spend time testing all permutations (unless you are really enthusiastic); I would invest the time elsewhere.
Hi Simon, Thanks for the input. I will try it out and report the results.
Regards, Abdus Salam Khazi
Random forest investigation:

I tried removing specific features instead of families of features. My observations:

- `ligand.chi1v` was very important according to Gini importance. When I removed this feature, features like `ligand.chi4v` or `ligand.chi2v` became important instead, so I concluded that these features are correlated.
- I set `max_features=0.2`, so each split considers only `0.2 * num_features` features, which are randomly selected. I hope this makes our model more robust at predicting the affinity even if some features were not accurately measured/reported.
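The correlated-feature effect described above is easy to demonstrate: when two features are near-duplicates, Gini importance is split between them, and removing one shifts its importance onto the twin. A sketch with synthetic data (the features here stand in for `chi1v`/`chi2v`, they are not the real descriptors):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.01, size=1000)  # near-copy of a (like chi1v vs chi2v)
c = rng.normal(size=1000)                  # independent feature
y = 3 * a + c

# max_features=0.2 forces splits to consider a random 20% of features,
# so both a and b get used and share the importance
X = np.column_stack([a, b, c])
rf = RandomForestRegressor(n_estimators=50, max_features=0.2,
                           random_state=0).fit(X, y)
print(rf.feature_importances_)  # importance shared between columns 0 and 1

# Drop a: its twin b absorbs the importance
X2 = np.column_stack([b, c])
rf2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X2, y)
print(rf2.feature_importances_)
```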
[Figure: Pearson correlation heatmap. Pearson coefficients range over [-1, 1].]
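The heatmap itself can be computed with `DataFrame.corr`, and the same matrix can be scanned for highly correlated feature pairs that are candidates for removal. A sketch with illustrative column names (not the real descriptor table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"chi1v": rng.normal(size=100)})
df["chi2v"] = df["chi1v"] * 0.9 + rng.normal(scale=0.1, size=100)  # correlated pair
df["mol_wt"] = rng.normal(size=100)                                # independent

corr = df.corr(method="pearson")  # Pearson coefficients lie in [-1, 1]
print(corr.round(2))

# Flag highly correlated feature pairs as removal candidates
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and abs(corr.loc[i, j]) > 0.9]
print(pairs)  # → [('chi1v', 'chi2v')]
```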
Investigate the effect of adding or removing 'families' of descriptors. More specifically: AUTOCORR2D_*, Chi*, EState_VSA*, PEOE_VSA*, SMR_VSA*, SlogP_VSA*, VSA_EState*, fr_*.