dariasor / TreeExtra

Additive Groves, Bagged Trees with Feature Evaluation, Interaction Detection, Visualization of Feature Effects.
http://dariasor.github.io/TreeExtra/

Is there a problem in my implementation? #9


traderforce commented 4 years ago

I want to detect interactions, and I use a real data set from the paper "Detecting Statistical Interactions with Additive Groves of Trees", but I cannot get the same result.

[screenshot: parameter settings from the paper]

I use the real data set Kinematics, which uses the parameters shown above. I split the data into five parts: four parts form the train file, and the remaining one is the validation file. The attribute file looks like this:

[screenshot: attribute file]

and this is the train file:

[screenshot: train file]

As illustrated in the website material, I first run ag_train with the parameters below. [screenshot: ag_train command line]
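For reference, an ag_train invocation of this kind typically looks like the sketch below. The flag meanings (-t train set, -v validation set, -r attribute file, -a alpha, -n N, -b bagging iterations) follow my reading of the TreeExtra Additive Groves manual; the file names and parameter values here are hypothetical, so verify them against the usage message of your ag_train build.

```sh
# Hypothetical ag_train run on kin8nm (file names and parameter values are
# placeholders; check flag names against your TreeExtra build's manual):
#   -t  training set        -v  validation set      -r  attribute file
#   -a  alpha (tree size parameter)
#   -n  N (trees per Grove) -b  bagging iterations
./ag_train -t kin8nm.train -v kin8nm.valid -r kin8nm.attr -a 0.01 -n 8 -b 100
```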

log.txt shows this:

[screenshot: log.txt output]

The parameters are different from those illustrated in the paper. Then I run feature selection with the parameters the trained model gives, and it reports that only four features are used, so I wonder what is wrong in my steps. [screenshot: feature selection output]

Thank you.

dariasor commented 4 years ago

Kinematics has multiple data sets; did you use kin8nm specifically? The package and the algorithms have evolved since the paper was published, but it is strange that a much smaller model is chosen and no expansion by bagging is suggested. Can you send me the output file called performance.txt, as well as your split of the data set? I'll take a look. My e-mail is fisharik@gmail.com

traderforce commented 4 years ago

Many thanks for your help. I have sent the files to your Gmail; my email account is qtc_trader@163.com. Details are written in the email.

dariasor commented 4 years ago

Thanks! It seems that with this train/validation data split we don't get the good large model needed for fine-tuned interaction detection. These public data sets are not huge, and it is quite possible that there is a lot of variance between different splits. Did you randomize the data before splitting it? I definitely remember that we did for the paper. If the order of the data points is non-random, you end up with an added difference between the train and validation data sets, and a smaller model as a result.
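For illustration, a randomized 4:1 split can be done with GNU coreutils as in the minimal sketch below. The file names are hypothetical, and the fixed random source just makes the shuffle reproducible.

```sh
# Shuffle the data before splitting, so train and validation rows come from
# the same distribution; file names are placeholders. The <(yes 42) trick
# feeds shuf a deterministic byte stream, making the shuffle reproducible
# (requires bash for process substitution).
shuf --random-source=<(yes 42) kin8nm.data > kin8nm.shuffled
total=$(wc -l < kin8nm.shuffled)
ntrain=$(( total * 4 / 5 ))          # 4 parts train, 1 part validation
head -n "$ntrain" kin8nm.shuffled > kin8nm.train
tail -n +"$(( ntrain + 1 ))" kin8nm.shuffled > kin8nm.valid
```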