This sounds interesting.
Would love to take it for a test spin. Is the code available?
Hi Inversion :) It's not public yet.
Sounds interesting. Do you want to make a PR on this?
@Far0n - maybe public in 37 days? :-)
@walterreade - I don't know what you are talking about. :P
It's currently written in C#, outputting an Excel file (don't blame me for that ^^). I will put it on my GitHub as soon as I'm done with bug fixing and code cleaning.
Looking forward to it. Maybe I'll port it to Fortran. (Or maybe Python.) :-)
Here is an example screenshot from the output:
So far, I'm summing up gain & fscore and computing the ranks according to gain, fscore & gain divided by fscore. Is there any value in using cover somehow?
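For reference, a minimal sketch of these rankings using the Python wrapper (assuming a trained `Booster` and a wrapper version recent enough to expose gain and cover via `get_score`; this is not xgbfi itself):

```python
import pandas as pd
import xgboost as xgb

def rank_features(bst: xgb.Booster) -> pd.DataFrame:
    """Rank features by total gain, fscore (split count) and gain per split."""
    gain = bst.get_score(importance_type='gain')      # average gain per split
    fscore = bst.get_score(importance_type='weight')  # number of splits
    cover = bst.get_score(importance_type='cover')    # average cover per split

    rows = []
    for feat, n_splits in fscore.items():
        total_gain = gain.get(feat, 0.0) * n_splits   # back out the summed gain
        rows.append({'feature': feat,
                     'fscore': n_splits,
                     'gain': total_gain,
                     'gain_per_split': total_gain / n_splits,
                     'cover': cover.get(feat, 0.0)})

    df = pd.DataFrame(rows)
    for col in ('gain', 'fscore', 'gain_per_split'):
        df['rank_' + col] = df[col].rank(ascending=False, method='min').astype(int)
    return df.sort_values('rank_gain')
```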
@tqchen Is the "xgb-gain" the pure information gain or the gain ratio or something similar? Would it make sense to weight by tree index => favor features which accumulate gain in early rounds, or to normalize the paths in a tree by the TotalGain of the current tree?
Another thing I have in mind: Assume we have stored the trace of the error on a validation set (evals_result) -> weight features by validation error reduction per tree.
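A hedged sketch of that second idea (not xgbfi's actual implementation): parse the per-tree gains out of the text dump and credit each feature in tree t by how much that tree reduced the validation error stored in evals_result. The regex-based dump parsing and the one-tree-per-boosting-round assumption are mine:

```python
import re
import collections
import xgboost as xgb

# matches dump lines like "0:[f2<2.5] yes=1,no=2,missing=1,gain=112.9,cover=58"
SPLIT_RE = re.compile(r'\[([^<\]]+)<[^\]]+\].*?gain=([0-9.eE+-]+)')

def validation_weighted_gain(bst: xgb.Booster, valid_error: list) -> dict:
    """valid_error: per-round metric trace, e.g. evals_result['valid']['error'].
    Assumes one tree per boosting round (i.e. not multi-class)."""
    scores = collections.defaultdict(float)
    prev_err = None
    for t, tree in enumerate(bst.get_dump(with_stats=True)):
        if t >= len(valid_error):
            break
        # error reduction achieved by adding tree t (0 for the first tree)
        reduction = 0.0 if prev_err is None else max(prev_err - valid_error[t], 0.0)
        prev_err = valid_error[t]
        for feat, gain in SPLIT_RE.findall(tree):
            scores[feat] += float(gain) * reduction
    return dict(scores)
```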
@Far0n, does your dump file processing tool outperform the R processing in 'xgb.model.dt.tree.R'? I think so.
When I have more than 5M rows to process in the dump file, computing relative importance in R takes more time than model fitting itself (up to 5x), so I think a solution that does the work in C++ would be very interesting, not only for interactions but for univariate relative importance too.
@BlindApe Well, the runtime depends on a lot of parameters, like tree deepening and interaction depth for instance. 10M rows with full deepening for univariate relative importance takes around 2 minutes @ 2.5 GHz on a single CPU.
I've recently fitted a model with 9.9M rows in the dump file, and computing relative importance took nearly two hours in R (@ 3.0 GHz; in R this runs on a single CPU).
My tool is quite memory hungry for the sake of speed, using a lot of dynamic programming / memoization.
I just published an early version: https://github.com/Far0n/xgbfi
@tqchen @walterreade @Far0n I think I am able to reproduce the C# code in R for a future PR. However, I am not sure I understand the purpose :-)
Can you please explain what kind of analysis can be done with a ranking of the best interactions? Is it a way to do some feature selection? Was it linked to a specific Kaggle task? If so, can you provide me with a link?
Kind regards, Michael
@BlindApe have you benchmarked which part of the model parsing is slow on big models? I have done my tests on models with a few hundred trees and never had super slow parsing performance. If you want, you can share a huge model on some cloud service. You can send it to firstname [at] lastname.fr if you want (name below).
Kind regards, Michael Benesty
@pommedeterresautee Yes, you can use that for feature selection, for getting a better understanding of the data (especially in cases where the feature meanings are unknown), and to improve linear models by encoding the interactions reported by xgb.
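A purely illustrative example of the linear model case, assuming xgbfi flagged a two-way interaction between two hypothetical columns 'age' and 'income' (data and names are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'age': [23, 45, 31, 52],
                   'income': [30, 80, 52, 61]})   # in thousands
y = np.array([0, 1, 0, 1])

# product term for a numeric x numeric interaction; one-hot crosses would be
# the analogue for categorical features
df['age_x_income'] = df['age'] * df['income']

model = LogisticRegression(max_iter=1000)
model.fit(df[['age', 'income', 'age_x_income']], y)
```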
@Far0n thanks for the explanation! Do you have an XGB model to share that would be interesting to study with such a tool?
Kind regards, Michael
@pommedeterresautee Currently, I have only models from running kaggle competitions, which I can't share atm for obvious reasons and a ~10GB 50k trees model from Springleaf.
Maybe some words on the background: the python wrapper (which I use) only reports fscore as a feature importance metric, and this one fails, for instance, if you have differently scaled features. At the Springleaf competition there was some sort of ID column which was ranked quite high with respect to fscore, but it was a "useless" feature. So I started writing xgbfi to check other metrics. This ID feature also collected a high amount of gain, hence gain alone was also not suitable to reveal its "uselessness". If you have additional stats like average gain, expected gain or weighted fscore, there is a higher chance to mark these kinds of features as "useless", but it still needs some manual inspection. So I think the best would be to rank the features (and interactions) in terms of the performance on hold-out data, which could be integrated into xgboost's training algorithm.

Besides, I'm working on split value histograms, which could be reported from xgb as well. They would probably add some value for exploratory analyses.
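For illustration, such a split value histogram can already be approximated from the text dump. This is a rough sketch of the idea, not the proposed xgb-side feature (newer Python wrappers also expose a Booster.get_split_value_histogram helper for the same purpose); the feature name is a placeholder:

```python
import re
import numpy as np
import xgboost as xgb

def split_value_histogram(bst: xgb.Booster, feature: str, bins: int = 20):
    """Collect every threshold used to split on `feature` and bin them."""
    pattern = re.compile(r'\[' + re.escape(feature) + r'<([0-9.eE+-]+)\]')
    thresholds = []
    for tree in bst.get_dump():
        thresholds.extend(float(v) for v in pattern.findall(tree))
    return np.histogram(thresholds, bins=bins)  # (counts, bin_edges)
```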
@Far0n This is very interesting! Is your split function related to this post: http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html
If yes, just PR it to R package :-) https://github.com/dmlc/xgboost/pull/648
Regarding the best interaction, do you mean the included stats are not enough to do a correct analysis?
Kind regards, Michaël
> Regarding the best interaction, do you mean the included stats are not enough to do a correct analysis?

At least in some cases, when highly ranked features (regarding gain and/or fscore) are leading to overfitting.
@pommedeterresautee fyi, my thoughts about fscore (some hold for gain as well): http://tinyurl.com/q6ktt3k
It is an awesome feature, but no one has mentioned how to actually input the interaction features into xgboost.
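One common way to do it (a sketch, not an official xgboost mechanism) is to materialise the discovered interaction as an explicit column before building the DMatrix; feature names below are hypothetical:

```python
import pandas as pd
import xgboost as xgb

X = pd.DataFrame({'f1': [0.1, 0.7, 0.3], 'f2': [1.0, 0.2, 0.5]})
y = [0, 1, 0]

# encode a reported f1 x f2 interaction as an extra numeric column
X['f1_x_f2'] = X['f1'] * X['f2']

dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3},
                dtrain, num_boost_round=10)
```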
Is someone working on integrating this functionality into the source? It's extremely handy for feature engineering.
I threw together something similar: njwilson23.github.io/xgb-ngrams/xgb-ngrams.html
Just a toy - limited at the moment to what can comfortably be copy-pasted into a browser window (maybe a few tens of thousands of rows).
fyi: PR for xgbfi (C++): dmlc/xgboost#2846
Hello everyone,
I have written a small tool that extracts n-way feature interactions out of xgb-model dumps, and I'm wondering if that would be a useful feature to implement directly in xgboost (especially to circumvent the model dumping step). It works as follows:
cheers, Faron