dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Advanced Feature Importance (n-way interactions) #474

Closed Far0n closed 6 years ago

Far0n commented 9 years ago

Hello everyone,

I have written a small tool that extracts n-way feature interactions out of xgb-model dumps and I'm wondering if that would be a useful feature to be directly implemented in xgboost (especially to circumvent the model dumping step). It works as follows:

  1. it parses a model-dump file (dumped with with_stats=True) and collects all paths of length d (0 ≤ d ≤ max_interaction_depth) in each tree
  2. it sums the gains linked to each collected path
  3. it computes the relative gain for each path of length d
  4. it outputs max_interaction_depth + 1 lists, each sorted by relative gain (a rough sketch of these steps follows below)
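
A rough, self-contained Python sketch of the same idea, assuming the standard text dump produced by Booster.dump_model(..., with_stats=True). The crediting rule (which gains count towards a path) is one plausible reading of the steps above, not Far0n's actual C# implementation:

```python
import re
from collections import defaultdict

# matches split lines such as: 0:[f2<2.5] yes=1,no=2,missing=1,gain=100.5,cover=200
SPLIT_RE = re.compile(
    r"(\d+):\[([^<>\]]+).*?\] yes=(\d+),no=(\d+).*?gain=([-\d.eE+]+)")

def parse_trees(dump_path):
    """Yield one {node_id: (feature, gain, yes_id, no_id)} dict per tree."""
    tree = {}
    with open(dump_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("booster["):
                if tree:
                    yield tree
                tree = {}
                continue
            m = SPLIT_RE.match(line)
            if m:                        # leaf lines do not match and are skipped
                nid, feat, yes, no, gain = m.groups()
                tree[int(nid)] = (feat, float(gain), int(yes), int(no))
    if tree:
        yield tree

def interaction_gains(dump_path, max_interaction_depth=2):
    """Sum gain per sorted feature tuple, for interactions of up to
    max_interaction_depth + 1 features along a root-to-node path."""
    totals = defaultdict(float)

    def walk(tree, nid, path):
        if nid not in tree:              # node id not in dict -> leaf, stop
            return
        feat, gain, yes, no = tree[nid]
        path = path + [feat]
        # credit this split's gain to every interaction ending at this node
        for d in range(min(len(path), max_interaction_depth + 1)):
            key = tuple(sorted(set(path[-(d + 1):])))
            totals[key] += gain
        walk(tree, yes, path)
        walk(tree, no, path)

    for tree in parse_trees(dump_path):
        walk(tree, 0, [])
    return sorted(totals.items(), key=lambda kv: -kv[1])

# usage: bst.dump_model("model.dump", with_stats=True), then
# interaction_gains("model.dump", max_interaction_depth=2)
```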

cheers, Faron

BlindApe commented 9 years ago

This sounds interesting.

walterreade commented 9 years ago

Would love to take it for a test spin. Is the code available?

Far0n commented 9 years ago

Hi Inversion :) It's not public yet.

tqchen commented 9 years ago

Sounds interesting. Do you want to make a PR on this?

walterreade commented 9 years ago

@Far0n - maybe public in 37 days? :-)

Far0n commented 9 years ago

@walterreade - I don't know what you are talking about. :P

Far0n commented 9 years ago

It's currently written in C# and outputs an Excel file (don't blame me for that ^^). I will put it on my GitHub as soon as I'm done with bug fixing and code cleaning.

walterreade commented 9 years ago

Looking forward to it. Maybe I'll port it to Fortran. (Or maybe Python.) :-)

Far0n commented 9 years ago

Here is an example screenshot from the output:

[screenshot: xgbfi_example]

So far, I'm summing up gain & fscore and computing the ranks according to gain, fscore & gain divided by fscore. Is there any value in using cover somehow?

@tqchen Is the "xgb-gain" the pure information gain, the gain ratio, or something similar? Would it make sense to weight by tree index, i.e. favor features which accumulate gain in early rounds, or to normalize the paths in a tree by the TotalGain of the current tree?

Another thing I have in mind: Assume we have stored the trace of the error on a validation set (evals_result) -> weight features by validation error reduction per tree.
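
For reference, the totals discussed here (fscore, gain, and their ratio, i.e. average gain per split) can be pulled directly from a trained booster with the Python package; the 'total_gain' importance type was only added in a later release, so this sketch assumes a reasonably recent xgboost:

```python
import numpy as np
import xgboost as xgb

# toy data, just to have a trained booster to query
X = np.random.rand(500, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)
bst = xgb.train({"objective": "binary:logistic", "max_depth": 4},
                xgb.DMatrix(X, label=y), num_boost_round=30)

weight     = bst.get_score(importance_type="weight")      # fscore: number of splits
total_gain = bst.get_score(importance_type="total_gain")  # summed gain
avg_gain   = bst.get_score(importance_type="gain")        # total gain / number of splits
cover      = bst.get_score(importance_type="cover")       # average cover per split

# rank features by total gain, highest first
for feat in sorted(total_gain, key=total_gain.get, reverse=True):
    print(feat, total_gain[feat], weight.get(feat, 0),
          avg_gain.get(feat, 0.0), cover.get(feat, 0.0))
```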

BlindApe commented 9 years ago

@Far0n, does your dump-file processing tool outperform the R processing in 'xgb.model.dt.tree.R'? I think so.

When I have more than 5M rows to process in the dump file, computing relative importance in R takes more time than the model fitting itself (up to 5x), so I think a solution that does the work in C++ would be very interesting, not only for interactions but for univariate relative importance too.

Far0n commented 9 years ago

@BlindApe Well, the runtime depends on a lot of parameters, like tree depth and interaction depth for instance. 10M rows with full depth for univariate relative importance takes around 2 minutes @ 2.5 GHz on a single CPU.

BlindApe commented 9 years ago

I've recently fitted a model with 9.9M rows in the dump file, and relative importance took nearly two hours in R (@ 3.0 GHz; in R this runs on a single CPU).

Far0n commented 9 years ago

My tool is quite memory-hungry for the sake of speed, using a lot of dynamic programming / memoization to speed things up.

Far0n commented 9 years ago

I just published an early version: https://github.com/Far0n/xgbfi

pommedeterresautee commented 9 years ago

@tqchen @walterreade @Far0n I think I am able to reproduce the C# code in R for a future PR. However, I am not sure I understand the purpose :-)

Can you please explain what kind of analysis can be done with a ranking of the best interactions? Is it a way to do feature selection? Was it linked to a specific Kaggle task? If yes, can you provide me with a link?

Kind regards, Michael

pommedeterresautee commented 9 years ago

@BlindApe have you benchmarked which part of the model parsing is slow on big models? I have done my tests on models with a few hundred trees and never had super slow parsing performance. If you want, you can share a huge model on some cloud service. You can send it to firstname [at] lastname.fr if you want (name below).

Kind regards, Michael Benesty

Far0n commented 9 years ago

@pommedeterresautee Yes, you can use that for feature selection, to get a better understanding of the data (especially in cases where the feature meanings are unknown), and to improve linear models by encoding interactions reported by xgb.
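
As a concrete illustration of the last point: assuming xgbfi reported a strong two-way interaction between two columns, one simple encoding for a linear model is an explicit product feature (the data and column indices below are toy placeholders, not from any real competition):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data; in practice X is your real design matrix and (0, 1) are the
# column indices of the interacting features reported by xgbfi
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # target driven by an interaction

# a plain linear model cannot capture the multiplicative interaction
base = LogisticRegression().fit(X, y)

# append the product feature and refit
X_int = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
enriched = LogisticRegression().fit(X_int, y)

print("accuracy without interaction:", base.score(X, y))
print("accuracy with interaction:   ", enriched.score(X_int, y))
```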

pommedeterresautee commented 9 years ago

@Far0n thanks for the explanation! Do you have an XGB model to share that would be interesting to study with such a tool?

Kind regards, Michael

Far0n commented 9 years ago

@pommedeterresautee Currently, I only have models from ongoing Kaggle competitions, which I can't share atm for obvious reasons, and a ~10GB 50k-tree model from Springleaf.

Maybe some words on the background: the Python wrapper (which I use) only reports fscore as a feature importance metric, and this one fails, for instance, if you have differently scaled features. At the Springleaf competition there was some sort of ID column which was ranked quite high with respect to fscore, but it was a "useless" feature. So I started writing xgbfi to check other metrics. This ID feature also collected a high amount of gain, hence gain alone was also not suitable to reveal its "uselessness".

If you have additional stats like average gain, expected gain or weighted fscore, there is a higher chance of flagging these kinds of features as "useless", but it still needs some manual inspection. So I think the best would be to rank the features (and interactions) in terms of the performance on hold-out data, which could be integrated into XGBoost's training algorithm. Besides, I'm working on split value histograms, which could be reported by xgb as well. They would probably add some value for exploratory analyses.
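
Split value histograms are exposed by the Python wrapper in more recent xgboost releases via Booster.get_split_value_histogram; a minimal sketch, assuming such a release and using the default "f0" feature name of a numpy-built DMatrix:

```python
import numpy as np
import xgboost as xgb

# toy model, just to have split statistics to look at
X = np.random.rand(1000, 3)
y = (X[:, 0] > 0.6).astype(int)
bst = xgb.train({"objective": "binary:logistic", "max_depth": 4},
                xgb.DMatrix(X, label=y), num_boost_round=30)

# histogram of the split thresholds used for feature "f0" across all trees;
# a tight cluster of split values hints at a single dominant cut point
hist = bst.get_split_value_histogram("f0")
print(hist)
```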

pommedeterresautee commented 9 years ago

@Far0n This is very interesting! Is your split value histogram work related to this post: http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html

If yes, just PR it to R package :-) https://github.com/dmlc/xgboost/pull/648

Regarding the best interactions, do you mean the included stats are not enough to do a correct analysis?

Kind regards, Michaël

Far0n commented 9 years ago

> Regarding the best interactions, do you mean the included stats are not enough to do a correct analysis?

At least in some cases, yes: when highly ranked features (regarding gain and/or fscore) are leading to overfitting.

Far0n commented 8 years ago

@pommedeterresautee fyi, my thoughts about fscore (some hold for gain as well): http://tinyurl.com/q6ktt3k

ahmedahmedov commented 7 years ago

It is an awesome feature, but no one has mentioned how to actually input the interaction features into xgboost.

d4tum commented 7 years ago

Is someone working on integrating this functionality into the source? It's extremely handy for feature engineering.

njwilson23 commented 7 years ago

I threw together something similar: njwilson23.github.io/xgb-ngrams/xgb-ngrams.html

Just a toy - limited at the moment to what can comfortably be copy-pasted into a browser window (maybe a few tens of thousands of rows).

Far0n commented 7 years ago

fyi: PR for xgbfi (C++): dmlc/xgboost#2846