dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Feature importance in R #123

Closed tqchen closed 9 years ago

tqchen commented 9 years ago

Since the feature importance function was discussed in the cross-validation thread #114, it seems worth opening an independent thread for merging the feature importance function. Let us move new discussion and enhancements related to feature importance to this thread.

@pommedeterresautee @hetong007

pommedeterresautee commented 9 years ago

Regarding the example data, in the file you pointed me to there are 126 feature names, but the data included in the file has 127 features... close, but not OK.

> str(agaricus.test)
List of 2
 $ label: num [1:1611] 0 1 0 0 0 0 1 0 1 0 ...
 $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:35442] 0 2 7 11 13 16 20 22 27 31 ...
  .. ..@ p       : int [1:128] 0 83 84 806 1419 1603 1611 2064 2064 2701 ...
  .. ..@ Dim     : int [1:2] 1611 127
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : NULL
  .. ..@ x       : num [1:35442] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()

Is there a column you remember having added? After converting the sparse matrix to a dense one, I noticed the last column is always 0.

test <- as.matrix(agaricus.test$data)
> sum(test[,127])
[1] 0

I may add a name such as "Intercept" to the name list, or remove the last column, it's up to you! (with great power comes great responsibility :-) )

Kind regards, Michaël

Edit: FYI, adding a new name to the list of feature names works very well, even without understanding the meaning of the last matrix column or whether it has any use :-) (just tried it)

tqchen commented 9 years ago

I think it is because we built the matrix incorrectly; there are indeed only 126 features in the original dataset. So let us remove the last column.

Thanks!
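
In case a concrete snippet helps, dropping that trailing empty column can be done like this (a minimal sketch based on the str() output above, not code from the package):

library(Matrix)
data(agaricus.test, package = 'xgboost')
X <- agaricus.test$data
# the last column carries no data, as checked above
stopifnot(sum(X[, ncol(X)]) == 0)
# keep only the 126 real features
X <- X[, -ncol(X)]
dim(X)  # expected: 1611 126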

pommedeterresautee commented 9 years ago

In the pull request I have regenerated the documentation for every method. It seems that a few of them now have a new parameter called "missing". I don't know what it is. Let me know if you want me to remove the references to it in the documentation.

tqchen commented 9 years ago

Thanks, it is now merged into the master branch! @hetong007 let us add documentation for the missing parameter.
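
For context, the missing argument tells xgboost which value in the input data should be treated as missing (NA by default in the R interface). A minimal sketch of how it can be passed, assuming the documented behaviour rather than anything stated in this thread:

library(xgboost)
data(agaricus.train, package = 'xgboost')
# entries equal to `missing` are treated as absent values by the booster
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               missing = NA, max.depth = 2, eta = 1, nround = 2,
               objective = 'binary:logistic')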

pommedeterresautee commented 9 years ago

In his book, Kevin Murphy wrote a paragraph about the interpretation of ensemble models. He mentions counting the number of times a feature has been used (as implemented in the function merged today) and gives a pointer to a paper, Using Sensitivity Analysis and Visualization Techniques to Open Black Box Data Mining.

I have read a few blog articles about this method. It seems interesting. I don't know how hard it is to implement (it seems not that hard), but it may be a bit far from the initial purpose of your algorithm. There are already existing libraries that do this in Python and R. Maybe making those libraries compatible with your program is a better option than implementing everything yourself.

Finally, I found this paper. It proposes a new method to determine variable importance in a boosted tree model. What is interesting is not the new method itself but the performance comparison with a much simpler method. Basically the simple method has very good performance and the new one is only a small improvement over it. The simple method is a bit more complex than what we are doing now: it uses the contribution of each feature in each tree, then averages it over all trees. To implement that we need the contribution information in the model text dump, which is not yet the case (if I understand the dump correctly). I think that if we use the contribution, the issue I noticed before (when there are many deep trees in the model, unimportant features make the counting method we are using less reliable) should disappear.

Would it be possible to have this information easily?

Kind regards, Michael

tqchen commented 9 years ago

OK, you have it now in the master branch. Try dumping with with.stats=TRUE.

tqchen commented 9 years ago

It would be great if you could contribute a short walkthrough in https://github.com/tqchen/xgboost/tree/master/R-package/demo

This might also be the right place to demonstrate feature engineering example code.

tqchen commented 9 years ago

You may need to re-install the package, because the C source code changed. On my computer the parameter works.

Tianqi

On Mon, Dec 29, 2014 at 8:27 PM, Michaël B notifications@github.com wrote:

@tqchen https://github.com/tqchen I just tried the new option which doesn't seem to work.

> library(xgboost)
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
+                eta = 1, nround = 2, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
> xgb.dump(bst, 'xgb.model.dump', with.stats = T)
[1] TRUE

Produces :

booster[0]:
0:[f28<1.00001] yes=1,no=2,missing=2
    1:[f108<1.00001] yes=3,no=4,missing=4
        3:leaf=1.85965
        4:leaf=-1.94071
    2:[f55<1.00001] yes=5,no=6,missing=6
        5:leaf=-1.70044
        6:leaf=1.71218
booster[1]:
0:[f59<1.00001] yes=1,no=2,missing=2
    1:leaf=-6.23624
    2:[f28<1.00001] yes=3,no=4,missing=4
        3:leaf=-0.96853
        4:leaf=0.784718

This is exactly the same file as the one generated with with.stats = F. I tried to follow the code calls, and until it reaches the C part I see nothing special.

Do you know what I am missing?

Regards, Michaël

pommedeterresautee commented 9 years ago

@tqchen Yeah, I removed my comment just after posting it because it was exactly the error you described: the DLL had not been replaced after the package compilation. Windows is so funny sometimes. I am now checking the results to see whether they match my manual analysis of my own dataset. Sorry for the false bug report!

Regarding the results it produces, I am wondering what is the best way to rank features. I am thinking of weighting the gain by the cover, summing the result per feature and then converting to percentages to make it easier to interpret. Another idea is to not use the cover at all: just sum the gain per feature and convert to percentages.

Do both methods make sense to you?

Kind regards, Michaël

tqchen commented 9 years ago

I think just using gain is sufficient. The gain is already scaled by the cover, so we do not need to multiply by it again.

Tianqi
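
To make that concrete, here is a rough sketch of summing the per-split gain by feature from a with.stats dump and converting to percentages (illustration only, written against the dump format shown later in this thread, not the package's actual importance function):

# dump produced earlier by xgb.dump(bst, 'xgb.model.dump', with.stats = TRUE)
lines  <- readLines('xgb.model.dump')
splits <- grep('\\[f', lines, value = TRUE)                        # keep split lines only
feat   <- sub('.*\\[(f[0-9]+)<.*', '\\1', splits)                  # feature used at each split
gain   <- as.numeric(sub('.*gain=([-0-9eE+.]+).*', '\\1', splits))
imp    <- sort(tapply(gain, feat, sum), decreasing = TRUE)
round(100 * imp / sum(imp), 2)                                     # gain share per feature (%)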

pommedeterresautee commented 9 years ago

Is there anything else to do on variable importance? If not, you can decrease the number of open issues by 1 :-)

BTW, is it OK with you if I bring the linear model text dump from the unity branch to the master branch?

Kind Regards, Michaël

tqchen commented 9 years ago

I think this is now in great shape:) Thanks!

The unity branch will be merged into master in the near future. But yes I think it is OK to bring the linear model dump to master.

pommedeterresautee commented 9 years ago

Just for aesthetics, and to make parsing a little easier, would it be possible to add a comma in these places (text format tree dump):

booster[0]:
0:[f0<1.00001] yes=1,no=2,missing=2 gain=9.00675,cover=21
    1:[f3<62.5] yes=3,no=4,missing=3 gain=0.588164,cover=10.75
        3:leaf=-1.36842cover=8.5

Between missing and gain, and between leaf and cover.

I am trying to write a tree plot function, and I am wondering what the metric attached to the word leaf means (is it homogeneity in the branch? entropy? Gini?...). I am asking because I am not sure it makes sense to include the leaves in the tree plot. BTW, I will use this project https://github.com/knsv/mermaid for that purpose.

Moreover, when there are several trees, is there a way to represent all the trees as one big tree? In a random forest, I understand that all trees vote together, so all trees are necessary. In boosted trees I am not sure I understand how the vote is done between the trees. I would prefer having one big tree rather than several small ones (in the plot); it would be easier to read.

tqchen commented 9 years ago

Master is now updated to add the commas you mentioned.

The number before leaf or before a split condition is the tree node index. So the root node is 0, the yes branch goes to node 1, and the yes branch of node 1 goes to node 3.
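
For instance, reading the dump quoted above: in booster[0], node 0 splits on [f28<1.00001], its yes branch is node 1 ([f108<1.00001]), and the yes branch of node 1 is node 3, leaf=1.85965.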

pommedeterresautee commented 9 years ago

The way to follow a path in the tree is one of the very few things I have understood by myself :) What I am wondering about is the metric included in the leaf lines. What does it represent? I understand it is the end of a branch, so I imagine it is a metric of the data that falls into this branch, but what is its real meaning? Moreover, there is a link between the branches, but is there a link between the trees? (I have edited my previous question.)

Kind regards, Michaël

tqchen commented 9 years ago

For a leaf branch, the only meaningful metric is the cover. Cover is defined as the sum of the second-order gradients of the training data classified to the leaf; if the objective is square loss, this simply corresponds to the number of instances in that branch (that is why it is named cover).

In terms of prediction, there is no big difference between RF and boosted trees; they are both ensemble tree models. Boosted trees only differ from RF in how they are trained, so they should have the same model representation.

You may be interested in checking out http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf for the conceptual separation of model, objective and training.
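
A compact way to write those two points, using the usual gradient boosting notation (my own summary, not copied from the slides):

cover(j) = \sum_{i \in I_j} h_i, where h_i = \partial^2 l(y_i, \hat{y}_i) / \partial \hat{y}_i^2 is the second-order gradient of instance i and I_j is the set of training instances reaching leaf j.
For square loss l = (y_i - \hat{y}_i)^2 / 2, every h_i = 1, so cover(j) = |I_j|, the number of instances in the leaf.
Prediction: \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), i.e. the outputs of the K trees are simply summed; there is no voting.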

pommedeterresautee commented 9 years ago

Thanks for all this information. I have written the code to convert this:

booster[0]:
0:[f0<1.00001] yes=1,no=2,missing=2 gain=9.00675,cover=21
    1:[f3<62.5] yes=3,no=4,missing=3 gain=0.588164,cover=10.75
        3:leaf=-1.36842cover=8.5
        4:[f3<65.5] yes=7,no=8,missing=7 gain=0.307692,cover=2.25
            7:leaf=-0cover=1
            8:leaf=-0.666667cover=1.25
    2:[f3<39] yes=5,no=6,missing=5 gain=3.19787,cover=10.25
        5:leaf=-0.909091cover=1.75
        6:[f3<61.5] yes=9,no=10,missing=9 gain=5.62929,cover=8.5
            9:[f3<51.5] yes=11,no=12,missing=11 gain=0.405797,cover=4.75
                11:leaf=0.222222cover=1.25
                12:leaf=1.11111cover=3.5
            10:[f2<1.00001] yes=13,no=14,missing=14 gain=0.361134,cover=3.75
                13:leaf=-0.8cover=1.5
                14:[f3<67.5] yes=15,no=16,missing=15 gain=1.42308,cover=2.25
                    15:leaf=0.5cover=1
                    16:leaf=-0.666667cover=1.25

To this plot (attached image: rplot01).

I still need to wait for a package to be pushed to CRAN before pushing the plot function to the xgboost master. Moreover, I have a few questions (not all important):

Of course, I still need to replace the feature IDs by real values and add the cover to the leaves.

Not related: it is fascinating to look at the plotted model, so interesting. In the example above, feature importance showed that two features were important, the age of the patient (very important) and placebo (just important). Here we can see that the first one used is placebo and not age (which is used almost everywhere else). In some way, as a binary value, it is not surprising that using placebo first provides more gain than splitting on age, but it still gives a lot of understanding of the model.

Edit: after testing, it appears that plotting 10 trees with a depth of 10 takes around 10-15 seconds. I wouldn't try 1,000 trees... (1 tree takes no time... the complexity doesn't seem to be linear).

tqchen commented 9 years ago

If a feature map is provided during dump and it contains the type of each feature, then the dumped result will not contain the split value for categorical features. See dump.nice.txt after you run demo/binary_classification.
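
For reference, the feature map in that demo is a plain text file with one line per feature: index, name, and type (i = indicator/binary, q = quantity, int = integer). A small hand-written excerpt in that spirit (names are illustrative, not copied from the demo), which can then be passed when dumping (argument name assumed, see ?xgb.dump):

0   cap-shape=bell       i
1   cap-shape=conical    i
2   odor=almond          i
3   ring-number          int

xgb.dump(bst, 'dump.nice.txt', fmap = 'featmap.txt', with.stats = TRUE)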

tqchen commented 9 years ago

Since you mention visualization of trees, you might be interested in this:

https://github.com/CSE512-14W/fp-tianyizh-tqchen-luheng

This was a course project I did that visualizes the tree growing process and lets you interact with it. If you have a Linux machine, you can check out the project and play with it. The current version of xgboost no longer supports the protocol due to refactoring, so you will want to change the installation script to use the legacy release:

wget https://github.com/tqchen/xgboost/archive/v0.21.tar.gz
tar xf v0.21.tar.gz
mv xgboost-0.21 xgboost
cd xgboost; make; cd ..
./startserver.sh

pommedeterresautee commented 9 years ago

AWESOME!

From my experience at work, data visualization has a big impact on top management and clients (meaning -> people with $$$ to allocate to projects), much more impact than a 10% improvement in the accuracy of a model. And if the growing is shown live, it would be even better. I will give your tool a try tonight (at work there is only Windows...) to see what it looks like.

In R there is this server tool: http://shiny.rstudio.com/gallery/ . I have seen plenty of very good things done with it, and all users say it is easy to use. In some way, if the current implementation of the algorithm gives me a way to access the trees during growing (through R), I should be able to come up with an implementation in Shiny. From what I understand of this document, it would be less than what you did, but it could still be improved in the future.

Do you think it would be difficult to have a function showing me the model dump during the growing (in R)?

Kind regards, Michaël

tqchen commented 9 years ago

The original tool was done using D3 for the visualization and a CGI script. I think a lot of existing visualization frameworks are based on D3. What we did for multiple trees was to collapse the trees when there are too many of them and allow the user to click and see any specific tree.

pommedeterresautee commented 9 years ago

@tqchen I think this feature is correctly implemented for now. If you think it is, don't hesitate to close this issue :)

FYI, in the coming weeks (when work is a bit less intensive) I will add to xgboost a feature importance plot function that I am using at work. It is just sorted horizontal bars (it reads very well). I have added clustering of the features (based on gain) for the bar colors, so you can easily see groups of important features. If you have other ideas, don't hesitate to share.
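
To give an idea of the kind of plot I mean, a rough sketch with placeholder feature names and made-up gain values, using plain kmeans for the clustering (the function I will push may differ):

library(ggplot2)
# placeholder importance table (illustrative numbers only)
imp <- data.frame(Feature = c('odor=none', 'stalk-root=club', 'spore-print-color=green', 'cap-shape=bell'),
                  Gain    = c(0.55, 0.25, 0.15, 0.05))
# cluster the features on their gain so similar bars share a colour
imp$Cluster <- factor(kmeans(imp$Gain, centers = 2)$cluster)
ggplot(imp, aes(x = reorder(Feature, Gain), y = Gain, fill = Cluster)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  xlab('Feature') + ylab('Gain')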

tqchen commented 9 years ago

Thanks @pommedeterresautee for the great job! This is a nice addition to the package