dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.14k stars · 8.71k forks

XGBoost get_dump missing information for multiclass classifiers #6623

Open Maimonator opened 3 years ago

Maimonator commented 3 years ago

Hey there! Quoting from the comment here:

To determine prediction for a multi-class classifier, we divide the trees into C groups (C = number of classes) and compute the partial sum of outputs for each group.

Note that each class has its own trees, but there is currently no way to associate a tree with its group. The information has to be available somewhere, otherwise prediction wouldn't work at all, but it isn't exported when dumping the model. I took a look, and it seems the information is kept under GBTreeModel in the tree_info member. It is saved when calling SaveModel, but when calling DumpModel the only output is the trees themselves, not the group each tree is associated with.

I'd love to hear your input on this. Thanks and I really appreciate your work!
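To make the grouping concrete, here is a minimal sketch (plain Python, not XGBoost internals) of the scheme the quoted comment describes: with C classes laid out round-robin, tree i belongs to group i % C, and the raw margin for class c is the sum of the outputs of the trees in group c. The function name and inputs are illustrative, not part of any API.

```python
def multiclass_margins(tree_outputs, num_class):
    """Sum per-tree leaf outputs into one raw margin per class.

    tree_outputs: the output of each tree for a single sample, in tree order.
    Assumes the round-robin grouping tree_info[i] == i % num_class.
    """
    margins = [0.0] * num_class
    for i, out in enumerate(tree_outputs):
        margins[i % num_class] += out
    return margins

# 6 trees, 3 classes: trees 0 and 3 -> class 0; 1 and 4 -> class 1; 2 and 5 -> class 2
print(multiclass_margins([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
# [5.0, 7.0, 9.0]
```

Without the tree_info mapping, a consumer of get_dump cannot reconstruct which trees to sum for which class, which is exactly the gap this issue points out.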

Maimonator commented 3 years ago

Also I would love to create a PR if this seems worthy :)

trivialfis commented 3 years ago

Indeed, it's missing from the model dump. I'm not sure how to inject this information into the dump format; feel free to share your opinion.

On the other hand, the information is saved in JSON model format, which can be obtained by classifier.save_model("model.json").
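For anyone hitting this in the meantime, the workaround above can be sketched as follows. The snippet parses a trimmed stand-in for the JSON produced by classifier.save_model("model.json"); the key path shown matches the XGBoost JSON model layout as I understand it, but treat it as an assumption and check it against your own saved file.

```python
import json

# Trimmed stand-in for a saved multiclass model; in the real file the tree
# list sits at the same path, alongside "tree_info" (one class id per tree).
model_json = '''
{
  "learner": {
    "gradient_booster": {
      "model": {
        "tree_info": [0, 1, 2, 0, 1, 2]
      }
    }
  }
}
'''

model = json.loads(model_json)
tree_info = model["learner"]["gradient_booster"]["model"]["tree_info"]
print(tree_info)  # [0, 1, 2, 0, 1, 2] -- tree i belongs to class tree_info[i]
```

Pairing this list with the output of get_dump (which returns trees in the same order) recovers the tree-to-class association.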

Maimonator commented 3 years ago

Two possible solutions I can think of:

  1. Injecting it into the JSON as a 'class' member on the root node. This would make the root node look different from other nodes, but it wouldn't break compatibility for users who have no use for that member.
  2. Having get_dump return an additional value containing tree_info, so usage would look like trees, tree_info = classifier.get_dump(). This would also make it easier to support other formats.

I prefer the 2nd option, but I'm not sure what your policy is regarding backward compatibility. We could also add an optional tree_info parameter taking a list and fill that list in a C-style fashion:

tree_info = [] # to be filled with tree info
trees = classifier.get_dump(dump_format="json", tree_info=tree_info)

This wouldn't break any compatibility, but isn't as intuitive.

Maimonator commented 3 years ago

@trivialfis WDYT?

trivialfis commented 3 years ago

Sorry for the late reply. @hcho3 Would you like to help take a look? This will help beyond multi-class classification, since I also want to add multi-target regression.

Both options from @Maimonator look fine to me.

Maimonator commented 3 years ago

Ok I'll probably open a PR then in the following days :) Thanks!

trivialfis commented 3 years ago

I believe the model dump format is inherited from sklearn; xgboost didn't invent it. I don't use the model dump very often, so I might not be the best source of advice, but it might help to take a look at sklearn.

hcho3 commented 3 years ago

I prefer Option 1. There are a couple of packages that rely on the tree dump (dtreeviz, shap), so it's best not to break backward compatibility.
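Option 1 could be consumed along these lines. The "class" member on the dumped root node is the hypothetical field being proposed here, not something current XGBoost emits; the point of the sketch is that parsers unaware of the extra key simply ignore it, so backward compatibility holds.

```python
import json
from collections import defaultdict

# Toy per-tree JSON dumps, each root carrying the proposed "class" member.
dumped_trees = [
    '{"class": 0, "nodeid": 0, "leaf": 0.5}',
    '{"class": 1, "nodeid": 0, "leaf": -0.25}',
    '{"class": 0, "nodeid": 0, "leaf": 0.125}',
]

# Group trees by their class; default to 0 when the key is absent,
# so dumps from binary/regression models parse the same way.
trees_by_class = defaultdict(list)
for dump in dumped_trees:
    root = json.loads(dump)
    trees_by_class[root.get("class", 0)].append(root)

print(sorted(trees_by_class))  # [0, 1]
print(len(trees_by_class[0]))  # 2 trees belong to class 0
```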

Maimonator commented 3 years ago

Hey, sorry for the late reply; I've been pretty busy at work. @hcho3, what about an optional parameter for get_dump?

tree_info = [] # to be filled with tree info
trees = classifier.get_dump(dump_format="json", tree_info=tree_info)

trivialfis commented 3 years ago

Thanks for following up on the discussion. @Maimonator, I think for "pythonic" code it's best to avoid mutating inputs.
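A non-mutating alternative might look like the sketch below: a hypothetical helper (not the real xgboost API) that returns the grouping alongside the dump, gated behind an opt-in flag so existing callers see no change in the return type.

```python
def get_dump_compat(trees, tree_info, with_tree_info=False):
    """Return trees alone by default, or (trees, tree_info) when requested."""
    if with_tree_info:
        return list(trees), list(tree_info)
    return list(trees)

# Existing callers are unaffected:
trees = get_dump_compat(["t0", "t1"], [0, 1])

# New callers opt in and get both values back, with no input mutation:
trees, info = get_dump_compat(["t0", "t1"], [0, 1], with_tree_info=True)
print(info)  # [0, 1]
```

The opt-in flag trades a slightly awkward conditional return type for full backward compatibility, which matches the concern raised earlier about packages that consume the dump.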