dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.1k stars 8.7k forks

New feature? Java binding for categorical feature support #8727

Open shadyelgewily-slimstock opened 1 year ago

shadyelgewily-slimstock commented 1 year ago

We use XGBoost through the Java binding (outside of Spark) and have a strong appetite for categorical feature support, where splits are considered as subset partitions of the categorical feature, as opposed to one-hot encoding, where XGBoost considers each category separately. The release notes for v1.6 state:

"In the future, we will continue to improve categorical data support with new features and optimizations. Also, we are looking forward to bringing the feature beyond Python binding, contributions and feedback are welcomed! Lastly, as a result of experimental status, the behavior might be subject to change, especially the default value of related hyper-parameters."
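To make the distinction between one-hot splits and subset partitioning concrete, here is a minimal, self-contained sketch. It is not XGBoost's actual split-finding code: the class name, the brute-force enumeration, and the squared-error score are all assumptions chosen for illustration. It only shows that a subset partition can separate groups of categories that no single one-vs-rest split can.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of XGBoost or XGBoost4J.
public class CategoricalSplitSketch {

    // One-hot style candidates: each split sends exactly one category left.
    public static List<Integer> oneVsRestSplits(int numCategories) {
        List<Integer> masks = new ArrayList<>();
        for (int c = 0; c < numCategories; c++) {
            masks.add(1 << c); // bit c set means category c goes left
        }
        return masks;
    }

    // Partition style candidates: every split of the categories into two
    // non-empty groups. Category 0 is pinned to the right group so that
    // mirrored partitions are not counted twice.
    public static List<Integer> subsetSplits(int numCategories) {
        List<Integer> masks = new ArrayList<>();
        int limit = 1 << (numCategories - 1);
        for (int m = 1; m < limit; m++) {
            masks.add(m << 1); // bit 0 is always clear
        }
        return masks;
    }

    // Score of a split under squared error: sumL^2/nL + sumR^2/nR.
    // Higher is better (it equals the reduction in squared error up to a constant).
    public static double score(int mask, double[] targetSum, int[] count) {
        double sumL = 0, sumR = 0;
        int nL = 0, nR = 0;
        for (int c = 0; c < count.length; c++) {
            if ((mask & (1 << c)) != 0) {
                sumL += targetSum[c];
                nL += count[c];
            } else {
                sumR += targetSum[c];
                nR += count[c];
            }
        }
        double left = nL > 0 ? sumL * sumL / nL : 0.0;
        double right = nR > 0 ? sumR * sumR / nR : 0.0;
        return left + right;
    }

    // Best score over a list of candidate splits.
    public static double bestScore(List<Integer> masks, double[] targetSum, int[] count) {
        double best = Double.NEGATIVE_INFINITY;
        for (int mask : masks) {
            best = Math.max(best, score(mask, targetSum, count));
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up data: categories 0 and 2 have high targets, 1 and 3 low.
        double[] targetSum = {10.0, 0.0, 10.0, 0.0};
        int[] count = {10, 10, 10, 10};
        System.out.println("one-vs-rest best: "
                + bestScore(oneVsRestSplits(4), targetSum, count));
        System.out.println("subset best:      "
                + bestScore(subsetSplits(4), targetSum, count));
    }
}
```

With k categories, the one-vs-rest list has k candidates while the subset list has 2^(k-1) - 1, which is why a practical implementation would need a heuristic rather than this brute-force enumeration.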

I'm raising this issue because I'm wondering what the status is of the Java binding for the experimental parameters related to categorical features. Concretely:

It seems that some work has already been done on the first two items in (https://github.com/dmlc/xgboost/pull/7966), so perhaps the more general question is:

I see that this feature request is on the roadmap, and we could contribute to help the process move forward.

trivialfis commented 1 year ago

"Is there already a way to communicate to the native C code which columns in the DMatrix should be considered as categorical, and which as numeric?"

Yes, as referenced in your description: https://github.com/dmlc/xgboost/pull/7966

"Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version any time in the next 3-6 months, say?"

Yes, that would be 2.0 if all goes well.

"Which components are still required to start using categorical features?"

For the Java interface, I think we can already get some small examples running, but we haven't been able to prioritize it yet. The feature_type and a supported tree_method are all it needs. However, my understanding is that most users prefer the Scala binding over the Java binding, so we need to extend the feature info setter/getter to Scala and integrate it properly with the Spark estimator interface.
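As a rough sketch of the two ingredients mentioned above, the snippet below builds a per-column feature type array and a parameter map using only the Java standard library. The class and method names are hypothetical. The type codes "c" (categorical) and "q" (quantitative) and the parameters tree_method and max_cat_to_onehot follow my reading of the XGBoost documentation. Handing the array to the feature type setter added in https://github.com/dmlc/xgboost/pull/7966 is an assumption and is deliberately left out of the code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of XGBoost4J.
public class FeatureTypeSketch {

    // Build one type code per column: "c" for categorical, "q" for numeric.
    public static String[] featureTypes(int numFeatures, Set<Integer> categoricalColumns) {
        String[] types = new String[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            types[i] = categoricalColumns.contains(i) ? "c" : "q";
        }
        return types;
    }

    // Training parameters for a tree method that supports categorical splits.
    public static Map<String, Object> trainingParams() {
        Map<String, Object> params = new HashMap<>();
        // "hist" is assumed here as one of the supported tree methods.
        params.put("tree_method", "hist");
        // Features with fewer categories than this threshold use one-hot
        // splits; a value of 1 therefore asks for partition-based splits.
        params.put("max_cat_to_onehot", 1);
        return params;
    }

    public static void main(String[] args) {
        // Made-up layout: columns 1 and 3 of a 4-column matrix are categorical.
        Set<Integer> categorical = new HashSet<>(Arrays.asList(1, 3));
        String[] types = featureTypes(4, categorical);
        System.out.println(Arrays.toString(types));
        System.out.println(trainingParams());
        // Assumption, not executed here: the types array would be passed to
        // the DMatrix feature type setter, and the map to the training call.
    }
}
```

Categorical columns are expected to hold integer category codes rather than strings, so any string-to-code mapping would have to happen before the matrix is built.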

wbo4958 commented 1 year ago

Please see this comment: https://github.com/dmlc/xgboost/issues/7802#issuecomment-1407828758