New feature? Java binding for categorical feature support

shadyelgewily-slimstock commented 1 year ago

We use XGBoost through the Java binding (outside of Spark) and have a strong appetite for categorical feature support, where splits partition the categories of a feature into subsets, as opposed to one-hot encoding the feature and having XGBoost consider each category separately. The release notes for v1.6 state:

"In the future, we will continue to improve categorical data support with new features and optimizations. Also, we are looking forward to bringing the feature beyond Python binding, contributions and feedback are welcomed! Lastly, as a result of experimental status, the behavior might be subject to change, especially the default value of related hyper-parameters."

I'm raising this issue because I'm wondering what the status is of the Java binding for the experimental parameters related to categorical features. Concretely:

Is there already a way to tell the native C code which columns in the DMatrix should be treated as categorical and which as numeric?
Provided that we have some way to encode the feature type in the DMatrix or elsewhere, how do we communicate that to the C binding? (There has to be some way to achieve this, since the Python binding already exists.)
Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version within the next 3-6 months, say, provided we contribute a PR that satisfies the general requirements for an XGBoost PR?

It seems that some work has already been done on the first two items in (https://github.com/dmlc/xgboost/pull/7966), so perhaps the more general questions are:

Which components are still required to start using categorical features (based on subset partitioning) in Java?
How can we help get this feature into XGBoost faster (e.g., by contributing), provided that it is on the roadmap (https://github.com/dmlc/xgboost/issues/7802)?

I see that this feature request is on the roadmap, and we could contribute to help the process move forward.

trivialfis commented 1 year ago

> Is there already a way to communicate to the native C code which columns in the DMatrix should be considered as categorical, and which as numeric

Yes, as referenced in your description: https://github.com/dmlc/xgboost/pull/7966

> Is there an appetite at XGBoost maintainers to release such a Java binding in a stable version any time in the next 3-6 months say

Yes, that would be 2.0 if all goes well.

> Which components are still required to start using categorical features

For the Java interface, I think we can already get some small examples running, but we haven't been able to prioritize it yet. Setting the feature_type and using a supported tree_method are all it needs. However, my understanding is that most users prefer the Scala binding over the Java binding, and we still need to extend the feature info setter/getter to Scala and integrate it properly with the Spark estimator interface.
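
As a rough illustration of the two ingredients mentioned above (feature types plus a supported tree method), here is a minimal Java sketch. The helper class `CategoricalSetup` and its methods are hypothetical names introduced for this example. The type codes "q" (numeric) and "c" (categorical) follow the convention used by the Python binding, and the parameter values (`hist`, `max_cat_to_onehot`) are assumptions that should be checked against the XGBoost version in use. The xgboost4j calls are left as comments because they assume the setter added in the PR linked above is available; they have not been verified here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of xgboost4j.
final class CategoricalSetup {

    // Builds one type code per column: "c" for categorical, "q" for numeric.
    // Categorical columns are assumed to hold non-negative integer codes
    // stored as floats (0.0f, 1.0f, 2.0f, ...).
    static String[] featureTypes(int numFeatures, Set<Integer> categoricalColumns) {
        String[] types = new String[numFeatures];
        for (int i = 0; i < numFeatures; i++) {
            types[i] = categoricalColumns.contains(i) ? "c" : "q";
        }
        return types;
    }

    // Training parameters assumed to enable partition-based categorical splits.
    static Map<String, Object> trainingParams() {
        Map<String, Object> params = new HashMap<>();
        params.put("tree_method", "hist");      // assumed to support categorical data
        params.put("max_cat_to_onehot", 1);     // assumed: prefer partitioning over one-hot splits
        params.put("objective", "reg:squarederror");
        return params;
    }

    public static void main(String[] args) {
        String[] types = featureTypes(4, Set.of(1, 3));
        System.out.println(Arrays.toString(types)); // prints [q, c, q, c]

        // Assumed xgboost4j usage (requires the ml.dmlc.xgboost4j.java package):
        // DMatrix train = new DMatrix(data, nRows, 4, Float.NaN);
        // train.setLabel(labels);
        // train.setFeatureTypes(types);
        // Booster booster = XGBoost.train(train, trainingParams(), 10,
        //                                 new HashMap<>(), null, null);
    }
}
```

Keeping the type array and parameter map in plain Java makes them easy to unit test separately from the native library; only the commented-out lines depend on xgboost4j itself.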

wbo4958 commented 1 year ago

Please see this comment: https://github.com/dmlc/xgboost/issues/7802#issuecomment-1407828758

New feature? Java binding for categorical feature support #8727