dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.12k stars 8.7k forks source link

Roadmap for new R interface #9810

Open david-cortes opened 9 months ago

david-cortes commented 9 months ago

ref https://github.com/dmlc/xgboost/issues/9734 ref https://github.com/dmlc/xgboost/issues/9475

This issue is intended as a roadmap tracker for progress in bringing xgboost's R interface up to date and discussions around these tasks and coordination.

From the previous tasks, here I've made a list of potential tasks to take on, but I might be missing some things, and I've put the biggest task (new xgboost() function) under a single bullet point while in practice it'll likely involve multiple rounds of PRs. Please feel free to add more tasks to this list.

I've taken the liberty of classifying these issues in terms of whether they'd be blockers for releasing a new xgboost version or not, albeit some people might disagree with my assessments.

david-cortes commented 9 months ago

@trivialfis From the previous thread, you mentioned you might be able to work on categorical feature support - would you be able to take on the first two tasks here?

@dfsnow You mentioned that you were willing to help in the earlier topic - would you be interested in taking on some of the issues here, particularly around DMatrix topics?

@jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?

@mayer79 Are you familiar with C++ and R's C interface? Would you be able to help with some of these topics?

mayer79 commented 9 months ago

@david-cortes: fantastic road map, thank you so much. Unfortunately, you have spotted my biggest weakness! For the C part, we might ask the data.table team. For the C++ part, Dirk Edelbüttel?

trivialfis commented 9 months ago

Let me handle the primitive support for data frame first. Categorical data can follow.

trivialfis commented 9 months ago

Let me handle the primitive support for data frame first. Categorical data can follow.

This is probably going to help with other interfaces as well. We need to have missing data for each column.

trivialfis commented 9 months ago

With the amount of custom C++ code in the R package, I think we need to set up CI tests with sanitizer for R (hopefully not Valgrind, which is slow).

david-cortes commented 9 months ago

Another task which doesn't require modifying any C/C++ functions (only .R files): currently, xgb.cv will error out with objective survival:aft. This is due to the function checking that the DMatrix object has label property, but this objective works instead with label_lower_bound and label_upper_bound.

@mayer79 would you be interested in contributing a fix?

mayer79 commented 9 months ago

Good idea. I even remember this issue from somewhere.

jameslamb commented 9 months ago

@jameslamb Would you be interested in taking on some task such as removing the handle class from the public interface?

Yes definitely!

But it will be about 1-2 weeks until I'm able to spend any time on it, as I'm focusing right now on trying to get {lightgbm} 4.x out to CRAN (and keeping {lightgbm} from being archived there 😬 ).

I'm also happy to help with reviews on any PRs here if you want, just @ me.

david-cortes commented 9 months ago

Since the current master branch now supports multi-quantile regression, I guess it's now time to update the example in the docs where it says

The feature is only supported using the Python package

... and maybe it'd be worth it to add an equivalent R example, if someone would like to take on this task.

trivialfis commented 9 months ago

@david-cortes Out of curiosity, do you want to become the CRAN maintainer after having the new interface (regardless of whether the two interfaces coexist)? At the moment, I'm maintaining the CRAN package but only doing the chores instead of having actual development, it would be great if there's a real expert can take over.

david-cortes commented 9 months ago

@david-cortes Out of curiosity, do you want to become the CRAN maintainer after having the new interface (regardless of whether the two interfaces coexist)? At the moment, I'm maintaining the CRAN package but only doing the chores instead of having actual development, it would be great if there's a real expert can take over.

Thanks for the offer, but I'll pass on it as I'm not certain that I will have the time for that sort of work in the future or the ability to address CRAN issues on time.

That being said, if you ever need help with some issue in the R interface in the future, or would like to me to review some PR, feel free to tag me there if needed.

trivialfis commented 9 months ago

Understood, thank you for the great progress on the R package!

trivialfis commented 8 months ago

Added:

Documentation and unified tests for 1-based indexing.

ref: https://github.com/dmlc/xgboost/pull/9935#issuecomment-1892616474

trivialfis commented 5 months ago

@david-cortes Hi, out of curiosity, how's everything going?

david-cortes commented 5 months ago

Fine, thank you. I'll be pausing work for a while, will probably resume later in May.

trivialfis commented 5 months ago

Good to know, thank you for the update!

david-cortes commented 1 month ago

By this point, we are ready with changes and extensions to the xgb.train interface - only remaining things for this interface are around documentation and bug fixes like https://github.com/dmlc/xgboost/issues/9925

I guess it's now time to start thinking about next steps for a CRAN release, particularly in working with maintainers of reverse dependencies to make the necessary changes and be informed in advanced (as per CRAN policies) about the breaking changes. Not sure how to approach this though.

As for the new xgboost() interface, there is one rather straightforward stream of work which doesn't require much knowledge about either R, C++ or XGBoost in case @jameslamb or @mayer79 or @trivialfis wants to take over: the current ... needs to be replaced with named and documented arguments from all the parameters that xgboost accepts, like in the scikit-learn interface.

These would need to be copy-pasted from the docs about parameters and formatted to render well with roxygen: https://xgboost.readthedocs.io/en/stable/parameter.html

Ideally, there could also be a function xgb.control that would take the exact same arguments (docs for the same arguments for both xgboost() and xgb.control() can be shared through roxygen tags) with everything NULL by default, that would produce a list that could be fed to the params argument in xgb.train, but this is not strictly necessary for the xgboost() interface part.