dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.35k stars 8.73k forks source link

[Call for help] Bringing the R package up-to-date. #9475

Open trivialfis opened 1 year ago

trivialfis commented 1 year ago

We are maintaining the R package with the best effort, but due to resource limitations, the R package feature has been lagging behind compared to the Python package. Here is a list of things that can be improved. Contributions are appreciated!

trivialfis commented 1 year ago

As a side note, at the moment I'm the maintainer of the CRAN package. If there are contributors who want to share this I'm more than happy to assist and change the maintainership. I have some familiarity with R for statistical computing, but it's not my daily drive and it's unlikely that I will gain a deep understanding of the community in near future.

jameslamb commented 1 year ago

👋🏻 I'd be happy to help out more with the R package, if you're open to it. I'll try some more small PRs to get familiar with XGBoost generally and the R package specifically before addressing the specific features you've listed.

trivialfis commented 1 year ago

@jameslamb Thank you for working on the R package! Feel free to ping me if there's anything I can help.

dfsnow commented 1 year ago

@trivialfis I'm interested in tackling the first issue: adding categorical support to the existing R implementation. I should be able to get it done within the next month or two. Are you open to me creating a draft PR in a fork and pinging you + @jameslamb when it's ready?

trivialfis commented 1 year ago

Thank you for volunteering! Feel free to ping me.

mayer79 commented 1 year ago

Great initiative, thanks a lot @trivialfis :

  1. If @david-cortes has a little bit of spare time, he would be our man here. I can also help, but need a little bit of time to get used to the code base.
  2. I'd love to see multi-target support in R! Same for the multi-quantile loss.
  3. Do we already have SHAP values for the multi-target (and multi-quantile) situation (not only in R, but in XGBoost in general)?
david-cortes commented 1 year ago

Sure I could help if needed.

But I think a first step would be to decide on what the final result should look like, how far are the changes meant to be, and whether we're going to hold off the R release until all changes are complete (e.g. by keeping a separate branch) or introduce them gradually alongside with releases; or maybe even release under a different name like xgboost2.

Right now, xgboost works very differently from base R's stats::glm, and from most other R packages for statistical modeling, whether tree-based (e.g. ranger, randomForest, gbm3, etc.) or linear-based (e.g. VGAM, glmnet, glmx, ncvreg, grpreg, etc.); both when fitting models and when making predictions. Using xgboost in R without any wrapper is quite inconvenient and requires a lot more code than other packages.

Changing the interface to be more idiomatic would be nicer for users, and I think this major version jump is the best chance to introduce big changes as there probably won't be any next major release any time soon, if ever.

However, there's a lot of packages that depend on xgboost in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.

So I'd say let's first clarify on what's acceptable to change and what isn't.

For example, off the top of my mind, I can think of the following un-idiomatic behaviors that could be improved:

Some of these would involve large breakage in reverse dependencies, while others wouldn't, but I think the biggest gains here are precisely the ones that would cause breakage.

hcho3 commented 1 year ago

there's a lot of packages that depend on xgboost in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.

I am personally in favor of creating {xgboost2} to avoid this very issue. Already, XGBoost 2.0 breaks multiple reverse dependencies and hence it's not yet available on CRAN.

mayer79 commented 1 year ago

@hcho3: I wanted to propose {xgboost2} as well, but did not dare :-)

trivialfis commented 1 year ago

I'm happy with xgboost2. Should it be a different package or can live side by side with the existing one?

hcho3 commented 1 year ago

@trivialfis We can host the R code in this repository, but we'd submit the code as {xgboost2} to CRAN. CRAN will still have the old XGBoost as {xgboost}.

trivialfis commented 1 year ago

but we'd submit the code as {xgboost2} to CRAN. CRAN will still have the old XGBoost as {xgboost}.

I will leave that for the actual implementation to decide. I'm fine with having a different CRAN package or continue to use the existing one with a new interface.

hcho3 commented 1 year ago

I see. So we could define new functions suffixed with Ex or ng or something like that, if we want to keep using {xgboost} CRAN package. My personal preference is to create new CRAN package {xgboost2}, for three reasons:

  1. We can use most natural names for functions, instead of adding awkward suffixes.
  2. By creating a new CRAN package, we are not weighed down by legacy functions breaking tests or reverse dependency check on CRAN. The more functions we have in the package, the more burdensome it will be to maintain the CRAN package.
  3. By starting fresh, we can make great changes to the R package without worrying about breaking legacy methods.

What do you think? @david-cortes @mayer79 @jameslamb @dfsnow

hcho3 commented 1 year ago

@trivialfis Also, we need to decide whether we want to try to submit the current XGBoost 2.0 to CRAN. My guess is that it would be a lot of work; correct me if I'm wrong. For example, we'd need to submit pull requests to reverse dependencies. The new requirement for limiting multithreading on the test farm is proving to be a headache as well.

My personal preference is to not bother with submitting current XGBoost code to CRAN. Let's start making great changes to the R package now, and later we can submit {xgboost2}.

trivialfis commented 1 year ago

The new requirement for limiting multithreading

This one is solved after many debugging efforts during the 2.0 release. The reverse dependency is a little bit soul-crushing though, which I haven't been able to finish.

We can use most natural names for functions, instead of adding awkward suffixes.

By starting fresh, we can make great changes to the R package without worrying about breaking legacy methods.

I'm totally on board with this, it's just whether @david-cortes and @mayer79 want to take an incremental approach or start anew, or somewhere in the middle with a transition peroid. Either way, I'm happy to assist and will try to support whatever decision is made by the implementor.

mayer79 commented 1 year ago

We would probably write the train/cv-API from scratch, but recycle the internals that wrap the C++ functions. I will start to study the current functions this weekend.

david-cortes commented 1 year ago

I don't know what's the best way forward here. I see several potential routes.

(a) Release under name xgboost2, holding off the release until the hypothetical new R interface is finished.

(b) Release an xgboost==2.0.0 now, and make gradual changes to the R interface with regular releases.

(c) Release under name xgboost, holding off v2.0 release until the hypothetical new R interface is finished, then release with this new interface replacing the old one.

(d) Release under name xgboost, with an additional new interface having different names from the older one, either holding off v2.0 release until it's ready, or releasing gradually.

From these options, I guess (d) is the most "correct" one. (c) is very tempting for me. (a) is suboptimal for everyone but perhaps a feasible compromise. (b) I don't like but I guess it's the way most high-profile packages operate.

Regardless of the choice, it'd nevertheless be helpful if we could come up with a roadmap with checklists of all the things that need to be changed, and perhaps a discussion thread to discuss about what changes are needed or welcomed (e.g. to discuss about how to deal with categorical features in the hypothetical formula interface). And of course would be helpful if this roadmap could be kept up-to-date from those discussions and updated as PRs are merged. And also to make a distinction between user-facing and non-user-facing - for example, using inplace predict makes no difference in the external interface, and would thus not block a release or block other PRs, whereas breaking down the model-fitting function into a formula and an x/y interface would be a big user-facing change.


As a different topic, I am also wondering about timelines and mode of development. For example, lightgbm has also been switching away from the interface it initially copied from xgboost, with these changes having been in small PRs each with lots of tests. This has been ongoing for 2 years and it's still not finished.

In the case of xgboost, I guess maintainer bandwidth is higher, the R interface is a comparatively easier codebase (e.g. no R6), and the amount of desired changes is much smaller; but introducing changes in granular PRs that each have dependencies on each other is a long development process, the moreso if we want it very thoroughly tested.

I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review. Not sure what maintainers here would think though.


On yet another topic, I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface. For example, I had a previous (unsuccessful) go at modifying parts of the predict function here and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.

Likewise, it'd be helpful if for example the C interface could take a parameter index1 as part of the input JSON and use 1-based indexing everywhere if so (e.g. when numerating nodes or boosting rounds, when refering to column indices of interaction constraints, etc.), but that's quite outside of the scope of an R interface, even though it'd make it more in line with the rest of the R ecosystem.


As yet another comment, the R interface currently has a divison between Booster and Booster.handle. I don't think many people or reverse dependencies are using the Booster.handle directly (such an object is kept inside of the Booster, which is what the main functions return), and removing its methods from the public interface is probably going to make the codebase easier to understand and to develop, as well as make the docs easier to follow. I think this could be a good opportunity to make the Booster.handle internal-only.

hcho3 commented 1 year ago

I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review.

We can mitigate the disadvantage of this approach by first doing a design review. Contributors would review the proposed interface and see if it is reasonable. As the most impactful changes are user-facing, design review will be useful.

Personally, as long as the new interface is more ergonomic and idiomatic R, I would not quibble with minor implementation details too much. I am no expert of R, and I would have to defer to you when it comes to nitty gritty details of programming in R.

hcho3 commented 1 year ago

I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface.

Yes, I am more than happy to help.

trivialfis commented 1 year ago

I had a previous (unsuccessful) go at modifying parts of the predict function https://github.com/dmlc/xgboost/pull/7947 and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.

Let's start with this. I will look into the issue. Related https://github.com/dmlc/xgboost/issues/9640 .

Likewise, it'd be helpful if for example the C interface could take a parameter index1 as part of the input JSON and use 1-based indexing everywhere

I think this will probably be an R-specific issue, considering that it's the only interface that uses a 1-based index. I don't think solving it in C is easier or cleaner than solving it in R. Maybe we can have some utilities in R to handle all indexing.

trivialfis commented 1 year ago

@david-cortes Thank you for the very detailed and informative breakdown of various options and obstacles. Here's my two cents:

I think we all agree that a huge set of breaking changes is inevitable. It's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package, we can worry about the CRAN package after we have finalized the new interface. The development of the new package should focus on the package itself and bear no historical burden. So, let's take a step back and focus on the new interface for now.

For the actual development, I agree that either a checklist or a design doc would be helpful for everyone to see the direction and status. We can also discuss some technical details in the doc. Feel free to open a new issue for this. (or use google doc if you prefer). For instance, we can sort out the details for the predict function and what's needed to implement it in the design doc.

I'd say introducing a new interface in one go where everyone in this thread makes comments

After having the tracking issue/design doc, I think we should merge PRs progressively even if there's a feature-complete prototype (we can split it up into multiple PRs). This is a good practice for reviews and for organizing code for unit tests.

In the end, I think we will follow the usual steps:

I can host online chats using teams if needed, particularly for working on C-related primitives. ;-)

trivialfis commented 1 year ago

's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package,

Just to clarify, I don't mean to break everyone's code. I think we will either launch a new package or launch a new interface without touching the existing one. Among the options listed by @david-cortes , I think we can agree that we will have two interfaces by the end, the question is how to introduce the new one to everyone.

david-cortes commented 1 year ago

Created a different issue to discuss details of what the R interface should look like and discuss about some potential issues in the implementation: https://github.com/dmlc/xgboost/issues/9734

Would be ideal if people in this thread could take a look and comment there before starting with the coding.

dfsnow commented 1 year ago

Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?

As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:

Basically, I'm worried that what we'll end up with is a a mostly unmaintained xgboost package, a new xgboost2 package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.

david-cortes commented 1 year ago

Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?

As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:

  • The R package struggled to find maintainers before, so it seems dubious that capacity exists for a full rewrite.
  • Adding new features (formulas, split high/low interface) will increase the maintenance burden even further.
  • Transitioning to a new package (xgboost2) is notoriously hard and is likely to split the ecosystem of users.

Basically, I'm worried that what we'll end up with is a a mostly unmaintained xgboost package, a new xgboost2 package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.

It doesn't need to be a full re-write since ideally the higher-level interface would be calling the lower-level interface (which likely won't end up with many changes from current code) in most cases and adding extra things on top.

So features like adding categorical features for DMatrix from current supported inputs (e.g. matrix, dgCMatrix) or using inplace-predict for matrix objects would not be in vain.

trivialfis commented 1 year ago

Probably need some more fixes on 1.7:

Dear maintainer,

Please see the problems shown on https://cran.r-project.org/web/checks/check_results_xgboost.html.

Please correct before 2023-12-12 to safely retain your package on CRAN.

Best, -k