[Call for help] Bringing the R package up-to-date.

trivialfis commented 1 year ago

We are maintaining the R package with the best effort, but due to resource limitations, the R package feature has been lagging behind compared to the Python package. Here is a list of things that can be improved. Contributions are appreciated!

Categorical data support. We need the R package to recognize the factor type and pass related information into xgboost.
Multi-output models. Even though it's still working in progress, establishing an interface would be nice. (https://github.com/dmlc/xgboost/pull/9526)
Design an interface that's more native to R, for instance, using formulas like what the GLM does. The tidy verse can be a good design reference.
Distributed training, perhaps integration with RSpark.
QuantileDMatrix support can help R users to reduce memory usage.
External memory with DataIter.
quantile regression with multiple quantiles.
Use CMake instead of autotools when possible, this way we can make it easier to support GPU with R.
inplace prediction.

trivialfis commented 1 year ago

As a side note, at the moment I'm the maintainer of the CRAN package. If there are contributors who want to share this I'm more than happy to assist and change the maintainership. I have some familiarity with R for statistical computing, but it's not my daily drive and it's unlikely that I will gain a deep understanding of the community in near future.

jameslamb commented 1 year ago

👋🏻 I'd be happy to help out more with the R package, if you're open to it. I'll try some more small PRs to get familiar with XGBoost generally and the R package specifically before addressing the specific features you've listed.

trivialfis commented 1 year ago

@jameslamb Thank you for working on the R package! Feel free to ping me if there's anything I can help.

dfsnow commented 1 year ago

@trivialfis I'm interested in tackling the first issue: adding categorical support to the existing R implementation. I should be able to get it done within the next month or two. Are you open to me creating a draft PR in a fork and pinging you + @jameslamb when it's ready?

trivialfis commented 1 year ago

Thank you for volunteering! Feel free to ping me.

mayer79 commented 1 year ago

Great initiative, thanks a lot @trivialfis :

If @david-cortes has a little bit of spare time, he would be our man here. I can also help, but need a little bit of time to get used to the code base.
I'd love to see multi-target support in R! Same for the multi-quantile loss.
Do we already have SHAP values for the multi-target (and multi-quantile) situation (not only in R, but in XGBoost in general)?

david-cortes commented 1 year ago

Sure I could help if needed.

But I think a first step would be to decide on what the final result should look like, how far are the changes meant to be, and whether we're going to hold off the R release until all changes are complete (e.g. by keeping a separate branch) or introduce them gradually alongside with releases; or maybe even release under a different name like xgboost2.

Right now, xgboost works very differently from base R's stats::glm, and from most other R packages for statistical modeling, whether tree-based (e.g. ranger, randomForest, gbm3, etc.) or linear-based (e.g. VGAM, glmnet, glmx, ncvreg, grpreg, etc.); both when fitting models and when making predictions. Using xgboost in R without any wrapper is quite inconvenient and requires a lot more code than other packages.

Changing the interface to be more idiomatic would be nicer for users, and I think this major version jump is the best chance to introduce big changes as there probably won't be any next major release any time soon, if ever.

However, there's a lot of packages that depend on xgboost in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.

So I'd say let's first clarify on what's acceptable to change and what isn't.

For example, off the top of my mind, I can think of the following un-idiomatic behaviors that could be improved:

Accepting formulas as possible inputs. This could be changed for example by changing xgboost to a generic method and having a specialization for formula, and another x/y interface that would accept inputs like xgb.DMatrix/matrix/dgCMatrix. Here however we'd need to decide on a lot of nuances such as how to treat categoricals, whether to use base R's interpretation of formulas or some other parser with different options, etc.
Accepting data.frame as inputs. Ideally, here it should automatically detect which variables are to be considered as categorical (either factor, or factor+character like base R). Perhaps it should also keep a list of their levels to recode variables at prediction time, and a list of the column names for feature importances, plots, and such.
Allowing other arguments to be able to refer to column names from a data.frame, such as monotonicity constraints.
Accepting other sources of data like arrow. Perhaps this could be left only for xgb.DMatrix if that makes it easier.
Accepting an argument to specify categorical columns in non-data.frame inputs. Here we'd need to clarify whether that would be supposed to error-out on data.frame or not, for example.
Accepting factors as response variable for classification objectives, and keeping a list of their levels for predictions. Ideally also making the default objective switch to logistic/multinomial when passed a factor.
Using those factor levels as output/names from the prediction function and plot function.
Perhaps also accepting a column name from a data frame as response variable and/or as weights, either as string or using non-standard evaluation.
Changing the names of the arguments to functions and their positions in the call to better match with base R and other packages (e.g. "weights" instead of "weight").
Moving more options to top-level arguments, like objective and nthreads.
Using 1-based indexing for things that refer to enumerations, such as tree/node numbers.
Changing the structure of the predict function - e.g. a type argument, enforced output shape as suggested here, keeping metadata like row names, etc.
Potentially having the option for predict to re-order columns to match names and recode factors to the levels seen during training. This is quite unusual among packages though.
Having a fit control function for generating parameters with auto-completion capabilities. Not sure how feasible this is though, as I think there isn't a single place with a list of parameters.
Improve efficiency of the prediction by using the same mechanisms for passing data as the python interface does.
Allow passing survival objects for survival objectives.

Some of these would involve large breakage in reverse dependencies, while others wouldn't, but I think the biggest gains here are precisely the ones that would cause breakage.

hcho3 commented 1 year ago

there's a lot of packages that depend on xgboost in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.

I am personally in favor of creating {xgboost2} to avoid this very issue. Already, XGBoost 2.0 breaks multiple reverse dependencies and hence it's not yet available on CRAN.

mayer79 commented 1 year ago

@hcho3: I wanted to propose {xgboost2} as well, but did not dare :-)

trivialfis commented 1 year ago

I'm happy with xgboost2. Should it be a different package or can live side by side with the existing one?

hcho3 commented 1 year ago

@trivialfis We can host the R code in this repository, but we'd submit the code as {xgboost2} to CRAN. CRAN will still have the old XGBoost as {xgboost}.

trivialfis commented 1 year ago

but we'd submit the code as {xgboost2} to CRAN. CRAN will still have the old XGBoost as {xgboost}.

I will leave that for the actual implementation to decide. I'm fine with having a different CRAN package or continue to use the existing one with a new interface.

hcho3 commented 1 year ago

I see. So we could define new functions suffixed with Ex or ng or something like that, if we want to keep using {xgboost} CRAN package. My personal preference is to create new CRAN package {xgboost2}, for three reasons:

We can use most natural names for functions, instead of adding awkward suffixes.
By creating a new CRAN package, we are not weighed down by legacy functions breaking tests or reverse dependency check on CRAN. The more functions we have in the package, the more burdensome it will be to maintain the CRAN package.
By starting fresh, we can make great changes to the R package without worrying about breaking legacy methods.

What do you think? @david-cortes @mayer79 @jameslamb @dfsnow

hcho3 commented 1 year ago

@trivialfis Also, we need to decide whether we want to try to submit the current XGBoost 2.0 to CRAN. My guess is that it would be a lot of work; correct me if I'm wrong. For example, we'd need to submit pull requests to reverse dependencies. The new requirement for limiting multithreading on the test farm is proving to be a headache as well.

My personal preference is to not bother with submitting current XGBoost code to CRAN. Let's start making great changes to the R package now, and later we can submit {xgboost2}.

trivialfis commented 1 year ago

The new requirement for limiting multithreading

This one is solved after many debugging efforts during the 2.0 release. The reverse dependency is a little bit soul-crushing though, which I haven't been able to finish.

We can use most natural names for functions, instead of adding awkward suffixes.

By starting fresh, we can make great changes to the R package without worrying about breaking legacy methods.

I'm totally on board with this, it's just whether @david-cortes and @mayer79 want to take an incremental approach or start anew, or somewhere in the middle with a transition peroid. Either way, I'm happy to assist and will try to support whatever decision is made by the implementor.

mayer79 commented 1 year ago

We would probably write the train/cv-API from scratch, but recycle the internals that wrap the C++ functions. I will start to study the current functions this weekend.

david-cortes commented 1 year ago

I don't know what's the best way forward here. I see several potential routes.

(a) Release under name xgboost2, holding off the release until the hypothetical new R interface is finished.

In this case, I guess it might be better not to make a 2.0 release of xgboost in order not to cause further confusions.
After release, it would involve asking dependencies to switch to the new package name.
It'd be quite inconvenient for users to have xgboost and xgboost2 co-existing - would probably lead to lots of confusions, have poor discoverability, and a substantial amount of users will end up running older versions without being aware of the differences. Potentially lots of invalid bug reports in the future too.
Would make any development substantially easier and faster however. Liaison with current reverse dependencies would just amount to sending them an email informing them of the new version and asking them to update, with no follow up.
I don't think it'd be possible to remove the original xgboost later on or to reclaim its name, given the large amounts of reverse dependencies.
I guess v1 might still require maintenance updates in the future as e.g. compiler changes push new requirements to CRAN.

(b) Release an xgboost==2.0.0 now, and make gradual changes to the R interface with regular releases.

It would take a long amount of time to reach the desired state here, and would require tight coordination with reverse dependencies on major breaking changes and deprecations.
I'm not sure if it'd be a good idea to make large changes with minor/patch version increments.
Probably quite an annoyance for users of the package to keep up-to-date with continuous changes.
Would probably be the most convenient option for discoverability and for users who want the latest features.
This is probably the option with the highest development effort.
Might not be possible to ever make all desired breaking changes, getting stuck with a suboptimal interface.
Will anyway end up causing lots of breakage, just spreaded over time.

(c) Release under name xgboost, holding off v2.0 release until the hypothetical new R interface is finished, then release with this new interface replacing the old one.

Would cause enormous breakage, probably in every single reverse dependency, and in most user code that calls xgboost directly (including e.g. stackoverflow answers).
Would require informing reverse dependencies of the required changes. Given the amount of reverse dependencies, I guess this would have to be without sending them PRs (just the way large packages like Matrix do), but will still require exchanges with them. Lots of dependencies will not update and will get removed from CRAN afterwards, breaking further user code.
There will be a period of madness-level whole-CRAN breakage right after submission as the reverse dependencies will not make the update in synch, and probably lots of unhappy users from this. Would likely require rounds of negotiation with CRAN administrators about the timing and involvement with dependencies.
Will make development easy and fast.

(d) Release under name xgboost, with an additional new interface having different names from the older one, either holding off v2.0 release until it's ready, or releasing gradually.

This would be tricky to develop. Especially difficult here is dealing with names of classes, method dispatches, exports, etc. in a way that would call the right C functions with the right arguments and not cause breakage (e.g. how do we determine when a parameter is meant to contain column names or column indices, whether they are 1-based or 0-based, etc.). Might require lots of duplicated code (e.g. a different class for xgb.train, another for the old xgboost that would be deprecated, then another for the new xgboost2 or whatever it might be called, each with their corresponding predict.<class>, plot.<class>, etc. having different arguments, perhaps with some C calls being shared among them and others being separate).
This is probably the most harmless option for current users and reverse dependencies, but might still make discoverability suboptimal given that names like xgboost() are already taken and that there's lots of e.g. tutorials, blog posts, and stackoverflow answers for them.
It'd make the code harder to understand, potentially pushing off contributors, and leading to invalid bug reports due to confusions.
Comparatively more prone to bugs.
After a very long period, perhaps the earlier interface could be retired and the name reclaimed, but I'm not certain that this would be quick by any means or possible to do.

From these options, I guess (d) is the most "correct" one. (c) is very tempting for me. (a) is suboptimal for everyone but perhaps a feasible compromise. (b) I don't like but I guess it's the way most high-profile packages operate.

Regardless of the choice, it'd nevertheless be helpful if we could come up with a roadmap with checklists of all the things that need to be changed, and perhaps a discussion thread to discuss about what changes are needed or welcomed (e.g. to discuss about how to deal with categorical features in the hypothetical formula interface). And of course would be helpful if this roadmap could be kept up-to-date from those discussions and updated as PRs are merged. And also to make a distinction between user-facing and non-user-facing - for example, using inplace predict makes no difference in the external interface, and would thus not block a release or block other PRs, whereas breaking down the model-fitting function into a formula and an x/y interface would be a big user-facing change.

As a different topic, I am also wondering about timelines and mode of development. For example, lightgbm has also been switching away from the interface it initially copied from xgboost, with these changes having been in small PRs each with lots of tests. This has been ongoing for 2 years and it's still not finished.

In the case of xgboost, I guess maintainer bandwidth is higher, the R interface is a comparatively easier codebase (e.g. no R6), and the amount of desired changes is much smaller; but introducing changes in granular PRs that each have dependencies on each other is a long development process, the moreso if we want it very thoroughly tested.

I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review. Not sure what maintainers here would think though.

On yet another topic, I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface. For example, I had a previous (unsuccessful) go at modifying parts of the predict function here and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.

Likewise, it'd be helpful if for example the C interface could take a parameter index1 as part of the input JSON and use 1-based indexing everywhere if so (e.g. when numerating nodes or boosting rounds, when refering to column indices of interaction constraints, etc.), but that's quite outside of the scope of an R interface, even though it'd make it more in line with the rest of the R ecosystem.

As yet another comment, the R interface currently has a divison between Booster and Booster.handle. I don't think many people or reverse dependencies are using the Booster.handle directly (such an object is kept inside of the Booster, which is what the main functions return), and removing its methods from the public interface is probably going to make the codebase easier to understand and to develop, as well as make the docs easier to follow. I think this could be a good opportunity to make the Booster.handle internal-only.

hcho3 commented 1 year ago

I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review.

We can mitigate the disadvantage of this approach by first doing a design review. Contributors would review the proposed interface and see if it is reasonable. As the most impactful changes are user-facing, design review will be useful.

Personally, as long as the new interface is more ergonomic and idiomatic R, I would not quibble with minor implementation details too much. I am no expert of R, and I would have to defer to you when it comes to nitty gritty details of programming in R.

hcho3 commented 1 year ago

I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface.

Yes, I am more than happy to help.

trivialfis commented 1 year ago

I had a previous (unsuccessful) go at modifying parts of the predict function https://github.com/dmlc/xgboost/pull/7947 and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.

Let's start with this. I will look into the issue. Related https://github.com/dmlc/xgboost/issues/9640 .

Likewise, it'd be helpful if for example the C interface could take a parameter index1 as part of the input JSON and use 1-based indexing everywhere

I think this will probably be an R-specific issue, considering that it's the only interface that uses a 1-based index. I don't think solving it in C is easier or cleaner than solving it in R. Maybe we can have some utilities in R to handle all indexing.

trivialfis commented 1 year ago

@david-cortes Thank you for the very detailed and informative breakdown of various options and obstacles. Here's my two cents:

I think we all agree that a huge set of breaking changes is inevitable. It's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package, we can worry about the CRAN package after we have finalized the new interface. The development of the new package should focus on the package itself and bear no historical burden. So, let's take a step back and focus on the new interface for now.

For the actual development, I agree that either a checklist or a design doc would be helpful for everyone to see the direction and status. We can also discuss some technical details in the doc. Feel free to open a new issue for this. (or use google doc if you prefer). For instance, we can sort out the details for the predict function and what's needed to implement it in the design doc.

I'd say introducing a new interface in one go where everyone in this thread makes comments

After having the tracking issue/design doc, I think we should merge PRs progressively even if there's a feature-complete prototype (we can split it up into multiple PRs). This is a good practice for reviews and for organizing code for unit tests.

In the end, I think we will follow the usual steps:

design doc/roadmap tracking.
implement the interface iteratively, likely starting from the new booster type and upward to user-facing features.
new documentation.

I can host online chats using teams if needed, particularly for working on C-related primitives. ;-)

trivialfis commented 1 year ago

's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package,

Just to clarify, I don't mean to break everyone's code. I think we will either launch a new package or launch a new interface without touching the existing one. Among the options listed by @david-cortes , I think we can agree that we will have two interfaces by the end, the question is how to introduce the new one to everyone.

david-cortes commented 1 year ago

Created a different issue to discuss details of what the R interface should look like and discuss about some potential issues in the implementation: https://github.com/dmlc/xgboost/issues/9734

Would be ideal if people in this thread could take a look and comment there before starting with the coding.

dfsnow commented 1 year ago

Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?

As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:

The R package struggled to find maintainers before, so it seems dubious that capacity exists for a full rewrite.
Adding new features (formulas, split high/low interface) will increase the maintenance burden even further.
Transitioning to a new package (xgboost2) is notoriously hard and is likely to split the ecosystem of users.

Basically, I'm worried that what we'll end up with is a a mostly unmaintained xgboost package, a new xgboost2 package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.

david-cortes commented 1 year ago

Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?

As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:

The R package struggled to find maintainers before, so it seems dubious that capacity exists for a full rewrite.

Adding new features (formulas, split high/low interface) will increase the maintenance burden even further.

Transitioning to a new package (xgboost2) is notoriously hard and is likely to split the ecosystem of users.

Basically, I'm worried that what we'll end up with is a a mostly unmaintained xgboost package, a new xgboost2 package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.

It doesn't need to be a full re-write since ideally the higher-level interface would be calling the lower-level interface (which likely won't end up with many changes from current code) in most cases and adding extra things on top.

So features like adding categorical features for DMatrix from current supported inputs (e.g. matrix, dgCMatrix) or using inplace-predict for matrix objects would not be in vain.

trivialfis commented 1 year ago

Probably need some more fixes on 1.7:

Dear maintainer,

Please see the problems shown on https://cran.r-project.org/web/checks/check_results_xgboost.html.

Please correct before 2023-12-12 to safely retain your package on CRAN.

Best, -k

dmlc / xgboost

[Call for help] Bringing the R package up-to-date. #9475