Open trivialfis opened 1 year ago
As a side note, at the moment I'm the maintainer of the CRAN package. If there are contributors who want to share this I'm more than happy to assist and change the maintainership. I have some familiarity with R for statistical computing, but it's not my daily drive and it's unlikely that I will gain a deep understanding of the community in near future.
👋🏻 I'd be happy to help out more with the R package, if you're open to it. I'll try some more small PRs to get familiar with XGBoost generally and the R package specifically before addressing the specific features you've listed.
@jameslamb Thank you for working on the R package! Feel free to ping me if there's anything I can help.
@trivialfis I'm interested in tackling the first issue: adding categorical support to the existing R implementation. I should be able to get it done within the next month or two. Are you open to me creating a draft PR in a fork and pinging you + @jameslamb when it's ready?
Thank you for volunteering! Feel free to ping me.
Great initiative, thanks a lot @trivialfis :
Sure I could help if needed.
But I think a first step would be to decide on what the final result should look like, how far are the changes meant to be, and whether we're going to hold off the R release until all changes are complete (e.g. by keeping a separate branch) or introduce them gradually alongside with releases; or maybe even release under a different name like xgboost2
.
Right now, xgboost
works very differently from base R's stats::glm
, and from most other R packages for statistical modeling, whether tree-based (e.g. ranger
, randomForest
, gbm3
, etc.) or linear-based (e.g. VGAM
, glmnet
, glmx
, ncvreg
, grpreg
, etc.); both when fitting models and when making predictions. Using xgboost
in R without any wrapper is quite inconvenient and requires a lot more code than other packages.
Changing the interface to be more idiomatic would be nicer for users, and I think this major version jump is the best chance to introduce big changes as there probably won't be any next major release any time soon, if ever.
However, there's a lot of packages that depend on xgboost
in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.
So I'd say let's first clarify on what's acceptable to change and what isn't.
For example, off the top of my mind, I can think of the following un-idiomatic behaviors that could be improved:
xgboost
to a generic method and having a specialization for formula
, and another x/y interface that would accept inputs like xgb.DMatrix
/matrix
/dgCMatrix
. Here however we'd need to decide on a lot of nuances such as how to treat categoricals, whether to use base R's interpretation of formulas or some other parser with different options, etc.data.frame
as inputs. Ideally, here it should automatically detect which variables are to be considered as categorical (either factor
, or factor
+character
like base R). Perhaps it should also keep a list of their levels to recode variables at prediction time, and a list of the column names for feature importances, plots, and such.data.frame
, such as monotonicity constraints.arrow
. Perhaps this could be left only for xgb.DMatrix
if that makes it easier.data.frame
or not, for example.factor
.objective
and nthreads
.predict
function - e.g. a type
argument, enforced output shape as suggested here, keeping metadata like row names, etc.predict
to re-order columns to match names and recode factors to the levels seen during training. This is quite unusual among packages though.Some of these would involve large breakage in reverse dependencies, while others wouldn't, but I think the biggest gains here are precisely the ones that would cause breakage.
there's a lot of packages that depend on xgboost in some way (currently 133 listed in CRAN alone), plus lots of scripts in the wild, and for a package as widely used, any breaking change would lead to a large amount of frustrated users, plus would involve notifying the maintainers of those CRAN/bioconductor packages and perhaps subimitting PRs to them.
I am personally in favor of creating {xgboost2}
to avoid this very issue. Already, XGBoost 2.0 breaks multiple reverse dependencies and hence it's not yet available on CRAN.
@hcho3: I wanted to propose {xgboost2} as well, but did not dare :-)
I'm happy with xgboost2. Should it be a different package or can live side by side with the existing one?
@trivialfis We can host the R code in this repository, but we'd submit the code as {xgboost2}
to CRAN. CRAN will still have the old XGBoost as {xgboost}
.
but we'd submit the code as {xgboost2} to CRAN. CRAN will still have the old XGBoost as {xgboost}.
I will leave that for the actual implementation to decide. I'm fine with having a different CRAN package or continue to use the existing one with a new interface.
I see. So we could define new functions suffixed with Ex
or ng
or something like that, if we want to keep using {xgboost}
CRAN package. My personal preference is to create new CRAN package {xgboost2}
, for three reasons:
What do you think? @david-cortes @mayer79 @jameslamb @dfsnow
@trivialfis Also, we need to decide whether we want to try to submit the current XGBoost 2.0 to CRAN. My guess is that it would be a lot of work; correct me if I'm wrong. For example, we'd need to submit pull requests to reverse dependencies. The new requirement for limiting multithreading on the test farm is proving to be a headache as well.
My personal preference is to not bother with submitting current XGBoost code to CRAN. Let's start making great changes to the R package now, and later we can submit {xgboost2}
.
The new requirement for limiting multithreading
This one is solved after many debugging efforts during the 2.0 release. The reverse dependency is a little bit soul-crushing though, which I haven't been able to finish.
We can use most natural names for functions, instead of adding awkward suffixes.
By starting fresh, we can make great changes to the R package without worrying about breaking legacy methods.
I'm totally on board with this, it's just whether @david-cortes and @mayer79 want to take an incremental approach or start anew, or somewhere in the middle with a transition peroid. Either way, I'm happy to assist and will try to support whatever decision is made by the implementor.
We would probably write the train/cv-API from scratch, but recycle the internals that wrap the C++ functions. I will start to study the current functions this weekend.
I don't know what's the best way forward here. I see several potential routes.
(a) Release under name xgboost2
, holding off the release until the hypothetical new R interface is finished.
xgboost
in order not to cause further confusions.xgboost
and xgboost2
co-existing - would probably lead to lots of confusions, have poor discoverability, and a substantial amount of users will end up running older versions without being aware of the differences. Potentially lots of invalid bug reports in the future too.xgboost
later on or to reclaim its name, given the large amounts of reverse dependencies.(b) Release an xgboost==2.0.0
now, and make gradual changes to the R interface with regular releases.
(c) Release under name xgboost
, holding off v2.0 release until the hypothetical new R interface is finished, then release with this new interface replacing the old one.
Matrix
do), but will still require exchanges with them. Lots of dependencies will not update and will get removed from CRAN afterwards, breaking further user code.(d) Release under name xgboost
, with an additional new interface having different names from the older one, either holding off v2.0 release until it's ready, or releasing gradually.
xgb.train
, another for the old xgboost
that would be deprecated, then another for the new xgboost2
or whatever it might be called, each with their corresponding predict.<class>
, plot.<class>
, etc. having different arguments, perhaps with some C calls being shared among them and others being separate).xgboost()
are already taken and that there's lots of e.g. tutorials, blog posts, and stackoverflow answers for them.From these options, I guess (d) is the most "correct" one. (c) is very tempting for me. (a) is suboptimal for everyone but perhaps a feasible compromise. (b) I don't like but I guess it's the way most high-profile packages operate.
Regardless of the choice, it'd nevertheless be helpful if we could come up with a roadmap with checklists of all the things that need to be changed, and perhaps a discussion thread to discuss about what changes are needed or welcomed (e.g. to discuss about how to deal with categorical features in the hypothetical formula interface). And of course would be helpful if this roadmap could be kept up-to-date from those discussions and updated as PRs are merged. And also to make a distinction between user-facing and non-user-facing - for example, using inplace predict makes no difference in the external interface, and would thus not block a release or block other PRs, whereas breaking down the model-fitting function into a formula and an x/y interface would be a big user-facing change.
As a different topic, I am also wondering about timelines and mode of development. For example, lightgbm
has also been switching away from the interface it initially copied from xgboost
, with these changes having been in small PRs each with lots of tests. This has been ongoing for 2 years and it's still not finished.
In the case of xgboost
, I guess maintainer bandwidth is higher, the R interface is a comparatively easier codebase (e.g. no R6), and the amount of desired changes is much smaller; but introducing changes in granular PRs that each have dependencies on each other is a long development process, the moreso if we want it very thoroughly tested.
I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review. Not sure what maintainers here would think though.
On yet another topic, I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface. For example, I had a previous (unsuccessful) go at modifying parts of the predict
function here and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.
Likewise, it'd be helpful if for example the C interface could take a parameter index1
as part of the input JSON and use 1-based indexing everywhere if so (e.g. when numerating nodes or boosting rounds, when refering to column indices of interaction constraints, etc.), but that's quite outside of the scope of an R interface, even though it'd make it more in line with the rest of the R ecosystem.
As yet another comment, the R interface currently has a divison between Booster
and Booster.handle
. I don't think many people or reverse dependencies are using the Booster.handle
directly (such an object is kept inside of the Booster
, which is what the main functions return), and removing its methods from the public interface is probably going to make the codebase easier to understand and to develop, as well as make the docs easier to follow. I think this could be a good opportunity to make the Booster.handle
internal-only.
I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review.
We can mitigate the disadvantage of this approach by first doing a design review. Contributors would review the proposed interface and see if it is reasonable. As the most impactful changes are user-facing, design review will be useful.
Personally, as long as the new interface is more ergonomic and idiomatic R, I would not quibble with minor implementation details too much. I am no expert of R, and I would have to defer to you when it comes to nitty gritty details of programming in R.
I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface.
Yes, I am more than happy to help.
I had a previous (unsuccessful) go at modifying parts of the predict function https://github.com/dmlc/xgboost/pull/7947 and would find it very helpful if for example there was a C-level functionality to distinguish which objectives are for classification, and how many classes they involve.
Let's start with this. I will look into the issue. Related https://github.com/dmlc/xgboost/issues/9640 .
Likewise, it'd be helpful if for example the C interface could take a parameter index1 as part of the input JSON and use 1-based indexing everywhere
I think this will probably be an R-specific issue, considering that it's the only interface that uses a 1-based index. I don't think solving it in C is easier or cleaner than solving it in R. Maybe we can have some utilities in R to handle all indexing.
@david-cortes Thank you for the very detailed and informative breakdown of various options and obstacles. Here's my two cents:
I think we all agree that a huge set of breaking changes is inevitable. It's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package, we can worry about the CRAN package after we have finalized the new interface. The development of the new package should focus on the package itself and bear no historical burden. So, let's take a step back and focus on the new interface for now.
For the actual development, I agree that either a checklist or a design doc would be helpful for everyone to see the direction and status. We can also discuss some technical details in the doc. Feel free to open a new issue for this. (or use google doc if you prefer). For instance, we can sort out the details for the predict function and what's needed to implement it in the design doc.
I'd say introducing a new interface in one go where everyone in this thread makes comments
After having the tracking issue/design doc, I think we should merge PRs progressively even if there's a feature-complete prototype (we can split it up into multiple PRs). This is a good practice for reviews and for organizing code for unit tests.
In the end, I think we will follow the usual steps:
I can host online chats using teams if needed, particularly for working on C-related primitives. ;-)
's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package,
Just to clarify, I don't mean to break everyone's code. I think we will either launch a new package or launch a new interface without touching the existing one. Among the options listed by @david-cortes , I think we can agree that we will have two interfaces by the end, the question is how to introduce the new one to everyone.
Created a different issue to discuss details of what the R interface should look like and discuss about some potential issues in the implementation: https://github.com/dmlc/xgboost/issues/9734
Would be ideal if people in this thread could take a look and comment there before starting with the coding.
Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?
As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:
xgboost2
) is notoriously hard and is likely to split the ecosystem of users.Basically, I'm worried that what we'll end up with is a a mostly unmaintained xgboost
package, a new xgboost2
package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.
Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR?
As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:
- The R package struggled to find maintainers before, so it seems dubious that capacity exists for a full rewrite.
- Adding new features (formulas, split high/low interface) will increase the maintenance burden even further.
- Transitioning to a new package (
xgboost2
) is notoriously hard and is likely to split the ecosystem of users.Basically, I'm worried that what we'll end up with is a a mostly unmaintained
xgboost
package, a newxgboost2
package a year from now, and an even bigger maintenance burden for @david-cortes and @trivialfis. That said, I don't have the detailed knowledge of the R interface necessary to estimate the amount of work involved here, so I could be way off base.
It doesn't need to be a full re-write since ideally the higher-level interface would be calling the lower-level interface (which likely won't end up with many changes from current code) in most cases and adding extra things on top.
So features like adding categorical features for DMatrix
from current supported inputs (e.g. matrix
, dgCMatrix
) or using inplace-predict for matrix
objects would not be in vain.
Probably need some more fixes on 1.7:
Dear maintainer,
Please see the problems shown on https://cran.r-project.org/web/checks/check_results_xgboost.html.
Please correct before 2023-12-12 to safely retain your package on CRAN.
Best, -k
We are maintaining the R package with the best effort, but due to resource limitations, the R package feature has been lagging behind compared to the Python package. Here is a list of things that can be improved. Contributions are appreciated!
factor
type and pass related information into xgboost.QuantileDMatrix
support can help R users to reduce memory usage.DataIter
.