dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Proposal for new R interface (discussion thread) #9734

Open david-cortes opened 10 months ago

david-cortes commented 10 months ago

ref https://github.com/dmlc/xgboost/issues/9475

I'm opening this issue to propose a high-level overview of what an idiomatic R interface for xgboost should ideally look like, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people on these ideas. I'm not aware of everything that xgboost can do, so I might be missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this, but I don't know where else to put it.)

Looking at the python and C interfaces, I see that there's a low-level interface with objects like Booster and associated functions/methods like xgb.train that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier interface that works more or less the same way as scikit-learn objects, scikit-learn being the most common ML framework for python.

In the current R interface, there's to some degree a division into lower/higher-level interfaces with xgb.train() and xgboost(), but there's less of a separation: e.g. both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's stats::glm() or other popular R packages - e.g. it doesn't take formulas or data frames.


In my opinion, a more ideal R interface could consist of a mixture of a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface, plus a high-level interface that would make it ergonomic to use at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be to make use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like survival::Surv, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, randomForest() will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function xgboost() and single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:

* Having the low-level interface work **only** with `DMatrix` (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work **only** with common R objects (e.g. `data.frame`, `matrix`, `dgCMatrix`, `dgRMatrix`, etc.), but **not** with `DMatrix` (IMO this would not only make the codebase and docs simpler, but also, given that both have different metadata, would make it easier to guess what to expect as outputs from other functions).
* Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the `predict` function for the object returned by `xgb.train()` would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the `predict` function for the object returned from `xgboost()` would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
* The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
* Having separate associated `cv` functions for the low-level and the high-level interfaces (e.g. `xgboost.cv()` or `cv.xgboost()`, like there is `cv.glmnet()`; and another `xgb.cv()` that would work with `DMatrix`).
* Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
* Exporting from the package only what's necessary to support these two modes of operation alongside their methods - e.g. not exporting internal functions like `xgboost::normalize` or internal classes/methods like `Booster.handle`.
A couple extra details **about the high-level interface** that in my opinion would make it nicer to use. I think many people might disagree with some of the points here though, so I would like to hear opinions on these ideas:

* It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow). A purely hypothetical sketch of such an x/y call follows after this list.
* The x/y interface should accept `x` in the format of common R classes like `data.frame`, `matrix`, `dgCMatrix`, perhaps `arrow` (although I wouldn't consider it a priority), and perhaps other classes like `float32` (from the `float` package). I don't think it'd be a good idea to support more obscure potential classes like `dist` that aren't typically used in data pipelines.
* If I'm not mistaken, CSR is supported for prediction but not for training, so `predict` should additionally take `dgRMatrix` and `sparseVector` like it does currently.
* I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
* If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
* (**Controversial**) It should not only accept but also **require** different classes for `y` depending on the objective, and by default, if not passed, should select an objective based on the type of `y`:
  * `factor` with 2 levels -> `binary:logistic`.
  * `factor` with >2 levels -> `multi:softmax`.
  * `survival::Surv` -> `survival:aft` (with normal link), or maybe `survival:cox` if only right-censored.
  * others -> `reg:squarederror`.
* And consequently, require `factor` types for classification, `Surv` for survival, and numeric/integer for regression (taking `logical` (boolean) as 0/1 numeric).
* If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either `data.frame` or `matrix`, and keep their column names as metadata if they have them.
* If `x` is a `data.frame`, it should automatically recognize `factor` columns as categorical types, and (**controversial**) also take `character` (string) class as categorical, converting it to `factor` on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
* (**Controversial**) It should only support categorical columns in the form of an input `x` being a `data.frame` with `factor`/`character` types. Note: see question below about sparse types and categorical types.
* If `x` is either a `data.frame` or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
  * For example, if `x` has 4 columns `[c1, c2, c3, c4]`, it should allow passing `monotone_constraints` as either `c(-1, 0, 0, 1)`, or as `c("c1" = -1, "c4" = 1)`, or as `list(c1 = -1, c4 = 1)`; but not as e.g. `c("1" = -1, "4" = -1)`.
  * OTOH if `x` is a `matrix`/`dgCMatrix` without column names, it should accept them as `c(-1, 0, 0, 1)`, or as `c("1" = -1, "4" = -1)`, erroring out on non-integer names, not allowing negative numbers, and not allowing `list` with length
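To make the x/y proposal above concrete, here is a purely hypothetical sketch of what such a call could look like. None of these argument names or behaviors exist in the package today; they only illustrate the bullets above (objective inferred from the class of `y`, constraints referenced by column name, factor output from `predict`):

```r
# Hypothetical sketch only -- not an existing xgboost API.
library(xgboost)
data(iris)

model <- xgboost(
  x = iris[, -5],                             # data.frame: factor columns treated as categorical
  y = iris$Species,                           # factor with >2 levels -> multi:softmax by default
  nrounds = 50,
  monotone_constraints = c(Petal.Width = 1)   # referencing a column by name
)

pred <- predict(model, iris[, -5])            # would return a factor for classification
```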

The low-level interface in turn should support everything that the C interface offers - e.g. creating DMatrix from libsvm files, specifying categorical columns from a non-data-frame array, etc.


As per the formula interface, this is quite tricky to implement in a good way for xgboost. In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

* Stepwise selection from base R doesn't work with xgboost.
* Transformations like `log(x1)` have no effect in decision trees, transformations like `x1:x2` don't have the same practical effect as decision trees implicitly create interactions, and transformations like `x^2` the way R does them do not make a difference for decision trees compared to simpler `I(x^2)+x`, for example.
* Other outputs from formulas, like contrasts, are not used either way.
* Transformations like `log(y)` aren't any harder to do with an x/y interface than with a formula interface.

Nevertheless, a formula interface can still be handy for calls like `y ~ x1 + x2 + x3` when there are many columns, and packages like `ranger` offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package. Some nuances in terms of formulas though:

* Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named `(Intercept)` that would have the same value for every row.
* Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
* Using a `formula` for processing the training data also implies using it for `predict` - so for example, formulas do not recode levels of factor variables when calling `predict`, which the x/y interface could potentially do, leading to differences in behavior between the two interfaces.
* Some packages have their own formula parsers which allow additional interpretations for something like `y ~ x | z` compared to what base R would do (for example, `lme4` would interpret `z` here as something that controls mixed effects for `x`, while base R would interpret this as a feature "x or z"), and in some cases `xgboost()` would also need a different interpretation of formulas (e.g. for parameter `qid`, which doesn't fit on either side of the formula).
* Packages like `randomForest` don't use the base R formula parser, taking it instead (by copying the code) from a different library, `e1071`, which is GPLv2-licensed - incompatible with xgboost's Apache license.
* Referring to columns that are the result of a formula by index in e.g. `monotone_constraints` could be tricky - e.g. if we remove the auto-added `(Intercept)` column, should the numbers re-adjust?

Hence, supporting a formula interface for `xgboost()` would be tricky:

* It could be possible to use base R's formula parser, but:
  * In this case it would not be possible to use categorical features (or their interactions) as such, as they would be either dummy- or one-hot encoded.
  * It theoretically could be possible to forcibly add `-1` at the end of the formula (which means "don't add an intercept") by converting it to string and back, in order to get one-hot encoding of factors and avoid adding `(Intercept)`, but I can foresee cases in which this might not work depending on how the formula is inputted.
  * It would not be possible to specify additional columns like `qid` in these formulas.
* It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and e.g. interpret `|` differently, but what should happen in this case if the user requests something like `xnum*xcat` or `f(xcat)` (with `f` being some arbitrary base function)?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as a dependency (I'm not aware of any), I think the best way forward here would be to:

* Use base R's formula parser.
* Don't support parameters like `monotone_constraints` or `qid` in the formula interface.
* Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
* Try to remove the intercept by trying to create a new formula with ` - 1` appended, and error out if this doesn't succeed (a rough sketch follows below).
* Let the formula handle all the `predict` post-processing, regardless of how the x/y interface does it.
* Encourage users and reverse dependencies to avoid the formula interface for serious usage by being clear about all these limitations in the docs and having a warning message in the "description" section (which comes before the signatures in the doc). We're not in control of how reverse dependencies use it, but if any of the larger ones is tempted to use this formula interface anyway, it could be followed up in their githubs.
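To illustrate the restricted approach above, here is a rough sketch using only base R's formula machinery. `form` and `data` stand for the user's inputs; nothing here is existing xgboost API, and details such as how the encoding would interact with `max_cat_to_onehot` are left out:

```r
# Rough sketch, assuming base R's formula tools and no qid/constraint support.
formula_to_xy <- function(form, data) {
  # Try to drop the intercept so that no constant "(Intercept)" column is created.
  form_no_int <- stats::update(form, . ~ . - 1)
  mf <- stats::model.frame(form_no_int, data = data)
  y <- stats::model.response(mf)
  # model.matrix() one-hot/dummy-encodes factor and character columns here,
  # i.e. native categorical support is bypassed in this formula path.
  x <- stats::model.matrix(attr(mf, "terms"), mf)
  list(x = x, y = y)
}

xy <- formula_to_xy(Sepal.Width ~ ., data = iris)
head(xy$x)
```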

A couple questions for xgboost developers (@hcho3 @trivialfis ?) I have here:

* Does `xgboost` support passing a mixture of dense and sparse data together? `pandas` supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the `DMatrix` object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a `DMatrix` for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a `list`.
* Is there a C-level `DMatrix` equivalent of `cbind` / `np.c_`? I don't see any, but this would make things easier.
* Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
* In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function `XGDMatrixCreateFromDT`, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like `int64`, these would be cast to a floating-point type. In R, especially when using `data.table`, it's common to have 64-bit integers from the package `bit64`, which if cast to `float64` would lose integral precision beyond $2^{52}$ (same for `int32`, which could lose precision when cast to `float32`) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a `DMatrix` from a structure like an R `data.frame`, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
* Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which in the future someone for example submits a PR adding a new objective like `binary:cauchit`, adds it to the python `XGBClassifier` class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
* How does the python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically keep in sync with the C one every time a new parameter is added? Or will it need to manually keep a recollection of every possible allowed argument and its default value?
* How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing? I see there's also a Julia interface for example, which just like R uses base1 indexing, but from a brief look at the docs it looks like it sticks with base0 indexing for `xgboost` regardless.
* Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C `Booster`, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from `xgboost()`, an earlier object saved with `saveRDS` will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields as they are added to the R interface when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work and be quite hard to test and easy to miss when adding new features.
* How does the python scikit-learn interface deal with `pickle` and object attributes that aren't part of the C `Booster`? How does it deal with converting between `Booster` and scikit-learn classes?
* I see the R interface currently has a big notice about objects serialized with non-xgboost functions like `saveRDS` not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C `Booster`, but there could be cases in which the R interface changes independently of the C struct.
* If one were to load a C `Booster` into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill in generic names "V1..N", "1..N", etc. like base R does, but this would not lead to nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the `Booster` wouldn't? Is it somehow standardized?
* How does `xgboost` determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
* Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a `DMatrix`? If so, I guess support for `arrow` in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):

* Are there any popular data.frame-like objects from packages related to GPU computing (like `cuDF` for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
* Is there any popular R type for learning-to-rank tasks that would involve the `qid` and position information that `xgboost` uses? I am not aware of any but I'm no expert in this area.
* Is there perhaps some object class from Bioconductor or from some other domain (like, don't know, geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without undergoing transformation by the user to e.g. `factor` / `numeric` / etc.?
* Is there some package offering a custom formula parser meant for decision tree models?
hcho3 commented 10 months ago

I am the person who added the section a-compatibility-note-for-saveRDS-save to the manual, so let me answer the question regarding serialization.

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster?

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't?

The sklearn interface of XGBoost extracts the necessary key-value pairs from the object attributes, serializes them as a JSON string, and stores the JSON string in the attributes field of the C Booster object.

Admittedly this approach is not foolproof. This is why we ask users to save the model using XGBoost's native serializer (load_model() / save_model()). It must be noted that the base scikit-learn package provides no guarantee of backward compatibility for pickled model objects. Hence, Python users are accustomed to using pickle only for short-term storage, relying on other formats for long-term storage.

It appears that the expectations are somewhat different in the R community. After seeing lots of bug reports arising from calling saveRDS on old model files (#5794), I had to add the big warning about saveRDS in a separate section a-compatibility-note-for-saveRDS-save.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.

Checking the version of the C Booster is sufficient, because we always bump the version of the native package and the R package in lockstep. For example, when the native package had its version bumped from 1.3.0 to 1.4.0, the R package had its version bumped from 1.3.0.1 to 1.4.0.1.

How does it deal with converting between Booster and scikit-learn classes?

You can extract the Booster class by calling get_booster() method of XGBClassifier / XGBRegressor. To convert Booster to XGBClassifier / XGBRegressor, first persist the Booster to a JSON file and then call load_model().

trivialfis commented 10 months ago

Let me try to clear some of the questions before making any comments. ;-)

Does xgboost support passing a mixture of dense and sparse data together?

No, and it's unlikely in the near future. XGBoost mostly works with row-major data with some limited support for column-major (like CSC, cuDF), but partitioning data by column is a different story. One of the goals of categorical data support is to eliminate the need for dummy encoding.

Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.

No. The DMatrix object doesn't provide any means of data manipulation; it's just an internal storage class for a machine learning library. Any feature that is not directly related to ML (like feature engineering) is considered out of scope. There's a slice method for the cv function, which we would really like to remove. If the input data is a numpy array, use numpy to manipulate it.

Can sparse columns be taken as categorical without being densified?

Yes. Internally, XGBoost marks the maximum coded (like ordinal-encoding) categorical value and uses it as the number of categories in the respective column. The format of the column is orthogonal. We want to allow some extra information from users like a user-specified number of categories in the future, but this should not block the current R interface development. (we will use the feature_type field)

I am wondering if there should be a way to support these without loss of precision

No. Internally, the decision tree is always split using f32. In the long run, we can probably support f64 after the removal of binary serialization. But so far it hasn't been a priority since in most cases some normalization for the data can get us very far.

Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification?

You can query the objective using XGBoosterSaveConfig, which is exposed in the current R interface as xgb.config. It involves some JSON parsing though. You can cache the result in an R object during model loading and after model training if it's performance-sensitive.
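For illustration, a minimal sketch of that query from R (the exact JSON path to the objective name is an assumption about the config layout and may differ between versions):

```r
# Assumes `bst` is a trained booster. Depending on the package version,
# xgb.config() may return a JSON string or an already-parsed list.
cfg <- xgboost::xgb.config(bst)
if (is.character(cfg)) cfg <- jsonlite::fromJSON(cfg)
objective <- cfg$learner$objective$name   # e.g. "binary:logistic"
```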

How does the python interface keep the list of parameters up-to-date with the C interface?

Mostly by hand, unfortunately. "Mostly" since we have the native interface, which accepts a dictionary of parameters and there's no need to copy and paste the parameters for the native interface. The scikit-learn interface requires explicitly listed parameters for some scikit-learn magic to work (like Python inspect), and as a result, you see a list of parameters in the skl interface.

Or will it need to manually keep recollection of every possible allowed argument and its default value?

No need to explicitly set the default values. Default values are set inside libxgboost (C++); not passing the key into XGBoost means using the default value. The Python interface uses None as a value to denote a parameter that should not be passed into libxgboost.
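An R-side convention mirroring this could look as follows (a sketch only; using NULL as the "not set" marker is an assumption, not an existing xgboost rule):

```r
# Drop unset (NULL) entries so that libxgboost falls back to its own defaults.
params <- list(max_depth = 6, eta = NULL, objective = "reg:squarederror")
params <- Filter(Negate(is.null), params)   # eta is omitted -> the C++ default applies
str(params)
```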

How do other interfaces deal with idiosyncracies like base0 vs. base1 indexing?

I don't know, you will have to reach out for questions. @ExpandingMan :-)

How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster

We insist that one should not use pickle for long-term storage. An update to the cPython interpreter itself might make the pickle object invalid due to invalid instructions. However, we do maintain a best-effort policy in C++ to support it.

However, what happens in R/Python is beyond our scope. Please take a look at our document doc/tutorials/saving_model.rst, or just continue with the following answers, which should provide some more hints.

How does it deal with converting between Booster and scikit-learn classes?

get_booster, as suggested by @hcho3. In addition, the scikit-learn interface doesn't have much metadata and can be fully functional after loading a Booster object. Unlike scikit-learn, XGBoost distinguishes between model parameters and meta-parameters. The former are related to model output and shape, like the objective, number of trees, and number of features, while the latter are related to training and runtime, like the learning rate or number of threads. load_model makes sure the former is complete and discards the latter after training. We don't need the learning rate after training, and the number of threads should be discarded since we want the model to be portable across machines.

I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatility with future versions

Hopefully, the previous two answers clear this up as well.

Does the python interface or other interfaces keep extra metadata that the Booster wouldn't

Yes, the Python scikit-learn interface uses attributes. But it's not important; we can remove it anytime we want - it's legacy. As mentioned previously, we care only about the model, not the hyper-parameters. As long as the loaded model can do inference properly (hence contains the full model), we are somewhat satisfied.

Is it somehow standardized?

No. It took some time before we got where we are, gradually formalizing what should be saved. Please take a look at our document: doc/tutorials/saving_model.rst. save_model and load_model are guaranteed to work across bindings; other custom fields are not.

If you think a new field should work across bindings, please share, we can always make additions. A recent example is feature_types, which is saved into the JSON model for categorical data support.
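In R terms, the cross-binding route described here means persisting with the native serializer rather than saveRDS; a minimal sketch, assuming a reasonably recent xgboost version where the output format follows the file extension:

```r
# `bst` is a trained booster. The JSON model is readable by the Python/Julia/... bindings.
xgboost::xgb.save(bst, "model.json")
bst2 <- xgboost::xgb.load("model.json")
```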

How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it

At the moment, it's the row order; we can have extra fields later. It's a new feature in the recently announced 2.0, and we are still hoping to collect feedback - position-aware LTR is still a hotly researched area and nothing is settled.

Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix

I removed it very recently (if something is missing, I'm the one to blame. -.- ). I'm familiar with the internals of arrow for both CPU and GPU memory due to a different project. It chunks data for each column non-uniformly and has layered buffers. It's quite difficult for XGBoost to support such a data format without first performing a concatenation of all the chunks. It's probably a good design for data management frameworks like spark/dask, but not ideal for a traditional ML algorithm. On the other hand, if we perform a concatenation, there's no point in using arrow's C interface - we can just pass the memory buffer's address. We have one good example of how a contiguous (without chunks) arrow-based table can be passed into XGBoost using cuDF: it dumps out the memory addresses of each contiguous column as __cuda_array_interface__, and XGBoost can consume them with zero copy for inplace-predict and QuantileDMatrix.

trivialfis commented 10 months ago

Comments unrelated to questions. I will update this comment when I can put together some more thoughts.

If I'm not mistaken, CSR is supported for prediction but not for training

CSR is well supported for all purposes.

Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept

Depending on the use case, one can fix the intercept to a constant (like 0) value or let xgboost find one based on MLE.

david-cortes commented 10 months ago

Thanks for the answers.

So it sounds like a good way to approach serialization and transfers between interfaces would be to keep as many attributes as possible inside the C booster, then have a function that default-initializes them and fills them with info from the booster, if available, at the moment it gets deserialized. This unfortunately wouldn't work for things like R formulas, so in such cases one would strictly need to use saveRDS (as opposed to xgb.save).

As a preliminary step, I think it would make sense to extend the current optional attributes that are kept inside the C booster to include the categories of both y and the categorical columns of X.
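A minimal sketch of that idea using the existing attribute accessors; the attribute name "y_levels" is made up for illustration, and whether/how such a field would be standardized across bindings is exactly the open question:

```r
# Store R-side metadata inside the C booster so it survives xgb.save()/xgb.load().
# Assumes `bst` is a trained booster and `y` the factor response used for training.
xgb.attr(bst, "y_levels") <- as.character(jsonlite::toJSON(levels(y)))
# ...later, e.g. after loading the model in another session:
y_levels <- jsonlite::fromJSON(xgb.attr(bst, "y_levels"))
```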

Scikit-learn currently accepts pd.Categorical and uses the categories when passed as y in some estimators. It currently doesn't keep recollection of categories for categorical columns of X, but it has slowly been moving in the direction of leveraging more and more pandas in its estimators and transformers.

I think it could even be handy to keep levels of categorical features as part of the metadata when using the scikit-learn interface and passing X that has pandas categorical types, even if they are not currently used (but perhaps could later on be used as column names in outputs from the scikit-learn interface?).

(I've opened a separate issue for the y categories: https://github.com/dmlc/xgboost/issues/9739)

Note that this might not be as straightforward as it sounds, as pandas allows any arbitrary type to be used as categories, not just strings, whereas R only allows strings, so perhaps it could further have the restriction that the categories would be coerced to strings.

ExpandingMan commented 10 months ago

I didn't bother to read most of this thread, but I was pinged concerning 1-based indexing, as I am one of the maintainers of the Julia wrapper, Julia being a 1-based indexed language.

Nearly all Julia packages, whether wrappers or not, stick with purely 1-based index semantics. In the case of wrappers like XGBoost.jl, this simply means that the wrapper functions need to translate the indexing. There is a relatively low-level direct wrapper of the libxgboost C library that does not do this, but it is considered private to the wrapper package.

but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.

I'm not sure what you're referring to, but no, that's not the case. The aforementioned low-level C wrapper should not show up in the docs; libxgboost is small enough that the intention of XGBoost.jl is that nobody should ever have to use it directly.

trivialfis commented 10 months ago

keep as many attributes as possible inside of the C booster

If the object is R-specific, and R does such a good job of keeping saveRDS portable (in contrast to Python pickle) across versions, we can let saveRDS do its magic instead of saving them into the booster, right? The best code is the code that we don't have to maintain and yet works. ;-) Fun aside, I'm fine with either approach though; will leave it to you to decide.

I'm not sure what you're referring to, but no, that's not the case.

Thank you for clearing that up @ExpandingMan ! That helps a lot.

david-cortes commented 10 months ago

@mayer79 would be nice to hear some comments about the proposed interface here, especially around handling of formulas.

mayer79 commented 10 months ago

@david-cortes : I was a couple of days afk and will give my feedback soon. Thank you so much for starting this thread. We might get some inspiration from https://github.com/MathiasAmbuehl/modeltuner (no possibility to select how to encode categoricals - it just uses OHE), but otherwise it looks neat.

mayer79 commented 10 months ago

Curiously, I don't know the answers to the questions directed at me. But I think they are less important.

My current mood:

  1. Most of what has been suggested above means: rewrite the xgboost() high-level API and add a corresponding xgboost.cv() function. We could do this in the current package, without the need to maintain a second {xgboost2} package.
  2. Categorical support should only become the default as soon as SHAP values are available for it. The {shap} library has almost as many PyPI downloads as {xgboost}. Actually, I think that almost every new functionality should also make SHAP work.
  3. We can use a simple formula parser, see code below. No need for interactions, transformations etc. To control the categorical encoding of factors, we can use an argument encode_factors = c("integer", "target", "ohe"). Maybe this needs to be vectorized, so that users could specify encode_factors = c(Species = "integer", ...) to use different encodings per discrete feature.
  4. I actually like {ranger}'s interface: it accepts either formula + data or x + y. (I don't like its formula parser though).
  5. Monotonicity and interaction constraints could be passed as column names only, like: monotone_inc = c("Sepal.Width", ...), monotone_dec = c(...), interaction_constraints = list(c("Sepal.Width"), ...) (a small sketch of translating such names to positions follows below).
  6. In the high-level API, we would need to subset feature columns so that tibbles and data.tables are fine. This is usually not complicated.
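To illustrate point 5, a hypothetical helper (not existing xgboost API) that translates name-based constraints into the positional vector the current parameters expect:

```r
# `cols` would typically be colnames(x); columns not listed default to 0 (no constraint).
monotone_from_names <- function(cols, monotone_inc = character(), monotone_dec = character()) {
  out <- integer(length(cols))
  out[match(monotone_inc, cols)] <- 1L
  out[match(monotone_dec, cols)] <- -1L
  out
}
monotone_from_names(colnames(iris[, -5]), monotone_inc = "Sepal.Width")
# [1] 0 1 0 0
```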

Formula parser (dropping all transformations), just extracting column names:

formula_parser <- function(form, data) {
  # Split the formula into its character parts: form[2] is the response,
  # form[3] is the right-hand side (possibly containing ".").
  form <- as.character(form)
  y <- form[2L]
  # Let terms.formula() expand "." against the columns of `data`, then collect
  # all variable names appearing on the right-hand side.
  all_vars <- all.vars(
    stats::terms.formula(stats::reformulate(form[3L]), data = data[1L, ])
  )
  list(y = y, x = setdiff(all_vars, y))
}

formula_parser(Sepal.Width ~ ., data = iris)
# $y
# [1] "Sepal.Width"
# 
# $x
# [1] "Sepal.Length" "Petal.Length" "Petal.Width"  "Species"
david-cortes commented 10 months ago

Regarding point (1), under the proposal here, some of the implications are that:

Thus, it might not be necessary to create a new package, but then it'd be necessary to work with reverse dependencies to adapt.


Regarding point (2), I do not think that withholding categorical support for now would be beneficial. At some point, the core xgboost library will have wider support for things that involve categorical variables and will hopefully have it enabled by default. We haven't even started work on the R interface yet and we don't know when it will be ready.

Also, in my view, I would not consider SHAP support as a blocker - there's plenty of use-cases for xgboost that benefit from better handling of categorical features and which do not involve calculating SHAP values; and if needed, an external library could be used for it even if it might not be optimal.


Regarding the proposed parser for formulas: I do not feel that such an interface would bring value to users if it's only used to select variables.

In my opinion, one of the most useful aspects of formulas is that they can act as quick data preparation pipelines, which are helpful when e.g. prototyping models in an interactive session - for example:

In these cases, the extra processing would be lost, and the selection of variables could be achieved just as easily through e.g. dplyr::select or data.table[,.(...)] and passed to an x/y interface instead.

What's more, if the formula were to be used to one-hot-encode categorical variables, it may be less efficient than what a user could achieve by e.g. using a sparse encoder instead of base R's when appropriate.

Also, choosing the way of handling categorical variables like that would then raise the question of what happens when parameter max_cat_to_onehot is supplied, and I imagine that more such parameters might appear in the future (like cat_smooth in lightgbm).

As such, I am thinking that if a formula parser is too complex to implement and there isn't any readily available parser with a compatible license, it might be better to not develop a formula interface, or maybe leave it for a future version.

trivialfis commented 10 months ago

A side note, we support SHAP values with categorical data for the original SHAP algorithm (including interaction). The issue is the plotting library needs some additional work to read the model properly. We can work on the SHAP package after clearing up some priorities.

mayer79 commented 10 months ago

@trivialfis Nice! So we need categorical support in R indeed at higher priority. In R, shapviz already works with categoricals, e.g., for LightGBM. So at least the plotting of SHAP values is ready in R.

david-cortes commented 10 months ago

I just realized that the ALTREP system in newer R versions can be used (albeit in a rather hacky way) to manage custom serialization of arbitrary objects, provided that the user doesn't start interacting with and modifying those objects through R functions.

Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method like saveRDS instead of always materializing the serialized bytes on every model creation call; and to automatically re-create the C++ object and handle when calling readRDS without having to call xgb.Booster.complete, like the python package does with pickle's get/set state.

Here I've composed a small example of how to do this with externalptr objects: https://github.com/david-cortes/altreppedpointer (so far, haven't found any R package using this sort of mechanism though)

Note that this will require changing the class and the structure of the current handles, so a first step would be to make these classes internal-only as mentioned in the first message.

trivialfis commented 10 months ago

I just realized that the ALTREP system in newer R versions can be used

Thank you for sharing! That's quite interesting! The flexibility is a bit scary though, probably just due to my very limited understanding. I can learn more about the framework. I think if we were to use it, some documentation for future developers would be nice.

Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method

Indeed, that would be extremely helpful in keeping the states consistent.

As for error handling, I think the only concern is that XGB is not great at forward compatibility: loading a new model with an older version of XGB might result in errors (C API returning -1). OOM is a bit trickier on lazily-allocating OSes like Linux-based ones, where the exception might not be caught by XGB. But I guess it's a general issue for all packages with C++.

david-cortes commented 9 months ago

As we haven't received any further comments about these proposals, guess it's reasonable to start with a plan and checklist of things to do based on the current proposal.

As I see it, the following would be a high-level overview of what needs to be done:

Other topics from the earlier issue which aren't mentioned in the list above:


Would like to hear some comments about it, particularly around DMatrix-related topics, and would be very helpful if xgboost maintainers could work on the first issue in this list.

After that, guess we can create a roadmap issue to keep track of progress and discuss plans for working on these issues from different people. I think there's a GitHub functionality for it, but I've never used it and don't know how, so perhaps maintainers could help here.

trivialfis commented 9 months ago

Thank you for writing a comprehensive list! I'm really excited for the proposed changes here. Let me try to comment on some of the items considered to be high priority:

Create an easy-to-use script to create an R package or to otherwise re-arrange files

One can run R CMD INSTALL . under the package directory; devtools can load the package from a path as well. The script is not necessary for installing from source. I think it's possible to suppress the Python compilation. Lastly, if we were to rewrite it due to the use of Python, it would be great if we could use R instead of a Makefile or shell.

Parameters for categorical features

If memory serves, I think the call to set types is already implemented in the current code base. We need additional support for recognizing factor automatically. I can double check later and share how to set the types.

Fitting a model from a DMatrix created from a CSR matrix

I will debug it, do you have a specific example?

Add a mechanism to create a DMatrix from an R data.frame.

I will work on it, but would like to have some feedback from you as well. For example, is there any difference in the underlying arrays between a data.table and a data.frame?

After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people.

Feel free to pick whichever fits your workflow best. One can use to-do items, multiple issues, or a GitHub project to track progress. Tip: this is a to-do item

I think there's a button in GitHub to split an issue with multiple items into multiple issues. I can help create a GitHub classical project to keep track of issues if needed. Feel free to ping me.

david-cortes commented 9 months ago

One can run R CMD install . under the package directory, devtools can load the package from path as well.

Thanks for the hint. Can confirm that R CMD INSTALL . from R-package/ works as expected under the new build system, as does RStudio's analog, which uses devtools, and the variants for running tests and building docs.

I will debug it, do you have a specific example?

Actually, this seems to have been already fixed when creating a DMatrix first, but this will still error out:

library(xgboost)
library(Matrix)
data("mtcars")
y <- mtcars$mpg
x <- mtcars[, -1]
# Convert the predictors to a CSR matrix (dgRMatrix).
x.sp <- as(as.matrix(x), "RsparseMatrix")
# Passing the CSR matrix directly to the old xgboost() wrapper still errors out,
# while going through xgb.DMatrix() first works.
model <- xgboost(data=x.sp, label=y)

Nevertheless, since we want to remove the code behind this function, I guess it's not worth fixing at this point.

I will work on it, but would like to have some feedbacks from you as well. Like is there any difference in the underlying arrays between table and frame?

There is no difference in the arrays: both are just a list (an R generic-vector object, which in C has internal type VECSXP and whose elements are retrieved through VECTOR_ELT) containing other SEXP objects whose data pointers are retrievable through e.g. REAL(x), INTEGER(x), etc. Note that there's also a popular package bit64 that provides int64_t arrays.

Objects from data.table do have some differences, though, in that they can contain lists of lists and other types, but those are not supported by xgboost anyway; they also differ in other small details which shouldn't matter for DMatrix, such as potentially having hashes for some columns or not having row names.

In terms of creating DMatrix from data.frame, I guess the most reasonable route would be to only accept columns of types numeric + integer + factor + bit64::integer64, leaving potentially character for the high-level interface. Note that float::float32 cannot be put inside a data.frame, unlike bit64.
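A small sketch of that type restriction (illustrative only; no such check exists in the package, and the accepted set of types is exactly what is being proposed here):

```r
library(bit64)
df <- data.frame(
  num = c(1.5, 2.5, 3.5),
  int = 1:3,
  cat = factor(c("a", "b", "a")),
  big = bit64::as.integer64(c(2^40, 2^41, 2^42))
)
# TRUE for every column that the proposed DMatrix-from-data.frame path would accept.
ok <- vapply(
  df,
  function(col) is.numeric(col) || is.factor(col) || inherits(col, "integer64"),
  logical(1)
)
stopifnot(all(ok))
```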

I will create a separate roadmap issue with checklists to discuss work on these issues and progress.

david-cortes commented 9 months ago

A couple things I am wondering:

A. xgboost supports the sorted indices splitting algorithm (tree_method="exact").

Under this algorithm, infinite values should simply be treated as greater or smaller than every other value, and since the raw values aren't used in any other statistic, it should be possible to fit a model with them.

But it seems that xgboost will not accept infinite values regardless in the DMatrix construction. Are infinite non-missing values never accepted in input features?
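A small reproduction of the behavior described in point A (as of the versions discussed in this thread; the exact error message is not asserted here):

```r
library(xgboost)
x <- matrix(c(1, 2, Inf, 4, 5, 6), nrow = 3)
y <- c(0.5, 1.0, 1.5)
# Expected to fail at DMatrix construction because of the non-missing Inf value.
res <- tryCatch(xgb.DMatrix(data = x, label = y), error = function(e) conditionMessage(e))
print(res)
```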


B. I see that there are functions like XGBoosterPredictFromDMatrix for which part of the output is a pointer to an array of values. In the R interface, it doesn't look like the pointer that gets passed and filled inside the call would be freed after.

Does that kind of function allocate arrays inside of the booster pointer? If so, does it mean that predicting on larger inputs makes the model heavier in memory?


C. I see there's some parallelization in the R wrapper for things like converting signed integers to unsigned.

Part of the current operations include retrieving a pointer to the data in every iteration of the loop, but if that part were to be removed by pre-retrieving the pointer, would that parallelization actually result in faster operations?

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast that it's basically limited by the speed at which memory is retrieved from RAM, and having multiple threads retrieving from RAM shouldn't make it faster.

trivialfis commented 9 months ago

But it seems that xgboost will not accept infinite values regardless in the DMatrix construction

I disabled the inf input due to lack of tests and other algorithms running into issues; to properly support it we need test cases for all potential uses, including things like SHAP, categorical features, etc. At the moment, it's not a very high priority. A use case would help.

it doesn't look like the pointer that gets passed and filled inside the call would be freed after

It's using thread-local static memory as a return buffer for thread-safe inference. The system will free it.

static thread_local int myvar;

As far as I understand, for a regular amd64 platform, that sort of operation should be so fast

I agree. Feel free to remove them.

but if that part were to be removed by pre-retrieving the pointer

I think we can entirely remove the data copies by using the array interface, which is also supported by SetInfo. As a bonus, it reduces memory usage as well, thanks to avoiding the allocation of a new buffer. Memory usage has been a huge headache, since some users are quite used to deep-learning-like batch training and find it frustrating that gradient boosting needs the entire dataset in memory to be efficient. Most of the type-specific C APIs are legacy now.

david-cortes commented 9 months ago

Another question: are there any guidelines for using SIMD code in xgboost?

I see for example there's a function in the R wrapper that substracts 1 from elements in an array: https://github.com/dmlc/xgboost/blob/95af5c074be363bddc439fa3d1c1460a29cc818e/R-package/src/xgboost_R.cc#L286

I was thinking of adding #pragma omp simd, but was wondering if perhaps xgboost includes some SIMD module with dynamic dispatching per CPU instruction set or similar.

trivialfis commented 9 months ago

No, we don't have those. Feel free to add it.

I thought about having an aligned allocator, but dropped the work as my attempt at using SSE did not speed up the histogram kernel. Another attempt was to use it to speed up allreduce on CPU; the difference was quite insignificant in my benchmark, so it was dropped again.

trivialfis commented 9 months ago

Hi @david-cortes , are you familiar with the error from the CRAN check caused by calling error/Rf_error:

  xgboost_R.cc:70:3: warning: format string is not a string literal (potentially insecure) [-Wformat-security]

I'm not sure how to fix it without removing the use of the error function.

https://cran.r-project.org/web/checks/check_results_xgboost.html

Update I will try Rf_error("%s", msg) with rhub.

jameslamb commented 9 months ago

Rf_error("%s", msg) worked for us over in LightGBM when faced with the same warning:

https://github.com/microsoft/LightGBM/pull/6216

We have a CI job there that replicates the Debian clang r-devel build from CRAN.

david-cortes commented 9 months ago

Yes, that seems to be the usual clang formatting warning, which is solved by using "%s", <str> instead of just <str>. That %s is also the solution that's being proposed at Rcpp: https://github.com/RcppCore/Rcpp/issues/1287

david-cortes commented 9 months ago

@trivialfis In the previous thread, you mentioned this as one of the desired things to implement for the R interface:

quantile regression with multiple quantiles.

Is this supported in the python interface? Looks like it errors out if you try to pass multiple quantiles.

If this is supported in python, could you provide an example of what this means?

david-cortes commented 9 months ago

Another question: does xgboost support multi-output survival objectives? Looks like DMatrix won't take multi-column inputs in the C interface for bounds.

I understand multi-CoxPH would not be a trivial thing to do given the interdependencies between rows, and AFT would require a non-trivial multivariate Newton solver for the global parameters with interdependencies between columns (should be quite feasible though), but what if the user were to provide a custom survival objective that works with multiple columns?

trivialfis commented 9 months ago

If this is supported in python, could you provide an example of what this means?

The example in the guide-python demos should work.

As for future support of survival training with built-in objective, for the old cox regression, it should not be too difficult. But for the set of AFT objectives with lower and upper bounds, I need to take a deeper look before answering.

As for custom objectives, XGB supports a custom objective with a matrix gradient (along with the usual vector); as long as you can formulate your problem to return the right shape of gradient, XGBoost is happy to ignore what's in the label.

trivialfis commented 9 months ago

Before we figure out what to do with the CRAN package, do you think it's appropriate to submit XGBoost to r-universe so that more people can access the new interface with iterative development?

david-cortes commented 9 months ago

Before we figure out what to do with the CRAN package, do you think it's appropriate to submit XGBoost to r-universe so that more people can access the new interface with iterative development?

Since it's a work in progress that's seeing continuous updates and hasn't yet gotten the major ones, I don't think it makes sense yet to release to R universe. I am guessing that people who want the latest would anyway go to github first.

david-cortes commented 9 months ago

Couple other questions:

david-cortes commented 9 months ago

Yet another question: is there a way to serialize a DMatrix without creating a file? I see that there's a function XGDMatrixSaveBinary, but I cannot find any equivalent for saving that to char bytes.

Would be useful to have such a function, as then both the R and Python interfaces would be able to serialize such objects (which is helpful, for example, when one wishes to save an R session to resume it later).

I see the python interface currently would throw this error:

ctypes objects containing pointers cannot be pickled

so it could also benefit from having __getstate__ / __setstate__ methods.

trivialfis commented 9 months ago

Since it's a work in progress that's seeing continuous updates and hasn't yet gotten the major ones

Sounds good. We can discuss it then. Thank you for the continuous updates so far! Looks like there's new life in the R package. ;-)

Is there some scenario where xgboost would accept a vector

No. Not until we can remove the old binary serialization format. https://github.com/dmlc/xgboost/issues/7547

Is there some scenario where an array with more than 2 dimensions

Not yet. We don't have a use case yet. Maybe in the future, we will have multi-target-multi-quantile quantile regression, or some multi-target forecasting with multi-parameter distributions.

is there a way to serialize a DMatrix without creating a file?

No, is it necessary to serialize the DMatrix? Can we recreate it from source data when needed? As previously discussed, we would like to keep the scope of this class as small as possible.

I see the python interface currently would throw this error:

So far I deliberately keep it this way so that Python doesn't secretly serialize it, which can cause troubles in distributed environments when the framework is throwing data around workers.

david-cortes commented 9 months ago

I can certainly see how adding serialization to DMatrix could potentially lead to bad behaviors where the data is de-serialized twice or similar, but at least in R, there are some cases where this might be desirable even if it comes with a performance impact:

Now, the user can always just serialize the data source and re-create the DMatrix, and that's most likely going to be the most optimal route, but could be a surprise to a user who isn't familiar with xgboost.

Also, if I understand it correctly, if QuantileDMatrix were to be implemented, and a user wishes to share these between processes in a parallel backend, wouldn't serialization of that QuantileDMatrix be faster than serializing the whole data as-is?

trivialfis commented 9 months ago

but could be a surprise to a user who isn't familiar with xgboost.

If the R package is transitioning into a high-level interface, the user is unlikely to interact with DMatrix, right? The class is more or less an implementation detail rather than a user-controlled object. For instance, the sklearn interface in XGBoost dispatches between DMatrix and QuantileDMatrix internally inside the fit method without users knowing the class.

david-cortes commented 9 months ago

That's a good point. Let's then better leave it without char serialization to avoid inefficiencies.

trivialfis commented 9 months ago

One question: on Python, we are quite sensitive to memory usage, since users expect libraries to work with huge data inputs, and so we have done much work to reduce memory usage, sometimes sacrificing a bit of performance. Both QuantileDMatrix and inplace_predict are for memory usage reduction. However, on R, I think many users (I might be wrong) have different priorities. From the introductory books I read about R, memory usage and performance are of lesser concern, and the presentation of mathematical objects is much more important. For instance, the tidyverse proudly works with small data. As a result, one doesn't need to pick up all these optimizations from Python, or at least doesn't need them in the first couple of iterations. I'm curious about your thoughts @david-cortes @jameslamb .

david-cortes commented 9 months ago

I guess you are right that RStudio-developed packages tend not to be geared towards computational efficiency, and the average R user is probably not dealing with large datasets either, but there are still subsets of users striving to develop computationally efficient R code for large data, as evidenced by the popularity of R packages like DuckDB and arrow.

Even in base R, things have been moving towards memory efficiency lately through e.g. the ALTREP system which is further allowing packages like arrow to avoid memory copies and such.

That being said, I would consider memory efficiency in the R package to be a much lower priority than features that affect the results of a model.

jameslamb commented 9 months ago

I completely agree with @david-cortes 's summary.

In situations where you're forced to choose between memory efficiency and usability (i.e. compatibility with other R libraries or base R), I think it's totally defensible to choose usability.

trivialfis commented 9 months ago

There are two methods I found useful for the DMatrix class that are not yet exposed in R:

david-cortes commented 9 months ago

There are two methods I found useful for the DMatrix class that are not yet exposed in R:

  • XGDMatrixNumNonMissing: Get the number of non-missing values.
  • XGDMatrixGetDataAsCSR: Return the DMatrix as a CSR matrix.

They are convenient for testing purposes. Do you think we should expose them given that we are transitioning into a high-level interface?

Sure, it looks like they should be quite easy to implement.

trivialfis commented 9 months ago

Got it, will add them to the roadmap and work on it.

david-cortes commented 9 months ago

I'm having a hard time wrapping my head around how QuantileDMatrix creation works and what is or isn't required to pass to them.

Would it be possible to have C functions to create these QuantileDMatrices from array_interface strings for dense/csr data sources + required parameters? Is there some limitation that would make such an approach not work?

trivialfis commented 9 months ago

Let's continue the discussion around QDM in https://github.com/dmlc/xgboost/pull/9864 .

david-cortes commented 9 months ago

Question: when using categorical features, would the DMatrix passed to C prediction functions need to have feature_types set on it? What would happen if it is set in the prediction DMatrix, but these feature types differ from the ones that were used for training?

david-cortes commented 9 months ago

Yet another question: is there something currently preventing using field feature_type in the R interface?

Something like this seems to work right now, but I don't know if the feature types are actually getting used:

library(xgboost)
data(iris)
y <- iris$Sepal.Length
x <- iris[, -1]
# Encode the factor column as numeric codes before building the matrix.
x$Species <- as.numeric(x$Species)
x <- as.matrix(x)
dm <- xgb.DMatrix(data=x, label=y)
# Mark the last column as categorical ("c"), the rest as quantitative ("q").
setinfo(dm, "feature_type", c("q", "q", "q", "c"))
model <- xgb.train(
    data=dm,
    nrounds=5
)
pred_mat <- predict(model, dm)

david-cortes commented 9 months ago

A couple of perhaps stupid questions that I'm wondering about after issue https://github.com/dmlc/xgboost/issues/9869, and which I would like to somehow add to the docs:

trivialfis commented 9 months ago

Let me add documentation instead of writing it in GH comments; it's always nice to have fresh eyes catch where documents are lacking.

trivialfis commented 8 months ago

It seems CMake is packaged into rtools43 for Windows? I found this while wrangling with MSVC:

C:\rtools43\usr\lib\mxe\usr\x86_64-w64-mingw32.static.posix\bin\cmake.exe

david-cortes commented 8 months ago

I'm not sure if RTools itself bundles CMake, but it should be possible to have it as a dependency under 'SystemRequirements' and then CRAN will ensure that it is available when it produces its pre-compiled binaries.

There's a section about CMake in the R extensions manual, but I'm not sure what's the right way to get an R package to use CMake: https://cran.r-project.org/doc/manuals/R-exts.html#Using-cmake

trivialfis commented 8 months ago

Thank you for sharing! I appreciate the reference. As long as CRAN is comfortable with the use of CMake, I'm sure there's a way to sort out all the technical details.