david-cortes opened 10 months ago
I am the person who added the section a-compatibility-note-for-saveRDS-save to the manual, so let me answer the question regarding serialization.
How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster?
Does the python interface or other interfaces keep extra metadata that the Booster wouldn't?
The sklearn interface of XGBoost extracts the necessary key-value pairs from the object attributes, serializes them as a JSON string, and stores the JSON string in the attributes field of the C Booster object.
Admittedly this approach is not foolproof. This is why we ask users to save the model using XGBoost's native serializer (load_model() / save_model()). It must be noted that the base scikit-learn package provides no guarantee of backward compatibility for pickled model objects. Hence, Python users are accustomed to using pickle only for short-term storage and other formats for long-term storage.
It appears that the expectations are somewhat different in the R community. After seeing lots of bug reports arising from calling saveRDS on old model files (#5794), I had to add the big warning about saveRDS in a separate section a-compatibility-note-for-saveRDS-save.
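To make that recommendation concrete, here is a small R sketch contrasting the two serialization routes (this assumes the current R interface and an xgboost version whose native serializer supports the JSON format; the file names are arbitrary):

```r
# Sketch: prefer xgboost's native serializer over saveRDS for long-term
# storage, per the compatibility note discussed above.
library(xgboost)

data(agaricus.train, package = "xgboost")
bst <- xgboost(
  data = agaricus.train$data, label = agaricus.train$label,
  nrounds = 2, objective = "binary:logistic", verbose = 0
)

# Native format: intended to be stable across xgboost versions.
xgb.save(bst, "model.json")
bst2 <- xgb.load("model.json")

# R-level snapshot: convenient, but carries no cross-version guarantee.
saveRDS(bst, "model.rds")
```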
I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.
Checking the version of the C Booster is sufficient, because we always bump the version of the native package and the R package in lockstep. For example, when the native package had its version bumped from 1.3.0 to 1.4.0, the R package had its version bumped from 1.3.0.1 to 1.4.0.1.
How does it deal with converting between Booster and scikit-learn classes?
You can extract the Booster class by calling the get_booster() method of XGBClassifier / XGBRegressor. To convert a Booster to XGBClassifier / XGBRegressor, first persist the Booster to a JSON file and then call load_model().
Let me try to clear some of the questions before making any comments. ;-)
Does xgboost support passing a mixture of dense and sparse data together?
No, and it's unlikely to happen in the near future. XGBoost mostly works with row-major data, with some limited support for column-major formats (like CSC, cuDF), but partitioning data by column is a different story. One of the goals of categorical data support is to eliminate the need for dummy encoding.
Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.
No. The DMatrix object doesn't provide any means of data manipulation; it's just an internal storage class for a machine learning library. Any feature that is not directly related to ML (like feature engineering) is considered out of scope. There's a slice method for the cv function, which we would really like to remove. If the input data is a numpy array, use numpy to manipulate it.
Can sparse columns be taken as categorical without being densified?
Yes. Internally, XGBoost marks the maximum coded (ordinal-encoded) categorical value and uses it as the number of categories in the respective column. The format of the column is orthogonal to this. We want to allow some extra information from users, like a user-specified number of categories, in the future, but this should not block the current R interface development. (We will use the feature_type field.)
I am wondering if there should be a way to support these without loss of precision
No. Internally, the decision tree is always split using f32. In the long run, we can probably support f64 after the removal of binary serialization. But so far it hasn't been a priority, since in most cases some normalization of the data can get us very far.
Should the R interface keep track of the objectives and their types, or should, for example, the C interface have functionality to determine if an objective is e.g. binary classification?
You can query the objective using XGBoosterSaveConfig, which is exposed in the current R interface as xgb.config. It involves some JSON parsing though. You can cache the result in an R object during model loading and after model training if it's performance-sensitive.
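As a concrete sketch of that route (assuming the current R interface's xgb.config() and the jsonlite package; the exact field layout comes from XGBoosterSaveConfig's JSON and may change between versions):

```r
# Sketch: query the objective from a fitted booster via the JSON config.
library(xgboost)
library(jsonlite)

data(agaricus.train, package = "xgboost")
bst <- xgboost(
  data = agaricus.train$data, label = agaricus.train$label,
  nrounds = 2, objective = "binary:logistic", verbose = 0
)

conf <- xgb.config(bst)
# Depending on the version, xgb.config may return the raw JSON string.
if (is.character(conf)) conf <- jsonlite::fromJSON(conf)
# Field path follows the config produced by XGBoosterSaveConfig.
conf$learner$objective$name
```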
How does the python interface keep the list of parameters up-to-date with the C interface?
Mostly by hand, unfortunately. "Mostly", since we have the native interface, which accepts a dictionary of parameters, so there's no need to copy and paste the parameters for the native interface. The scikit-learn interface requires explicitly listed parameters for some scikit-learn magic to work (like Python inspect), and as a result you see a list of parameters in the skl interface.
Or will it need to manually keep recollection of every possible allowed argument and its default value?
No need to explicitly set the default value. Default values are set inside libxgboost (C++); not passing the key into XGBoost means using the default value. The Python interface uses None as a value to denote a parameter that should not be passed into libxgboost.
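An R interface could mirror this convention with NULL. A hypothetical sketch (the helper name is made up for illustration, not existing xgboost code):

```r
# Hypothetical sketch: use NULL for "not specified" and drop NULL entries
# before handing the parameter list to the native library, so that
# libxgboost applies its own defaults for anything not passed.
drop_unset_params <- function(params) {
  Filter(Negate(is.null), params)
}

params <- list(max_depth = 6, eta = NULL, nthread = NULL)
drop_unset_params(params)
# $max_depth
# [1] 6
```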
How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing?
I don't know, you will have to reach out for questions. @ExpandingMan :-)
How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster
We insist that one should not use pickle for long-term storage. An update to the CPython interpreter itself might make the pickle object invalid due to invalid instructions. However, we do maintain a policy in C++ to support it on a best-effort basis.
What happens in R/Python, however, is beyond our scope. Please take a look at our document doc/tutorials/saving_model.rst, or just continue with the following answers, which should provide some more hints.
How does it deal with converting between Booster and scikit-learn classes?
get_booster, as suggested by @hcho3. In addition, the scikit-learn interface doesn't have much metadata and can be fully functional by loading a Booster object. Unlike scikit-learn, XGBoost distinguishes between model parameters and meta-parameters. The former are related to model output and shape, like the objective, number of trees, and number of features, while the latter are related to training and runtime, like the learning rate and number of threads. load_model makes sure the former is complete and discards the latter after training. We don't need the learning rate after training, and the number of threads should be discarded since we want the model to be portable across machines.
I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions
Hopefully, the previous two answers clear this up as well.
Does the python interface or other interfaces keep extra metadata that the Booster wouldn't
Yes, the Python scikit-learn interface uses attributes. But it's not important; we can remove it anytime we want, as it's legacy. We care only about the model, not the hyper-parameters, as mentioned previously. As long as the loaded model can do inference properly (hence contains the full model), we are somewhat satisfied.
Is it somehow standardized?
No. It took some time before we got to where we are, gradually formalizing what should be saved. Please take a look at our document: doc/tutorials/saving_model.rst. save_model and load_model are guaranteed to work across bindings; other custom fields are not.
If you think a new field should work across bindings, please share; we can always make additions. A recent example is feature_types, which is saved into the JSON model for categorical data support.
How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else, like a specific column name for it
At the moment it's the row order; we can add extra fields later. It's a new feature in the recently announced 2.0, and we are still hoping to collect feedback, as position-aware LTR is still a hotly researched area and nothing is settled.
Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix
I removed it very recently. (If something is missing, I'm the one to blame. -.-) I'm familiar with the internals of arrow for both CPU and GPU memory due to a different project. It chunks data for each column non-uniformly and has layered buffers. It's quite difficult for XGBoost to support such a data format without first performing a concatenation of all the chunks. It's probably a good design for data management frameworks like spark/dask, but not ideal for a traditional ML algorithm. On the other hand, if we perform a concatenation, there's no point in using arrow's C interface; we can just pass the memory buffer's address. We have one good example of how a contiguous (without chunks) arrow-based table can be passed into XGBoost using cuDF: it dumps out the memory addresses of each contiguous column as __cuda_array_interface__, and XGBoost can consume them with zero copy for inplace-predict and QuantileDMatrix.
Comments unrelated to questions. I will update this comment when I can put together some more thoughts.
If I'm not mistaken, CSR is supported for prediction but not for training
CSR is well supported for all purposes.
Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept
Depending on the use case, one can fix the intercept to a constant (like 0) value or let xgboost find one based on MLE.
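For the first option, fixing the intercept to a constant can be done through the base_score parameter (a sketch using the existing global-bias parameter; whether this exactly matches "fitting without an intercept" for a given objective is version- and objective-dependent):

```r
# Sketch: fix the model's intercept (global bias) via base_score instead
# of letting xgboost estimate one.
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# base_score = 0 pins the global bias; predictions then come purely
# from the trees.
params <- list(objective = "reg:squarederror", base_score = 0)
bst <- xgb.train(params, dtrain, nrounds = 2)
```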
Thanks for the answers.
So it sounds like a good way to approach serialization and transfers between interfaces would be to keep as many attributes as possible inside of the C booster, then have a function that default-initializes them and fills them with info from the booster, if available, at the moment it gets deserialized. This unfortunately wouldn't work for things like R formulas, so one would strictly need to use saveRDS (as opposed to xgb.save) in such cases, however.
As a preliminary step, I think it would make sense to extend the current optional attributes that are kept inside of the C booster to include the categories of both y and the categorical columns of X.
Scikit-learn currently accepts pd.Categorical and uses the categories when passed as y in some estimators. It currently doesn't keep a record of the categories for categorical columns of X, but it has slowly been moving in the direction of leveraging more and more pandas in its estimators and transformers.
I think it could even be handy to keep the levels of categorical features as part of the metadata when using the scikit-learn interface and passing X that has pandas categorical types, even if they are not currently used (but perhaps could later on be used as column names in outputs from the scikit-learn interface?).
(I've opened a separate issue for the y categories: https://github.com/dmlc/xgboost/issues/9739)
Note that this might not be as straightforward as it sounds, as pandas allows any arbitrary type to be used as categories, not just strings, whereas R only allows strings, so perhaps it could further have the restriction that the categories be coerced to strings.
I didn't bother to read most of this thread, but I was pinged concerning 1-based indexing, as I am one of the maintainers of the Julia wrapper, Julia being a 1-based indexed language.
Nearly all Julia packages, whether wrappers or not, stick with purely 1-based index semantics. In the case of wrappers like XGBoost.jl, this simply means that the wrapper functions need to translate the indexing. There is a relatively low-level direct wrapper of the libxgboost C library that does not do this, but it is considered private to the wrapper package.
but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.
I'm not sure what you're referring to, but no, that's not the case. The aforementioned low-level C wrapper should not show up in the docs; libxgboost is small enough that the intention of XGBoost.jl is that nobody should ever have to use that.
keep as many attributes as possible inside of the C booster
If the object is R-specific, and R does such a good job of keeping saveRDS portable across versions (in contrast to Python pickle), we can let saveRDS do its magic instead of saving them into the booster, right? The best code is the code that we don't have to maintain and yet it works. ;-) Fun aside, I'm fine with either approach though; I'll leave it to you to decide.
I'm not sure what you're referring to, but no, that's not the case.
Thank you for clearing that up @ExpandingMan ! That helps a lot.
@mayer79 would be nice to hear some comments about the proposed interface here, especially around handling of formulas.
@david-cortes : I was a couple of days afk and will give my feedback soon. Thank you so much for starting this thread. We might get some inspiration from https://github.com/MathiasAmbuehl/modeltuner (no possibility to select how to encode categoricals - it just uses OHE), but otherwise it looks neat.
Curiously, I don't know the answers to the questions directed at me. But I think they are less important.
My current mood:
- … the xgboost() high-level API and add a corresponding xgboost.cv() function. We could do this in the current package, without the need of maintaining a second {xgboost2} package.
- … encode_factors = c("integer", "target", "ohe"). Maybe this needs to be vectorized, so that users could specify encode_factors = c(Species = "integer", ...) to use different encodings per discrete feature.
- … formula + data or x + y. (I don't like its formula parser though.)
- … monotone_inc = c("Sepal.Width", ...), monotone_dec = c(...), interaction_constraints = list(c("Sepal.Width"), ...).

Formula parser (dropping all transformations), just extracting column names:
```r
formula_parser <- function(form, data) {
  form <- as.character(form)
  y <- form[2L]
  all_vars <- all.vars(
    stats::terms.formula(stats::reformulate(form[3L]), data = data[1L, ])
  )
  list(y = y, x = setdiff(all_vars, y))
}

formula_parser(Sepal.Width ~ ., data = iris)
# $y
# [1] "Sepal.Width"
#
# $x
# [1] "Sepal.Length" "Petal.Length" "Petal.Width"  "Species"
```
Regarding point (1), under the proposal here, some of the implications are that:
- xgboost() would return a different class from xgb.train() (like in the python package, where the scikit-learn class is different from the Booster class), which in turn means objects returned from it would have a different predict method with a different signature - i.e. if a reverse dependency uses xgboost() and calls predict with some argument, it would lead to breakage.
- … xgboost() or xgb.train(), which could potentially lead to breakage.

Thus, it might not be necessary to create a new package, but then it'd be necessary to work with reverse dependencies to adapt.
Regarding point (2), I do not think that withholding categorical support for now would be beneficial. At some point, the core xgboost library will have wider support for things that involve categorical variables and will hopefully have it enabled by default. We haven't even started work on the R interface yet and we don't know when it will be ready.
Also, in my view, I would not consider SHAP support as a blocker - there's plenty of use-cases for xgboost that benefit from better handling of categorical features and which do not involve calculating SHAP values; and if needed, an external library could be used for it even if it might not be optimal.
Regarding the proposed parser for formulas: I do not feel that such an interface would bring value to users if it's only used to select variables.
In my opinion, one of the most useful aspects of formulas is that they can act as quick data preparation pipelines, which are helpful when e.g. prototyping models in an interactive session - for example:
y ~ I(count_of_A + count_of_B + count_of_C) + I(measure1/measure2) + ...
y ~ height + log(weight) + I(mass/height^2)
log(y) ~ .
I((var1 - var2)/var3) ~ abs(var4) + var5 + ...
In these cases, the extra processing would be lost, and the selection of variables could be achieved just as easily through e.g. dplyr::select or data.table[, .(...)] and passed to an x/y interface instead.
What's more, if the formula were to be used to one-hot-encode categorical variables, it may be less efficient than what a user could achieve by e.g. using a sparse encoder instead of base R's when appropriate.
Also, choosing the way of handling categorical variables like that would then raise the question of what happens when the parameter max_cat_to_onehot is supplied, and I imagine that more such parameters might appear in the future (like cat_smooth in lightgbm).
As such, I am thinking that if a formula parser is too complex to implement and there isn't any readily available parser with a compatible license, it might be better to not develop a formula interface, or maybe leave it for a future version.
As a side note, we support SHAP values with categorical data for the original SHAP algorithm (including interactions). The issue is that the plotting library needs some additional work to read the model properly. We can work on the SHAP package after clearing up some priorities.
@trivialfis Nice! So we need categorical support in R indeed at higher priority. In R, shapviz already works with categoricals, e.g., for LightGBM. So at least the plotting of SHAP values is ready in R.
I just realized that the ALTREP system in newer R versions can be used (albeit in a rather hacky way) to manage custom serialization of arbitrary objects, provided that the user doesn't start interacting with and modifying those objects through R functions.
It would be ideal if xgboost handles could be made to only trigger a serialization when calling a method like saveRDS, instead of always materializing the serialized bytes on every model creation call; and to automatically re-create the C++ object and handle when calling readRDS, without having to call xgb.Booster.complete, like the python package does with pickle's get/set state.
Here I've composed a small example of how to do this with externalptr objects: https://github.com/david-cortes/altreppedpointer
(So far, I haven't found any R package using this sort of mechanism though.)
Note that this will require changing the class and the structure of the current handles, so a first step would be to make these classes internal-only, as mentioned in the first message.
I just realized that the ALTREP system in newer R versions can be used
Thank you for sharing! That's quite interesting! The flexibility is a bit scary though, probably just due to my very limited understanding. I can learn more about the framework. I think if we were to use it, some documentation for future developers would be nice.
Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method
Indeed, that would be extremely helpful in keeping the states consistent.
As for error handling, I think the only concern is that XGB is not great at forward compatibility: loading a new model with an older version of XGB might result in errors (the C API returning -1). OOM is a bit trickier on lazy OSes like Linux-based ones, where the exception might not be caught by XGB. But I guess it's a general issue for all packages with C++.
As we haven't received any further comments about these proposals, guess it's reasonable to start with a plan and checklist of things to do based on the current proposal.
As I see it, the following would be a high-level overview of what needs to be done:
- Create an easy-to-use script to create an R package or to otherwise re-arrange files (… .pyc files which it will complain about overwriting), etc. Having something like the old makefile script to build an R package would certainly make things easier and faster.
- Parameters for categorical features (… XGDMatrixSetStrFeatureInfo or similar).
- Fitting a model from a DMatrix created from a CSR matrix (currently errors out for fitting but not for predicting).
- Add a mechanism to create a DMatrix from an R data.frame. I guess the most efficient route here would be to call the C function XGDMatrixCreateFromDT, but I get the impression from the python interface that this is not the method that ends up getting used when constructing from a pandas DataFrame, so I want to ask if this is the recommended way for a data structure like R's (basically a list of arrays which are not in one single contiguous memory block).
- … DMatrix object from arrow objects. Not sure how feasible this is though, and I'd say this is very low priority.
- … QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float32). I'd also consider this one very low priority though, and it's probably quite a lot more code than the DMatrix-related tasks.
- … the DMatrix creation function for R matrices towards the C function that uses array_interface. ~~The cv function could also be switched to use this same interface but with a pointer offset for matrices instead of using DMatrix slice methods. I got from the previous messages that you guys wanted to remove the DMatrix slice methods, so if this R usage form is the only thing keeping them, perhaps they could also be eliminated afterwards.~~ Actually this is wrong, given that the indexes will not always reflect memory-contiguous slices of a single shuffle.
- … the predict function for the current booster to use inplace predict or other more efficient DMatrix creators (like XGDMatrixCreateFromDT mentioned earlier) when appropriate.
- … the Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
- … xgb.Booster.complete (which wouldn't be needed anymore).
- … DMatrix objects too, which I guess is just a matter of calling XGDMatrixSaveBinary, but I want to ask if that's all it takes or whether it needs to e.g. keep the original R data objects (like matrix) and restore pointer addresses to that data or similar, as I see that the python class DMatrix doesn't have __getstate__/__setstate__.
- … the xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
- … data.frame and categorical features is added, then create a new xgboost() function from scratch that wouldn't share any code base with the current function named like that, ideally working as a higher-level wrapper over DMatrix + xgb.train but implementing the kind of idiomatic R interface (x/y only, no formula) described in this thread, either with a separate function for the parameters or everything being passed in the main function.
- … xgb.train (perhaps the class could be named xgboost).
- … predict method, again with a different interface than the booster's predict, as described in the first message here.
- … the xgboost() x/y interface gets implemented, then modify other functions to accept these objects, e.g.:
  - … the xgboost() function.
  - … a cv function for the new xgboost() interface. I'd say this is low priority though.

Other topics from the earlier issue which aren't mentioned in the list above:
- … DataIter.
- … xgboost, and I'd also consider it a lower priority than the others.

Would like to hear some comments about it, particularly around DMatrix-related topics, and it would be very helpful if xgboost maintainers could work on the first issue in this list.
After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people. I think there's a github functionality for it but I've never used it and don't know how to so perhaps maintainers could help here.
Thank you for writing a comprehensive list! I'm really excited for the proposed changes here. Let me try to comment on some of the items considered to be high priority:
Create an easy-to-use script to create an R package or to otherwise re-arrange files
One can run R CMD install . under the package directory; devtools can load the package from a path as well. The script is not necessary for installing from source. I think it's possible to suppress the Python compilation. Lastly, if we were to rewrite it due to the use of Python, it would be great if we could use R instead of a makefile or shell script.
Parameters for categorical features
If memory serves, I think the call to set types is already implemented in the current code base. We need additional support for recognizing factors automatically. I can double-check later and share how to set the types.
Fitting a model from a DMatrix created from a CSR matrix
I will debug it, do you have a specific example?
Add a mechanism to create a DMatrix from an R data.frame.
I will work on it, but would like to have some feedback from you as well. For example, is there any difference in the underlying arrays between table and frame?
After that, guess we can create a roadmap issue to keep track of progress and discuss about plans for working on these issues from different people.
Feel free to pick whichever fits your workflow best. One can use to-do items, multiple issues, or a GitHub project to track progress. Tip: this is a to-do item.
I think there's a button in GitHub to split an issue with multiple items into multiple issues. I can help create a GitHub classic project to keep track of issues if needed. Feel free to ping me.
One can run R CMD install . under the package directory, devtools can load the package from path as well.
Thanks for the hint. Can confirm that R CMD INSTALL . from R-package/ works as expected under the new build system, as does RStudio's analog which uses devtools, along with the variants for running tests and building docs.
I will debug it, do you have a specific example?
Actually, this seems to have been already fixed when creating a DMatrix first, but this will still error out:
```r
library(xgboost)
library(Matrix)
data("mtcars")
y <- mtcars$mpg
x <- mtcars[, -1]
x.sp <- as(as.matrix(x), "RsparseMatrix")
model <- xgboost(data = x.sp, label = y)
```
Nevertheless, since we want to remove the code behind this function, I guess it's not worth fixing at this point.
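For reference, the route that already works per the above (building the DMatrix explicitly from the CSR input, then training through xgb.train() instead of the legacy xgboost() wrapper) would look like:

```r
# Sketch of the working path: construct the DMatrix from the CSR
# (dgRMatrix) input first, then train via xgb.train().
library(xgboost)
library(Matrix)

data("mtcars")
y <- mtcars$mpg
x.sp <- as(as.matrix(mtcars[, -1]), "RsparseMatrix")

dtrain <- xgb.DMatrix(x.sp, label = y)
model <- xgb.train(
  params = list(objective = "reg:squarederror"),
  data = dtrain,
  nrounds = 10
)
```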
I will work on it, but would like to have some feedbacks from you as well. Like is there any difference in the underlying arrays between table and frame?
There is no difference in the arrays; both are just a list (an R array object which in C has the internal type VECSXP and whose elements are retrieved through VECTOR_ELT) containing other SEXP objects whose pointers are retrievable through e.g. REAL(x), INTEGER(x), etc. Note that there's also a popular package bit64 that provides int64_t arrays.
Objects from data.table do have some differences though, in that they can contain lists of lists and other types, but those are anyway not supported by xgboost; and in other small details which shouldn't matter for DMatrix, such as potentially having hashes for some column or not having row names.
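A quick demonstration of the point about the underlying storage (assumes the bit64 package is installed):

```r
# A data.frame is just a list of column vectors (VECSXP at the C level),
# and a bit64 column can live inside one.
library(bit64)

df <- data.frame(a = c(1.5, 2.5))
df$b <- as.integer64(c(10, 20))

typeof(df)   # "list"
class(df$b)  # "integer64"
```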
In terms of creating a DMatrix from a data.frame, I guess the most reasonable route would be to only accept columns of types numeric + integer + factor + bit64::integer64, leaving potentially character for the high-level interface. Note that float::float32 cannot be put inside a data.frame, unlike bit64.
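A hypothetical sketch of that column-type restriction (the function name is made up for illustration, not existing xgboost code):

```r
# Hypothetical sketch: accept only numeric, integer, factor, and
# bit64::integer64 columns when building a DMatrix from a data.frame.
check_df_columns <- function(df) {
  ok <- vapply(
    df,
    function(col) {
      is.numeric(col) || is.integer(col) || is.factor(col) ||
        inherits(col, "integer64")
    },
    logical(1)
  )
  if (!all(ok)) {
    stop("Unsupported column types in: ",
         paste(names(df)[!ok], collapse = ", "))
  }
  invisible(TRUE)
}

check_df_columns(data.frame(a = 1.5, b = 2L, f = factor("x")))  # passes silently
```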
I will create a separate roadmap issue with checklists to discuss work on these issues and progress.
A couple of things I am wondering:
A. xgboost supports the sorted-indices splitting algorithm (tree_method="exact"). Under this algorithm, infinite values should simply be treated as greater or smaller than every other value, and since the raw values aren't used in any other statistic, it should be possible to fit a model with them. But it seems that xgboost will not accept infinite values regardless in the DMatrix construction. Are infinite non-missing values never accepted in input features?
B. I see that there are functions like XGBoosterPredictFromDMatrix for which part of the output is a pointer to an array of values. In the R interface, it doesn't look like the pointer that gets passed and filled inside the call is freed afterwards. Does that kind of function allocate arrays inside of the booster pointer? If so, does it mean that predicting on larger inputs makes the model heavier in memory?
C. I see there's some parallelization in the R wrapper for things like converting signed integers to unsigned.
Part of the current operations involves retrieving a pointer to the data in every iteration of the loop; if that part were removed by pre-retrieving the pointer, would the parallelization actually result in faster operations?
As far as I understand, for a regular amd64 platform, that sort of operation should be so fast that it's basically limited by the speed at which memory is retrieved from RAM, and having multiple threads retrieving from RAM shouldn't make it faster.
But it seems that xgboost will not accept infinite values regardless in the DMatrix construction
I disabled the inf input due to a lack of tests and other algorithms running into issues; to properly support it, we need test cases for all potential uses, including things like SHAP, categorical features, etc. At the moment it's not a very high priority. A use case would help.
it doesn't look like the pointer that gets passed and filled inside the call would be freed after
It's using thread-local static memory as a return buffer for thread-safe inference. The system will free it:

```cpp
static thread_local int myvar;
```
As far as I understand, for a regular amd64 platform, that sort of operation should be so fast
I agree. Feel free to remove them.
but if that part were to be removed by pre-retrieving the pointer
I think we can entirely remove the data copies by using the array interface, which is also supported by SetInfo. As a bonus, it saves memory as well, thanks to avoiding the allocation of a new buffer. Memory usage has been a huge headache, since some users are quite used to deep-learning-style batch training and find it frustrating that gradient boosting needs the entire dataset in memory to be efficient. Most of the type-specific C APIs are legacy now.
Another question: are there any guidelines for using SIMD code in xgboost?
I see for example there's a function in the R wrapper that subtracts 1 from elements in an array: https://github.com/dmlc/xgboost/blob/95af5c074be363bddc439fa3d1c1460a29cc818e/R-package/src/xgboost_R.cc#L286
I was thinking of adding #pragma omp simd
, but was wondering if perhaps xgboost includes some SIMD module with dynamic dispatching per CPU instruction set or similar.
No, we don't have those. Feel free to add it.
I thought about having an aligned allocator, but dropped the work as my attempt at using SSE did not speed up the histogram kernel. Another attempt was to use it to speed up allreduce on CPU; the difference was quite insignificant in my benchmark, so it was dropped again.
Hi @david-cortes, are you familiar with the error from the CRAN check caused by calling error/Rf_error:
xgboost_R.cc:70:3: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
I'm not sure how to fix it without removing the use of the error function.
https://cran.r-project.org/web/checks/check_results_xgboost.html
Update: I will try Rf_error("%s", msg) with rhub.
Rf_error("%s", msg) worked for us over in LightGBM when faced with the same warning: https://github.com/microsoft/LightGBM/pull/6216
We have a CI job there that replicates the Debian clang r-devel build from CRAN.
Yes, that seems to be the usual clang formatting warning, which is solved by using "%s", <str> instead of just <str>. That %s is also the solution that's being proposed at Rcpp: https://github.com/RcppCore/Rcpp/issues/1287
@trivialfis In the previous thread, you mentioned this as one of the desired things to implement for the R interface:
quantile regression with multiple quantiles.
Is this supported in the python interface? Looks like it errors out if you try to pass multiple quantiles.
If this is supported in python, could you provide an example of what this means?
Another question: does xgboost support multi-output survival objectives? It looks like `DMatrix` won't take multi-column inputs in the C interface for the bounds.
I understand multi-CoxPH would not be a trivial thing to do given the interdependencies between rows, and AFT would require a non-trivial multivariate Newton solver for the global parameters with interdependencies between columns (should be quite feasible though), but what if the user were to provide a custom survival objective that works with multiple columns?
> If this is supported in python, could you provide an example of what this means?
The example in `guide-python` should work.
As for future support of survival training with built-in objective, for the old cox regression, it should not be too difficult. But for the set of AFT objectives with lower and upper bounds, I need to take a deeper look before answering.
As for custom objectives, XGBoost supports a custom objective with a matrix gradient (along with the usual vector); as long as you can formulate your problem to return the right shape of gradient, XGBoost is happy to ignore what's in the label.
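As an illustration, a matrix-gradient custom objective could look roughly like the sketch below. This is a hedged sketch only: it assumes the R interface accepts matrix-valued gradients the same way the Python interface does, and that the labels were provided as a 2-column matrix; neither assumption is confirmed in this thread.

```r
# Sketch of a custom squared-error objective for a 2-column target.
# Assumes 'preds' arrives as an n x 2 matrix, mirroring the Python interface.
multi_squarederror <- function(preds, dtrain) {
  y <- matrix(getinfo(dtrain, "label"), ncol = 2)
  grad <- preds - y                                        # n x 2 gradient matrix
  hess <- matrix(1, nrow = nrow(grad), ncol = ncol(grad))  # constant Hessian
  list(grad = grad, hess = hess)
}
```

The key point is only the shape: the returned gradient and Hessian have one column per output, and the labels themselves are otherwise ignored by the booster.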
Before we figure out what to do with the CRAN package, do you think it's appropriate to submit XGBoost to r-universe so that more people can access the new interface with iterative development?
> Before we figure out what to do with the CRAN package, do you think it's appropriate to submit XGBoost to r-universe so that more people can access the new interface with iterative development?
Since it's a work in progress that's seeing continuous updates and hasn't yet gotten the major ones, I don't think it makes sense yet to release to r-universe. I'm guessing that people who want the latest would go to GitHub first anyway.
A couple of other questions:

* Is there some scenario where xgboost would accept a vector `base_score`? I see in the case of multi-output regression it uses zero instead of per-y-column means, and it doesn't take arrays/lists when using a custom objective.
* Is there some scenario where an array with more than 2 dimensions would be used?

Yet another question: is there a way to serialize a DMatrix without creating a file? I see that there's a function `XGDMatrixSaveBinary`, but I cannot find any equivalent for saving it to `char` bytes.

It would be useful to have such a function, as then both the R and Python interfaces would be able to serialize such objects (which is helpful, for example, when one wishes to save an R session to resume it later).

I see the python interface currently would throw this error:

```
ctypes objects containing pointers cannot be pickled
```

so it could also benefit from having `__getstate__` / `__setstate__` methods.
> Since it's a work in progress that's seeing continuous updates and hasn't yet gotten the major ones

Sounds good. We can discuss it then. Thank you for the continuous updates so far! Looks like there's new life in the R package. ;-)
> Is there some scenario where xgboost would accept a vector `base_score`?
No. Not until we can remove the old binary serialization format. https://github.com/dmlc/xgboost/issues/7547
> Is there some scenario where an array with more than 2 dimensions would be used?
Not yet. We don't have a use case yet. Maybe in the future, we will have multi-target-multi-quantile quantile regression, or some multi-target forecasting with multi-parameter distributions.
> is there a way to serialize a DMatrix without creating a file?
No. Is it necessary to serialize the DMatrix? Can we recreate it from the source data when needed? As previously discussed, we would like to keep the scope of this class as small as possible.
> I see the python interface currently would throw this error:
So far I have deliberately kept it this way so that Python doesn't secretly serialize it, which can cause trouble in distributed environments when the framework is throwing data around between workers.
I can certainly see how adding serialization to `DMatrix` could potentially lead to bad behaviors where the data is de-serialized twice or similar, but at least in R, there are some cases where this might be desirable even if it comes with a performance impact - e.g. saving the whole session as `.RData` to resume it later.

Now, the user can always just serialize the data source and re-create the DMatrix, and that's most likely going to be the most optimal route, but it could be a surprise to a user who isn't familiar with xgboost.
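For reference, the recreate-from-source pattern looks like this with standard xgboost R calls (the file name is illustrative):

```r
library(xgboost)

data(agaricus.train, package = "xgboost")

# Serialize the source data rather than the DMatrix handle.
saveRDS(list(x = agaricus.train$data, y = agaricus.train$label),
        "train_data.rds")

# Later, e.g. in a resumed session: rebuild the DMatrix from the source.
src <- readRDS("train_data.rds")
dm <- xgb.DMatrix(data = src$x, label = src$y)
```

This keeps the serialized artifact in plain R types, at the cost of re-running DMatrix construction on load.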
Also, if I understand it correctly, if `QuantileDMatrix` were to be implemented and a user wishes to share these between processes in a parallel backend, wouldn't serialization of that `QuantileDMatrix` be faster than serializing the whole data as-is?
> but could be a surprise to a user who isn't familiar with xgboost.
If the R package is transitioning into a high-level interface, the user is unlikely to interact with DMatrix, right? The class is more or less an implementation detail rather than a user-controlled object. For instance, the `sklearn` interface in XGBoost dispatches between `DMatrix` and `QuantileDMatrix` internally inside the `fit` method without users knowing the class.
That's a good point. Let's then leave it without `char` serialization, to avoid inefficiencies.
One question: in Python, we are quite sensitive to memory usage since users expect libraries to work with huge data inputs, and so we have done much work to reduce memory usage, sometimes being willing to sacrifice performance a little bit. Both the QuantileDMatrix and inplace_predict are for memory usage reduction. However, in R, I think many users (I might be wrong) have different priorities. From the introductory books I've read about R, memory usage and performance are of lesser concern, and the presentation of mathematical objects is much more important. For instance, the tidyverse proudly works with small data. As a result, one doesn't need to pick up all these optimizations from Python, or at least doesn't need to have them in the first couple of iterations. I'm curious about your thoughts @david-cortes @jameslamb .
I guess you are right that RStudio-developed packages tend not to be geared towards computational efficiency, and the average R user is probably not dealing with large datasets either, but there are still subsets of users striving to develop computationally efficient R code for large data, as evidenced by the popularity of R packages like DuckDB and arrow.
Even in base R, things have been moving towards memory efficiency lately through e.g. the ALTREP system which is further allowing packages like arrow to avoid memory copies and such.
That being said, I would consider memory efficiency in the R package to be a much lower priority than features that affect the results of a model.
I completely agree with @david-cortes 's summary.
In situations where you're forced to choose between memory efficiency and usability (i.e. compatibility with other R libraries or base R), I think it's totally defensible to choose usability.
There are two methods I found useful for the `DMatrix` class that are not yet exposed in R:

* `XGDMatrixNumNonMissing`: get the number of non-missing values.
* `XGDMatrixGetDataAsCSR`: return the DMatrix as a CSR matrix.

They are convenient for testing purposes. Do you think we should expose them given that we are transitioning into a high-level interface?
Sure, it looks like they should be quite easy to implement.
Got it, will add them to the roadmap and work on it.
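If they do get exposed, the R-level surface might look something like the sketch below. Both function names and both C entry points are hypothetical, written here purely for illustration; they are not part of the actual xgboost R package.

```r
# Hypothetical wrappers; neither name exists in the xgboost R package today.
xgb.num.non.missing <- function(dmat) {
  stopifnot(inherits(dmat, "xgb.DMatrix"))
  .Call(XGDMatrixNumNonMissing_R, dmat)  # assumed C entry point
}

xgb.get.data.csr <- function(dmat) {
  stopifnot(inherits(dmat, "xgb.DMatrix"))
  # Would return the contents as a 'dgRMatrix' (CSR) from the Matrix package.
  .Call(XGDMatrixGetDataAsCSR_R, dmat)   # assumed C entry point
}
```

Returning a `dgRMatrix` would keep the output in a standard Matrix-package class, which fits the testing use case described above.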
I'm having a hard time wrapping my head around how QuantileDMatrix creation works and what is or isn't required to pass to them.
Would it be possible to have C functions to create these QuantileDMatrices from array_interface strings for dense/csr data sources + required parameters? Is there some limitation that would make such an approach not work?
Let's continue the discussion around QDM in https://github.com/dmlc/xgboost/pull/9864 .
Question: when using categorical features, would the DMatrix passed to C prediction functions need to have `feature_types` set on it? What would happen if it is set in the prediction DMatrix, but these feature types differ from the ones that were used for training?

Yet another question: is there something currently preventing use of the field `feature_type` in the R interface?
Something like this seems to work right now, but I don't know if the feature types are actually getting used:
```r
library(xgboost)
data(iris)
y <- iris$Sepal.Length
x <- iris[, -1]
x$Species <- as.numeric(x$Species)
x <- as.matrix(x)
dm <- xgb.DMatrix(data = x, label = y)
setinfo(dm, "feature_type", c("q", "q", "q", "c"))
model <- xgb.train(
    data = dm,
    nrounds = 5
)
pred_mat <- predict(model, dm)
```
A couple of perhaps stupid questions that I'm wondering about after issue https://github.com/dmlc/xgboost/issues/9869 and which I would like to somehow add to the docs:

* How exactly is `base_score` used? Is this something that gets added to all the rows in what `predict` calls the "base_margin"? Or does it undergo some other process? Is there documentation around it?
* How is `base_score` estimated for objectives like `binary:logistic`? I had the impression that it would take the value that minimizes the objective function when the prediction is constant, like `lightgbm` does (so e.g. for logistic, this would be `log(sum(y==1)) - log(sum(y==0))`), but from a quick test with the agaricus data, this doesn't seem to be the case.
* For larger datasets, it seems to set `base_score` to the mean of `label` even for objective `binary:logistic`, while for smaller datasets, it seems to be close to the mean of `label`, but still off by some amount.
* Does `base_score` (or otherwise the intercept, if this isn't it) get updated with each tree that gets added, or is it always set to what was estimated initially?
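For reference, the constant-prediction minimizer mentioned above for the logistic objective can be computed directly in plain R (this shows what lightgbm-style initialization would produce, not what xgboost actually does):

```r
# Log-odds that minimize logistic loss for a constant prediction:
# log(count of positives) - log(count of negatives).
y <- c(1, 0, 0, 1, 1)
log(sum(y == 1)) - log(sum(y == 0))  # same value as qlogis(mean(y))
```

The equivalence holds because `qlogis(mean(y))` is `log(mean(y) / (1 - mean(y)))`, which simplifies to the difference of log-counts.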
Let me add documentation instead of answering with GH comments; it's always nice to have fresh eyes catch the lack of documentation.
It seems CMake is bundled with rtools43 for Windows? I found this while wrangling with MSVC:

```
C:\rtools43\usr\lib\mxe\usr\x86_64-w64-mingw32.static.posix\bin\cmake.exe
```
I'm not sure if RTools itself bundles CMake, but it should be possible to have it as a dependency under 'SystemRequirements' and then CRAN will ensure that it is available when it produces its pre-compiled binaries.
There's a section about CMake in the R extensions manual, but I'm not sure what the right way is to get an R package to use CMake: https://cran.r-project.org/doc/manuals/R-exts.html#Using-cmake
Thank you for sharing! I appreciate the reference. As long as CRAN is comfortable with the use of CMake, I'm sure there's a way to sort out all the technical details.
ref https://github.com/dmlc/xgboost/issues/9475
I'm opening this issue in order to propose a high-level overview of how an idiomatic R interface for `xgboost` should ideally look, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people regarding these ideas. I'm not aware of everything that `xgboost` can do, so perhaps I might be missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this though, but I don't know where else to put it.)
Looking at the python and C interfaces, I see that there's a low-level interface with objects like `Booster` and associated functions/methods like `xgb.train` that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier interface that works more or less the same way as scikit-learn objects, scikit-learn being the most common ML framework for python.

In the current R interface, there's to some degree a division into lower/higher-level interfaces with `xgb.train()` and `xgboost()`, but there's less of a separation, as e.g. both return objects of the same class, and the high-level interface is not very high-level as it doesn't take the same kinds of arguments as base R's `stats::glm()` or other popular R packages - e.g. it takes neither formulas nor data frames.

In my opinion, a more ideal R interface could consist of a mixture of a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface, plus a high-level interface that would make it ergonomic to use at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be to make use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like `survival::Surv`, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, `randomForest()` will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function `xgboost()` and a single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
* Having the low-level interface work **only** with `DMatrix` (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work **only** with common R objects (e.g. `data.frame`, `matrix`, `dgCMatrix`, `dgRMatrix`, etc.), but **not** with `DMatrix` (IMO this would not only make the codebase and docs simpler, but also, given that both have different metadata, would make it easier to guess what to expect as outputs from other functions).
* Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the `predict` function for the object returned by `xgb.train()` would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the `predict` function for the object returned from `xgboost()` would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
* The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
* Having separate associated `cv` functions for the low-level and the high-level interfaces (e.g. `xgboost.cv()` or `cv.xgboost()`, like there is `cv.glmnet()`; and another `xgb.cv()` that would work with `DMatrix`).
* Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
* Exporting from the package only what's necessary to support these two modes of operation alongside their methods - e.g. not exporting internal functions like `xgboost::normalize` or internal classes/methods like `Booster.handle`.

A couple of extra details **about the high-level interface** that in my opinion would make it nicer to use.
I think many people might disagree with some of the points here, though, so I would like to hear opinions on these ideas:
* It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow).
* The x/y interface should accept `x` in the format of common R classes like `data.frame`, `matrix`, `dgCMatrix`, perhaps `arrow` (although I wouldn't consider it a priority), and perhaps other classes like `float32` (from the `float` package). I don't think it'd be a good idea to support more obscure potential classes like `dist` that aren't typically used in data pipelines.
* If I'm not mistaken, CSR is supported for prediction but not for training, so `predict` should additionally take `dgRMatrix` and `sparseVector` like it does currently.
* I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
* If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
* (**Controversial**) It should not only accept but also **require** different classes for `y` depending on the objective, and by default, if not passed, should select an objective based on the type of `y`:
  * `factor` with 2 levels -> `binary:logistic`.
  * `factor` with >2 levels -> `multi:softmax`.
  * `survival::Surv` -> `survival:aft` (with normal link), or maybe `survival:cox` if only right-censored.
  * others -> `reg:squarederror`.
* And consequently, require `factor` types for classification, `Surv` for survival, and numeric/integer for regression (taking `logical` (boolean) as 0/1 numeric).
* If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either `data.frame` or `matrix`, and keep their column names as metadata if they have them.
* If `x` is a `data.frame`, it should automatically recognize `factor` columns as categorical types, and (**controversial**) also take `character` (string) class as categorical, converting them to `factor` on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
* (**Controversial**) It should only support categorical columns in the form of an input `x` being a `data.frame` with `factor`/`character` types. Note: see the question below about sparse types and categorical types.
* If `x` is either a `data.frame` or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
  * For example, if `x` has 4 columns `[c1, c2, c3, c4]`, it should allow passing `monotone_constraints` as either `c(-1, 0, 0, 1)`, or as `c("c1" = -1, "c4" = 1)`, or as `list(c1 = -1, c4 = 1)`; but not as e.g. `c("1" = -1, "4" = -1)`.
  * OTOH if `x` is a `matrix`/`dgCMatrix` without column names, it should accept them as `c(-1, 0, 0, 1)`, or as `c("1" = -1, "4" = -1)`, erroring out on non-integer names, not allowing negative numbers, and not allowing `list`.

The low-level interface in turn should support everything that the C interface offers - e.g. creating `DMatrix` from libsvm files, specifying categorical columns from a non-data-frame array, etc.

As per the formula interface, this is quite tricky to implement in a good way for xgboost. In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

* Stepwise selection from base R doesn't work with xgboost.
* Transformations like `log(x1)` have no effect in decision trees, transformations like `x1:x2` don't have the same practical effect as decision trees implicitly create interactions, and transformations like `x^2` the way R does them do not make a difference for decision trees compared to the simpler `I(x^2)+x`, for example.
* Other outputs from formulas, like contrasts, are not used either way.
* Transformations like `log(y)` aren't any harder to do with an x/y interface than with a formula interface.

Nevertheless, a formula interface can still be handy for calls like `y ~ x1 + x2 + x3` when there are many columns, and packages like `ranger` offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package. Some nuances in terms of formulas though:

* Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named `(Intercept)` that would have the same value for every row.
* Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
* Using a `formula` for processing the training data also implies using it for `predict` - so for example, formulas do not recode levels of factor variables when calling `predict`, which the x/y interface could potentially do, leading to differences in behavior between both interfaces.
* Some packages have their own formula parsers which allow additional interpretations for something like `y ~ x | z` than what base R would do (for example, `lme4` would interpret `z` here as something that controls mixed effects for `x`, while base R would interpret this as a feature "x or z"), and in some cases `xgboost()` would also need a different interpretation of formulas (e.g. for the parameter `qid`, which doesn't fit on either side of the formula).
* Packages like `randomForest` don't use the base R formula parser, taking it instead (by copying the code) from a different library, `e1071`, which is GPLv2-licensed - incompatible with xgboost's Apache license.
* Referring to columns that are the result of a formula by index in e.g. `monotone_constraints` could be tricky - e.g. if we remove the auto-added `(Intercept)` column, should the numbers re-adjust?

Hence, supporting a formula interface for `xgboost()` would be tricky:

* It could be possible to use base R's formula parser, but:
  * In this case it would not be possible to use categorical features (or their interactions) as such, as they will be either dummy- or one-hot encoded.
  * It theoretically could be possible to try to forcibly add `-1` at the end of the formula (which means "don't add an intercept") by converting it to a string and back, in order to get one-hot encoding of factors and avoid adding `(Intercept)`, but I can foresee cases in which this might not work depending on how the formula is inputted.
  * It would not be possible to specify additional columns like `qid` in these formulas.
* It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and would e.g. interpret `|` differently, but what should happen in this case if the user requests something like `xnum*xcat` or `f(xcat)` (with `f` being some arbitrary base function)?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as a dependency (I'm not aware of any), I think the best way forward here would be to:

* Use base R's formula parser.
* Don't support parameters like `monotone_constraints` or `qid` in the formula interface.
* Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
* Try to remove the intercept by creating a new formula.

A couple of questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
* Does `xgboost` support passing a mixture of dense and sparse data together? `pandas` supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the `DMatrix` object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a `DMatrix` for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a `list`.
* Is there a C-level `DMatrix` equivalent of `cbind` / `np.c_`? I don't see any, but this would make things easier.
* Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
* In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function `XGDMatrixCreateFromDT`, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like `int64`, these would be cast to a floating-point type. In R, especially when using `data.table`, it's oftentimes common to have 64-bit integers from the package `bit64`, which if cast to `float64`, would lose integral precision beyond $2^{52}$ (same for `int32`, which could lose precision when cast to `float32`) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a `DMatrix` from a structure like an R `data.frame`, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
* Should the R interface keep book on the objectives and their types, or should for example the C interface have functionalities to determine if an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which in the future someone for example submits a PR adding a new objective like `binary:cauchit`, adds it to the python `XGBClassifier` class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
* How does the python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically keep in sync with the C one every time a new parameter is added? Or will it need to manually keep a recollection of every possible allowed argument and its default value?
* How do other interfaces deal with idiosyncrasies like base-0 vs. base-1 indexing? I see there's also a Julia interface for example, which just like R uses base-1 indexing, but from a brief look at the docs it looks like it sticks with base-0 indexing for `xgboost` regardless.
* Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C `Booster`, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from `xgboost()`, an earlier object saved with `saveRDS` will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields as they are added to the R interface when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work and be quite hard to test and easy to miss when adding new features.
* How does the python scikit-learn interface deal with `pickle` and object attributes that aren't part of the C `Booster`? How does it deal with converting between `Booster` and scikit-learn classes?
* I see the R interface currently has a big notice about objects serialized with non-xgboost functions like `saveRDS` not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C `Booster`, but there could be cases in which the R interface changes independently of the C struct.
* If one were to load a C `Booster` into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the `Booster` wouldn't? Is it somehow standardized?
* How does `xgboost` determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
* Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a `DMatrix`? If so, I guess support for `arrow` in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):

* Are there any popular data.frame-like objects from packages related to GPU computing (like `cuDF` for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
* Is there any popular R type for learning-to-rank tasks that would involve the `qid` and position information that `xgboost` uses? I am not aware of any, but I'm no expert in this area.
* Is there perhaps some object class from Bioconductor or from some other domain (like geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without undergoing transformation by the user to e.g. `factor` / `numeric` / etc.?
* Is there some package offering a custom formula parser meant for decision tree models?