StatMixedML / Py-BoostLSS

An extension of Py-Boost to probabilistic modelling
Apache License 2.0
20 stars 0 forks

using XGBoostLSS with reticulate and multi-target regression #2

Closed p-schaefer closed 1 year ago

p-schaefer commented 1 year ago

I'm entirely new to Python and reticulate, so please bear with me, but I think I have most steps working correctly. I am running into two issues: (1) predicting from a trained model, and (2) generating the DMatrix for multi-target regression.

So far, I have:

xgboostlss.py<-reticulate::import("xgboostlss")

d1<-mtcars

y_out<-matrix(rnorm(nrow(d1)*3),ncol=3)
colnames(y_out)<-c("a","b",'c')

# I can't get the label argument to take more than 1 column here...
dtrain <- xgboostlss.py$model$xgb$DMatrix(as.matrix(head(d1,25)),label=head(y_out[,1],25)) 
dtest <- xgboost::xgb.DMatrix(as.matrix(tail(d1,7))) #label=tail(y_out[,1],7)

distribution = xgboostlss.py$distributions$Expectile
reticulate::py_set_attr(distribution,"stabilize",reticulate::r_to_py("MAD"))
reticulate::py_set_attr(distribution,"expectiles",reticulate::r_to_py(c(0.05,0.5, 0.95)))

params = list(eta= 0.05,                   
              max_depth= 5L,
              gamma= 3,
              subsample= 0.5,
              colsample_bytree= 0.5,
              min_child_weight= 100L
)

xgboostlss_model = xgboostlss.py$model$xgboostlss$train(params=reticulate::r_to_py(params),
                                                        dtrain=dtrain,
                                                        dist=distribution,
                                                        num_boost_round=reticulate::r_to_py(10L))

pred_expectile = xgboostlss.py$model$xgboostlss$predict(booster=xgboostlss_model, 
                                                        dtest=dtest, 
                                                        dist=distribution,
                                                        pred_type=reticulate::r_to_py("expectiles"))

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  AttributeError: 'PyCapsule' object has no attribute 'num_row'

Any help would be greatly appreciated.

p-schaefer commented 1 year ago

I just noticed my dtest object wasn't generated by xgboostlss.py$model$xgb$DMatrix(); fixing that solved the first issue I was having, but I'm still not sure about the second.

StatMixedML commented 1 year ago

@p-schaefer Thanks for your interest in the project.

In its current public version, XGBoostLSS does not support multi-target regression; the feature is still in development. As noted in the news section, support will be released soon.

The main reason is that the base XGBoost model is designed for single-target regression tasks. While efficient for low- to medium-dimensional multivariate targets, as shown in the multi-target XGBoostLSS paper, the computational cost of estimation becomes prohibitive in high-dimensional settings. As an example, consider modelling a multivariate Gaussian distribution with D = 100 target variables, where the covariance matrix is approximated using the Cholesky decomposition. Modelling all conditional moments (i.e., means, standard deviations and all pairwise correlations) requires estimating D(D + 3)/2 = 5,150 parameters. Because XGBoost is based on a one-vs-all estimation strategy, where a separate tree is grown for each parameter, estimating this many parameters for a large dataset can become computationally extremely expensive, if not impossible.
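As a quick sanity check of the arithmetic, the parameter count can be derived in a few lines (a small self-contained Python sketch, not part of the XGBoostLSS API):

```python
def mvn_param_count(D: int) -> int:
    """Distributional parameters of a D-dimensional Gaussian parameterised
    via means, standard deviations, and pairwise correlations."""
    means = D
    std_devs = D
    correlations = D * (D - 1) // 2  # lower triangle of the correlation matrix
    return means + std_devs + correlations  # equals D * (D + 3) // 2

print(mvn_param_count(100))  # 5150 parameters, i.e. 5150 separate trees per boosting round
```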

While I plan to release the multi-target XGBoostLSS model soon, there is a new member of the LSS model family coming some time next week that is faster than LightGBM and XGBoost. So stay tuned.

Hope this helps.

p-schaefer commented 1 year ago

Thank you for the follow-up. I have a very interesting project where I believe multivariate regression would be a good solution: I am predicting the abundances of aquatic taxa (fish and benthic invertebrates) across Ontario, Canada. I believe predicting the entire community with a single model will be more effective than predicting each taxon separately, or individual community summaries. I would have around 30-40 taxa to predict, so hopefully still in the medium-size group. I'm eagerly awaiting the release of this feature.

Out of curiosity, what would be the difference in complexity in predicting expectiles (say 0.05, 0.50, 0.95) vs. a specific distribution?

StatMixedML commented 1 year ago

I'm eagerly awaiting the release of this feature.

The release of the XGBoostLSS feature will probably be sometime later this month. But there is a new boosting library that I am currently extending to a probabilistic setting; I am planning the first release sometime next week.

Out of curiosity, what would be the difference in complexity in predicting expectiles (say 0.05, 0.50, 0.95) vs. a specific distribution?

So currently, expectiles do not model any dependencies between targets and treat all of them independently. The complexity is driven by the number of expectiles you want to estimate: if you estimate the [0.05, 0.5, 0.95] expectiles, there are only 3 parameters to estimate. However, please note that expectile crossing may happen, as described in https://github.com/StatMixedML/LightGBMLSS/issues/7.

In contrast, estimating a multivariate distribution is computationally more expensive, since you need to estimate all conditional moments (i.e., means, standard deviations and all pairwise correlations). With ~40 target variables, you would need to estimate D(D + 3)/2 = 860 parameters for the multivariate Normal.
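To put numbers on the comparison (a minimal sketch; the expectile count assumes the current single-target treatment, i.e. one parameter per requested expectile):

```python
expectiles = [0.05, 0.5, 0.95]
n_expectile_params = len(expectiles)  # expectiles are estimated independently

def mvn_param_count(D: int) -> int:
    # D means + D standard deviations + D * (D - 1) // 2 pairwise correlations
    return D * (D + 3) // 2

print(n_expectile_params)   # 3
print(mvn_param_count(40))  # 860 parameters for a 40-dimensional multivariate Normal
```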

StatMixedML commented 1 year ago

@p-schaefer The new package is out

https://github.com/StatMixedML/Py-BoostLSS

in case you want to give it a spin.

p-schaefer commented 1 year ago

Wow, that looks great. Thank you!

Unfortunately I need some of the other distributions (specifically beta and gamma) for my response variables. Any thoughts on how long those will take to implement? And actually, the Tweedie distribution would be very helpful as well.

StatMixedML commented 1 year ago

@p-schaefer Well, for the multivariate beta: that would be the Dirichlet distribution. I already have a base implementation for this. If that would be useful, I would need to re-shuffle some of my time to make it available.

For the other two, I currently don't see a fast way to implement them.

p-schaefer commented 1 year ago

Thanks for the context. Beta would be a great start.

Some quick googling shows that some multivariate Tweedie distributions have been developed:

- https://www.sciencedirect.com/science/article/abs/pii/S016766870900153X
- https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsac159/6710216

But of course, I don't know if those would fit into your framework.

Thanks again for the great work!

tim-x-y-z commented 1 year ago

Hey @StatMixedML, I'm really excited about this library as well, particularly about getting multi-target probabilistic regression! I'll also try to give Py-BoostLSS a spin soon! Have you had a chance to take a look at catboost? I think their multi-output regression is built using a single model (source).

StatMixedML commented 1 year ago

@tim-habitat Thanks for your interest in the project.

have you had a chance to have a look at catboost ? I think their multi-output regression is built using a single model

You are right, catboost uses a single-tree approach, where one multivariate tree is built for all parameters. I am not entirely sure, but I recall that catboost does not support GPU training with custom loss functions. Can you maybe check on this?

StatMixedML commented 1 year ago

@p-schaefer I moved the issue to the Py-BoostLSS directory.

Thanks for the context. Beta would be a great start.

Ok, I'll try to have the Dirichlet available soon, maybe sometime early next week. Do you know of any publicly available Dirichlet dataset I can use?

Some quick googling does show some multivariate Tweedie distributions have been developed.

Nice, thanks for the links! I'll check them out. Having a multivariate Tweedie is for sure a great feature to have.

tim-x-y-z commented 1 year ago

@StatMixedML I did some digging, and it seems that GPU training is not supported when passing a custom loss function. Perhaps it would work if the loss function were coded in C or C++ against the lower-level API, but not from the Python API... I'm not sure it would even be multi-threaded, due to the Python GIL; a couple of people have complained about its slowness when passing custom functions.

StatMixedML commented 1 year ago

@tim-habitat Ok, nice, thanks! I am not sure, then, that it makes sense to use catboost for multi-target regression settings...

p-schaefer commented 1 year ago

Ok I try to have the Dirichlet available soon, maybe some time early next week. Do you know of any publicly available Dirichlet dataset I can use?

@StatMixedML I'm not sure what your exact requirements are for a dataset, but the DirichletReg R package has a few small datasets that may be helpful.

StatMixedML commented 1 year ago

@p-schaefer Thanks for the link. I was hoping not to use the ArcticLake dataset, but it should be ok for an example notebook.

StatMixedML commented 1 year ago

@p-schaefer I have added the Dirichlet distribution with an example. Maybe you can try it with your dataset. Please make sure to first re-install the package, since I have made several changes.

p-schaefer commented 1 year ago

@p-schaefer I have added the Dirichlet distribution with an example. Maybe you can try it with your dataset. Please make sure to first re-install the package, since I have made several changes.

@StatMixedML I think I've got the package installed correctly, but I'm having trouble getting it to run with reticulate:

hp_dict = list(lr= c(0.00001,1),                   
              max_depth= c(1L,4L),
              sketch_outputs= c(1L,10L),
              subsample= c(0.5,1),
              colsample= c(0.2,1),
              lambda_l2=c(0,40),
              min_gain_to_split= c(1L,1000L)
)

dtrain<-list(
  X=data.frame(depth=DirichletReg::ArcticLake[,"depth"]),
  y=DirichletReg::ArcticLake[,1:3]
)

distribution = pyboostlss.py$distributions$DIRICHLET(D=r_to_py(as.integer(ncol(dtrain$y))))  
pyblss = pyboostlss.py$model$PyBoostLSS(distribution)

opt_param = pyblss$hyper_opt(params=r_to_py(hp_dict),
                             dtrain=r_to_py(dtrain),
                             use_hess=r_to_py(TRUE), 
                             sketch_method=r_to_py("proj"),
                             hp_seed=r_to_py(123L),                
                             ntrees=r_to_py(500L),                
                             n_trials=r_to_py(10L),              
                             max_minutes=r_to_py(120L),           
                             silence=r_to_py(FALSE))   

# Error in py_call_impl(callable, dots$args, dots$keywords) : 
# AttributeError: 'DataFrame' object has no attribute 'dtype'

StatMixedML commented 1 year ago

@p-schaefer This seems like a problem related to the reticulate package. Since I am not a reticulate expert, I am afraid I don't have a solution for this.

I suppose the error is thrown here:

https://github.com/StatMixedML/Py-BoostLSS/blob/7a7213f44f2533c7165c36f8e7fd79507c50dc4e/pyboostlss/distributions/MVN.py#L187

or here:

https://github.com/StatMixedML/Py-BoostLSS/blob/2f141e118b09e9dca9a2eaf7ec716950ae83fb01/pyboostlss/distributions/MVT.py#L196

or here:

https://github.com/StatMixedML/Py-BoostLSS/blob/2f141e118b09e9dca9a2eaf7ec716950ae83fb01/pyboostlss/utils.py#L20

I am not sure how reticulate translates R data frames into torch tensors.
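One plausible explanation (an assumption on my part; I haven't traced the reticulate internals): reticulate hands an R data.frame over as a pandas DataFrame, which exposes `.dtypes` but not `.dtype`, whereas the numpy arrays the code expects do carry a `.dtype`:

```python
import numpy as np
import pandas as pd

# A toy frame standing in for the ArcticLake features passed from R
df = pd.DataFrame({"depth": [10.4, 11.7, 12.8]})

print(hasattr(df, "dtype"))             # False: DataFrames only expose .dtypes
print(hasattr(df.to_numpy(), "dtype"))  # True: ndarrays carry a single .dtype
```

If that is the cause, converting with `as.matrix()` on the R side, so that reticulate produces a numpy array instead of a pandas DataFrame, may avoid the error.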

Since Py-BoostLSS is meant to increase runtime efficiency, I wouldn't recommend calling it via reticulate. Maybe you want to use the Python version directly.

StatMixedML commented 1 year ago

closing this for now