p-schaefer closed this issue 1 year ago
I just noticed my dtest object wasn't generated by xgboostlss.py$model$xgb$DMatrix(), so that solved the first issue I was having, but I'm still not sure about the second.
@p-schaefer Thanks for your interest in the project.
In its current publicly available version, XGBoostLSS does not support multi-target regression; that part is still under development. I also highlight in the news section that support will be released soon.
The main reason is that the base XGBoost model is designed for single-target regression tasks. While efficient for low- to medium-dimensional multivariate targets, as shown in the multi-target XGBoostLSS paper, the computational cost of estimation becomes prohibitive in high-dimensional settings. As an example, consider modelling a multivariate Gaussian distribution with D = 100 target variables, where the covariance matrix is approximated using the Cholesky decomposition. Modelling all conditional moments (i.e., means, standard deviations and all pairwise correlations) requires estimating D(D + 3)/2 = 5,150 parameters. Because XGBoost is based on a one-vs-all estimation strategy, where a separate tree is grown for each parameter, estimating this many parameters on a large dataset can become computationally extremely expensive, if not impossible.
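To make that parameter count concrete, it can be reproduced in a few lines (the function name is illustrative, not part of XGBoostLSS):

```python
# Parameters of a D-dimensional Gaussian parameterised via its Cholesky factor:
# D means + D standard deviations + D*(D-1)/2 pairwise correlations = D*(D+3)/2,
# i.e. one separate tree per parameter per boosting round under one-vs-all.
def mvn_param_count(d: int) -> int:
    return d * (d + 3) // 2

print(mvn_param_count(100))  # 5150
print(mvn_param_count(40))   # 860
```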
While I plan to release the multi-target XGBoostLSS model soon, there is a new member of the LSS model family coming sometime next week that is faster than both LightGBM and XGBoost. So stay tuned.
Hope this helps.
Thank you for the follow-up. I have a very interesting project where I believe multivariate regression would be a good solution. I am predicting the abundances of aquatic taxa (fish and benthic invertebrates) across Ontario, Canada. I believe predicting the entire community with a single model will be more effective than predicting each taxon individually, or individual community summaries. I would have around 30-40 taxa I am trying to predict, so hopefully still in the medium-size group. I'm eagerly awaiting the release of this feature.
Out of curiosity, what would be the difference in complexity in predicting expectiles (say 0.05, 0.50, 0.95) vs. a specific distribution?
I'm eagerly awaiting the release of this feature.
The release of the multi-target XGBoostLSS feature will probably be sometime later this month. But there is a new boosting library that I am currently extending to a probabilistic setting. I am planning the first release sometime next week.
Out of curiosity, what would be the difference in complexity in predicting expectiles (say 0.05, 0.50, 0.95) vs. a specific distribution?
So currently, expectiles do not model any dependencies between targets and treat all of them independently. The complexity is driven by the number of expectiles you want to estimate: if you estimate the [0.05, 0.5, 0.95] expectiles, there are only 3 parameters to estimate. However, please note that expectile crossing may happen, as described here: https://github.com/StatMixedML/LightGBMLSS/issues/7.
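For intuition, here is a minimal NumPy sketch of what a single tau-expectile is: the minimiser of an asymmetrically weighted squared loss, computed here by a simple fixed-point iteration rather than by boosting (the function is illustrative, not LightGBMLSS/XGBoostLSS API):

```python
import numpy as np

def expectile(y, tau, n_iter=100):
    """tau-expectile via asymmetric least squares (fixed-point iteration).
    Observations above the current estimate get weight tau, the rest 1 - tau;
    tau = 0.5 recovers the ordinary mean."""
    y = np.asarray(y, dtype=float)
    e = y.mean()
    for _ in range(n_iter):
        w = np.where(y > e, tau, 1.0 - tau)
        e = np.sum(w * y) / np.sum(w)
    return e

y = np.random.default_rng(0).normal(size=1000)
print([round(expectile(y, t), 3) for t in (0.05, 0.5, 0.95)])
```

Each expectile is a single scalar summary per observation, which is why three expectiles mean only three parameters, but nothing in the independent fits forces them to stay in order (hence the crossing issue linked above).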
In contrast, estimating a multivariate distribution is computationally more expensive, since you need to estimate all conditional moments (i.e., means, standard deviations and all pairwise correlations). With D = 40 target variables, you would need to estimate D(D + 3)/2 = 860 parameters for the multivariate Normal.
@p-schaefer The new package is out
https://github.com/StatMixedML/Py-BoostLSS
in case you want to give it a spin.
Wow, that looks great. Thank you!
Unfortunately I need some of the other distributions (specifically beta and gamma) for my response variables. Any thoughts on how long those will take to implement? And actually, the Tweedie distribution would be very helpful as well.
@p-schaefer Well, for the multivariate beta: that would be the Dirichlet distribution. I already have a base implementation for this. If that would be useful, I would need to re-shuffle some of my time to make it available.
For the other two: I currently wouldn't see a fast way to implement them.
Thanks for the context. Beta would be a great start.
Some quick googling does show some multivariate Tweedie distributions have been developed:
- https://www.sciencedirect.com/science/article/abs/pii/S016766870900153X
- https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsac159/6710216
But of course, I don't know if those would fit into your frameworks.
Thanks again for the great work!
Hey @StatMixedML , I'm really excited for this library as well and particularly in getting a multi-target probabilistic regression! I'll also try to give Py-BoostLSS a spin soon!
Have you had a chance to have a look at catboost? I think their multi-output regression is built using a single model (source)
@tim-habitat Thanks for your interest in the project.
have you had a chance to have a look at catboost ? I think their multi-output regression is built using a single model
You are right, catboost uses a single-tree approach, where a single multivariate tree is built for all parameters. I am not entirely sure, but I recall that catboost does not support GPU training of custom loss functions. Can you maybe check on this?
@p-schaefer I moved the issue to the Py-BoostLSS directory.
Thanks for the context. Beta would be a great start.
Ok, I'll try to have the Dirichlet available soon, maybe sometime early next week. Do you know of any publicly available Dirichlet dataset I can use?
Some quick googling does show some multivariate Tweedie distributions have been developed.
Nice, thanks for the links! I'll check them out. Having a multivariate Tweedie is for sure a great feature to have.
@StatMixedML I did some digging and it seems that GPU training is not supported when passing a custom loss function. Perhaps it would work if the loss function were coded in C or C++ against the lower-level API, but not from the Python API... Not sure it would even be multi-threaded, due to the Python GIL; a couple of people are complaining about its slowness when passed custom functions.
@tim-habitat Ok nice, thanks! I am not sure then if it makes sense to use catboost for multi-target regression settings...
Ok, I'll try to have the Dirichlet available soon, maybe sometime early next week. Do you know of any publicly available Dirichlet dataset I can use?
@StatMixedML I'm not sure what your exact requirements are for a dataset, but the DirichletReg R package has a few small datasets that may be helpful.
@p-schaefer Thanks for the link. I was hoping to not use the Arctic-Lake dataset, but should be ok for an example notebook.
@p-schaefer I have added the Dirichlet distribution with an example. Maybe you can try it with your dataset. Please make sure to first re-install the package, since I have made several changes.
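One note for anyone preparing data for the Dirichlet case: the response has to be compositional, i.e. each row non-negative and summing to one, like the sand/silt/clay shares in the ArcticLake data. A small NumPy illustration of that target format (plain NumPy, not Py-BoostLSS API):

```python
import numpy as np

rng = np.random.default_rng(42)

# A Dirichlet response is a composition: each row is non-negative and the
# entries sum to 1 across the D = 3 components (like sand/silt/clay shares).
y = rng.dirichlet(alpha=[2.0, 5.0, 3.0], size=4)
print(y)
print(y.sum(axis=1))  # each row sums to 1
```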
@StatMixedML I think I've got the package installed correctly, but I'm having trouble getting it run with reticulate:
hp_dict = list(
  lr = c(0.00001, 1),
  max_depth = c(1L, 4L),
  sketch_outputs = c(1L, 10L),
  subsample = c(0.5, 1),
  colsample = c(0.2, 1),
  lambda_l2 = c(0, 40),
  min_gain_to_split = c(1L, 1000L)
)

dtrain <- list(
  X = data.frame(depth = DirichletReg::ArcticLake[, "depth"]),
  y = DirichletReg::ArcticLake[, 1:3]
)

distribution = pyboostlss.py$distributions$DIRICHLET(D = r_to_py(as.integer(ncol(dtrain$y))))
pyblss = pyboostlss.py$model$PyBoostLSS(distribution)

opt_param = pyblss$hyper_opt(
  params = r_to_py(hp_dict),
  dtrain = r_to_py(dtrain),
  use_hess = r_to_py(TRUE),
  sketch_method = r_to_py("proj"),
  hp_seed = r_to_py(123L),
  ntrees = r_to_py(500L),
  n_trials = r_to_py(10L),
  max_minutes = r_to_py(120L),
  silence = r_to_py(FALSE)
)

# Error in py_call_impl(callable, dots$args, dots$keywords) :
#   AttributeError: 'DataFrame' object has no attribute 'dtype'
@p-schaefer seems like this is a problem related to the reticulate package. Since I am not a reticulate expert, I am afraid I don't have a solution for this.
I suppose the error is thrown here https://github.com/StatMixedML/Py-BoostLSS/blob/7a7213f44f2533c7165c36f8e7fd79507c50dc4e/pyboostlss/distributions/MVN.py#L187 or here; I am not sure how reticulate translates R data frames into torch tensors.
Since Py-BoostLSS is meant to increase runtime efficiency, I wouldn't recommend calling it via r-reticulate. Maybe you want to use the Python version directly.
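For what it's worth, the traceback is consistent with reticulate handing over a pandas DataFrame where a NumPy array (or torch tensor) is expected: a DataFrame has `dtypes` (plural) but no `dtype` attribute. A minimal sketch of the mismatch and a possible workaround, using only pandas/NumPy; whether Py-BoostLSS accepts the converted array is an assumption:

```python
import numpy as np
import pandas as pd

# Stand-in for what reticulate produces from an R data.frame.
df = pd.DataFrame({"sand": [0.7, 0.6], "silt": [0.2, 0.3], "clay": [0.1, 0.1]})

# A DataFrame has no `dtype` attribute, which is exactly what the error says.
print(hasattr(df, "dtype"))  # False

# Converting to a plain NumPy array before the call sidesteps that lookup.
arr = df.to_numpy(dtype=np.float64)
print(arr.dtype)  # float64
```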
Closing this for now.
I'm entirely new to Python and reticulate, so please bear with me, but I think I got most steps working correctly. I am running into two issues: (1) trying to predict from a trained model, and (2) how do you generate the DMatrix for multi-target regression?
So far, I have:
Any help would be greatly appreciated.