dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

RoadMap #574

Closed tqchen closed 8 years ago

tqchen commented 9 years ago

We are happy to see the project grow stable and the community take over some of the hard work. We decided to move things into an open GitHub issue so that people who see it can discuss and give suggestions. Of course, no one can finish all these things. Feel free to comment or open an issue if you would like to contribute, suggest your thoughts, or say what you think the priorities should be.

Tutufa commented 9 years ago

@tqchen A Scala wrapper would be nice, since Scala is growing in popularity.

tqchen commented 9 years ago

There is already a Java wrapper; I suppose a Scala one can easily be built on top of the Java version.

Tutufa commented 9 years ago

I always hear a lot of good things about Yandex MatrixNet. It is boosting with oblivious trees; it doesn't work with categorical features, but it rocks on continuous ones. Maybe you could implement a new type of booster (-:

Far0n commented 9 years ago

I would love to see feature importance judged on validation data in order to find noisy features more reliably.
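One way to get at this today, without changing xgboost itself, is permutation importance computed on a held-out set. Below is a minimal sketch with the Python API, assuming a trained booster `bst`, numpy arrays `X_valid` and `y_valid`, and a loss-style `metric(y_true, y_pred)`; all of these names are placeholders, not code from the thread.

```python
import numpy as np
import xgboost as xgb

def permutation_importance(bst, X_valid, y_valid, metric, seed=0):
    """Increase in validation loss when each feature is shuffled; ~0 for noisy features."""
    rng = np.random.RandomState(seed)
    baseline = metric(y_valid, bst.predict(xgb.DMatrix(X_valid)))
    importances = []
    for j in range(X_valid.shape[1]):
        X_perm = X_valid.copy()
        # Break this feature's link to the target while keeping its marginal distribution.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        permuted = metric(y_valid, bst.predict(xgb.DMatrix(X_perm)))
        importances.append(permuted - baseline)
    return np.asarray(importances)
```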

kirillseva commented 8 years ago

CUDA/OpenCL rewrite? :)

tqchen commented 8 years ago

GPU support was not on the recent roadmap, mainly because there is no clear evidence of how it could be done. Tree construction algorithms are different from neural nets, and are harder to parallelize on GPUs due to irregular memory access patterns and a memory bandwidth bottleneck. But I am open to possible proposals.

birdandbees commented 8 years ago

@tqchen Can we have a Spark version of xgboost?

tqchen commented 8 years ago

@birdandbees XGBoost already runs on YARN, which means it can run on most clusters that have Hadoop.

I would also be more than happy to see a Spark integration happen. It would require integrating container startup and the rabit API with Spark (likely via JNI), which should be doable if someone is willing to try.

birdandbees commented 8 years ago

@tqchen I see, so a Spark implementation is more tied to the rabit API. Would it be possible to make xgboost part of Spark MLlib?

tqchen commented 8 years ago

xgboost is like one Lego brick, and distributed computing platforms such as Spark and YARN are other Lego bricks.

xgboost can be put directly on top of any other brick as long as the interfaces match. In the case of xgboost, that interface is the minimal rabit communication API, or the lower-level container allocation API (which is provided by YARN). We tried to build our brick to be portable, so as long as the other brick matches the few interface requirements, xgboost can be plugged into it.

I like this approach because it avoids re-implementing most of the library, and ideally lets us port, run, and benefit from all the optimizations we have in xgboost without being constrained to certain platform types. We have done this for platforms such as Hadoop/YARN, MPI, etc.

Spark was a bit harder because of the "brick matching": Spark provides some higher-level APIs and execution primitives that need to be mapped to rabit. I think it is doable and would definitely love to see this happen.

birdandbees commented 8 years ago

OK, so what do we need to do to make the Spark integration (on rabit) happen? If I want to contribute, which part of the source code should I look at first?

tqchen commented 8 years ago

Yes, this is mostly about porting rabit programs to Spark executors. The communication layer of rabit is an interface that can be remapped onto Spark's communication, or we can simply keep rabit's communication and use Spark as a container to run the workers.

birdandbees commented 8 years ago

Thanks!

khotilov commented 8 years ago

Curiously, the author of the Arborist random forest implementation claims (http://www.suiji.org/arborist) that with a version tuned for Nvidia GPUs, "preliminary spins indicate that 50x acceleration is achievable over versions tuned for multicore performance".

khotilov commented 8 years ago

It would also be useful to allow a richer representation of labels that would serve potential extensions such as multilabel classification, structured prediction, multitask learning, etc. Currently, I cannot even figure out a good place to put the censoring data when I try to think about how a survival model could be implemented.

A good refactoring option might be to store predictor and label columns together in the same matrix, and to have some interface to specify which columns contain what.

tqchen commented 8 years ago

@khotilov That seems readily achievable with a customized loss; note that we can pass a closure as the loss function, carrying the information you mentioned.
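For readers wondering what "pass a closure as the loss function" looks like in practice, here is a minimal sketch in the Python API. The custom objective receives the current predictions and the DMatrix and returns per-row gradients and Hessian values; any side information rides along in the closure. The `censored` array and the way it is used are purely illustrative assumptions, not xgboost concepts.

```python
import numpy as np
import xgboost as xgb

# Extra per-row information (e.g. censoring indicators) is captured by the closure;
# `censored` is a hypothetical boolean numpy array aligned with the training rows.
def make_objective(censored):
    def objective(preds, dtrain):
        labels = dtrain.get_label()
        grad = preds - labels                         # illustrative squared-error gradient
        grad = np.where(censored, 0.5 * grad, grad)   # illustrative use of the extra data
        hess = np.ones_like(preds)
        return grad, hess
    return objective

# Usage (placeholders):
# bst = xgb.train(params, dtrain, num_boost_round=50, obj=make_objective(censored))
```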

pommedeterresautee commented 8 years ago

@tqchen Is there a way to perform multi-label learning with the current implementation and a custom loss? I can't see how to do it without training several binary classifiers...

khotilov commented 8 years ago

@tqchen In some ways you are right. For models with scalar predictions, a custom loss approach is currently doable, if not ideal. But I was trying to think about what it would take to implement multivariate/structured learning within the xgboost framework without resorting to reduction approaches, and I thought that setting up the infrastructure basics would be the first step. However, it might also be reasonable to try implementing multivariate learners coupled with a custom loss. It might become somewhat heavyweight, since, I suppose, the loss function would need to compute a vector of gradients and a Hessian matrix for every case. Do you think it could be feasible for at least some limited dimensionality of outcomes? I remember seeing your paper on structured learning with boosting and CRFs, and there was some mention of feasibility issues with a direct implementation of gradient boosting in such a setting.

tqchen commented 8 years ago

If you mean vector trees, for example (decisions on variables, vector output for a multivariate score), we could do that. The tree template is actually already designed to support it, but it is not yet readily exposed.

The interface should remain modularized; normally we need a diagonal upper bound on the Hessian, except that now we pass in a matrix of gradients and second-order gradients (which can be represented as a vector, as we do now).
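To spell out the reasoning, here is a sketch (not from the thread) of the second-order objective when the prediction per instance is a vector $f_t(x_i) \in \mathbb{R}^K$, following the same Taylor expansion xgboost uses for scalar outputs:

```latex
\mathcal{L}^{(t)} \approx \sum_i \Big[\, g_i^{\top} f_t(x_i)
    + \tfrac{1}{2}\, f_t(x_i)^{\top} H_i\, f_t(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \nabla_{\hat{y}_i} \ell(y_i, \hat{y}_i),\quad
H_i = \nabla^{2}_{\hat{y}_i} \ell(y_i, \hat{y}_i).
```

Replacing each per-instance Hessian $H_i$ by a diagonal upper bound $\mathrm{diag}(h_i)$ decouples the $K$ outputs, which is the simplification referred to above.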

khotilov commented 8 years ago

@tqchen That's encouraging! I've read through the code, and I think I see what you mean. But I'm not yet confident in my ability to undertake such a task. While I understand the general idea of how multiple outcomes would influence splitting, I would need to write it out to understand how that would work in gradient boosting.

Also I see that the linear booster has support for multiple "output groups". I assume it was primarily intended for multiclass classification. It could probably be re-used with multivariate outcomes. But I think the code would need some refactoring for that. Do you have any opinion on that?

And it would probably make sense to have a separate issue for discussing multivariate outcomes, in order not to pollute the RoadMap. Someone else has also asked about multitask learning in #680.
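For reference, the existing multiclass setup that exercises those output groups looks roughly like this in the Python API (a self-contained toy sketch, not code from the thread):

```python
import numpy as np
import xgboost as xgb

# Tiny synthetic 3-class problem, just to exercise the multiclass path.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = rng.randint(0, 3, size=60)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "multi:softprob", "num_class": 3, "max_depth": 2}
bst = xgb.train(params, dtrain, num_boost_round=10)
preds = bst.predict(dtrain)  # shape (60, 3): one column per class / output group
```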

tqchen commented 8 years ago

Closing this issue and opening another one for this quarter.

The main goal of the code refactoring is finished. However, there are a few remaining things that I hope to address to make xgboost even more exciting.

qqwjq commented 8 years ago

That's really encouraging, Tianqi! Thanks for the excellent package. We are building some recommendation models and may need to have the xgboost model serialized/deserialized in JSON format, which makes it easy to transfer between different platforms. What is the current status of the JSON dump? Thanks
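For anyone landing here later: a per-tree JSON dump has since become available in the Python package. A minimal sketch is below; note the dump is meant for inspecting or transferring the tree structure, not as a full replacement for the binary model format.

```python
import numpy as np
import xgboost as xgb

# Toy model, just to show the JSON dump calls available in recent Python releases.
X = np.random.rand(50, 3)
y = (X[:, 0] > 0.5).astype(int)
bst = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                xgb.DMatrix(X, label=y), num_boost_round=5)

json_trees = bst.get_dump(dump_format="json")            # list of JSON strings, one per tree
bst.dump_model("model_dump.json", dump_format="json")    # or write the whole dump to a file
```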