dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Why is xgboost great? #5149

Closed ThyrixYang closed 4 years ago

ThyrixYang commented 4 years ago

Hi,

I want to show that xgboost is strictly better than deep learning models such as MLPs or CNNs in some settings. Could someone give advice on dataset selection? Ideally the dataset would be one widely used in research papers.

Thank you.

trivialfis commented 4 years ago

There's no free lunch...

ThyrixYang commented 4 years ago

> There's no free lunch...

That means there must be some datasets on which xgboost does better than deep models; those are exactly the ones I'm looking for.

hcho3 commented 4 years ago

There are some domains where deep learning (neural networks) excels: computer vision, natural language processing, and reinforcement learning. These domains involve unstructured or semi-structured data (pixels, sequences, state spaces).

On the other hand, XGBoost is a good choice if you have tabular data, i.e. data where each feature has a well-defined meaning. There are several reasons why you may want to choose XGBoost (or other tree-based algorithms) over deep learning in that setting.

trivialfis commented 4 years ago

Actually, tree models also excel at image segmentation and similar tasks; it's just that XGBoost is not currently optimized for wide datasets.

trivialfis commented 4 years ago

See https://www.microsoft.com/en-us/research/publication/decision-forests-a-unified-framework-for-classification-regression-density-estimation-manifold-learning-and-semi-supervised-learning/ for some applications of tree models on image tasks.