dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.34k stars 8.73k forks

[Dask] Sort and partition for ranking. #11007

Open trivialfis opened 6 days ago

trivialfis commented 6 days ago

Please see the tutorial changes for details on the new features. In summary, the Dask interface now offers two ways to handle query groups, depending on whether a global sort is desired.

Please note that this PR ensures the accuracy of the model at a heavy performance cost. A global sort is expensive, and pushing a huge number of small partitions into the QuantileDMatrix is also inefficient.
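For readers unfamiliar with why sorting matters for ranking, here is a minimal sketch of the general idea behind "sort and partition": order rows by query ID, then cut partitions only at query-group boundaries so that no group is split. It is plain Python and not the code in this PR; the `(qid, feature)` row layout and the `sort_and_partition` helper are hypothetical names used only for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# Hypothetical row layout for illustration: (query id, single feature value).
Row = Tuple[int, float]


def sort_and_partition(rows: Sequence[Row], n_partitions: int) -> List[List[Row]]:
    """Sort rows by qid, then split into at most `n_partitions` chunks
    without ever splitting one query group across two chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")

    # Global sort by query id (Python's sort is stable, so the order
    # of rows within a query group is preserved).
    ordered = sorted(rows, key=lambda r: r[0])

    # Collect contiguous rows that share a qid into groups.
    groups = [list(g) for _, g in groupby(ordered, key=lambda r: r[0])]

    # Target rows per partition (ceiling division), at least 1.
    target = max(1, -(-len(ordered) // n_partitions))

    partitions: List[List[Row]] = []
    current: List[Row] = []
    for group in groups:
        would_overflow = len(current) + len(group) > target
        can_open_new = len(partitions) < n_partitions - 1
        if current and would_overflow and can_open_new:
            partitions.append(current)
            current = []
        current.extend(group)  # a whole group always lands in one partition
    if current:
        partitions.append(current)
    return partitions


rows = [(2, 0.1), (1, 0.2), (2, 0.3), (3, 0.4), (1, 0.5), (3, 0.6)]
parts = sort_and_partition(rows, n_partitions=2)
for i, part in enumerate(parts):
    print(i, part)
```

The sketch also hints at the trade-off noted above: the sort touches every row, and choosing a large `n_partitions` yields many small chunks, each of which would have to be fed to the downstream matrix construction separately.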

related: