dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[Roadmap] More robust metric calculation in distributed setting #4663

Open hcho3 opened 5 years ago

hcho3 commented 5 years ago

The current method of computing metrics (error, AUC, RMSE, etc.) is not quite robust in a distributed setting. Currently, the given metric is first computed locally on each node (using its own data shard), and then an average is taken via AllReduce. So

[Metric reported] = (1/n) * (  [Metric computed in partition 1]
                             + [Metric computed in partition 2]
                             + [Metric computed in partition 3]
                             + ...
                             + [Metric computed in partition n])

(Weighted average is used in practice, but for now, assume unweighted to simplify discussion.)

This approach is reasonable for error, MAE, and log likelihood. However, it is problematic for RMSE and AUC (in the binary classification setting): the square root in RMSE is non-linear, so an average of per-partition RMSEs is not the RMSE of the pooled data, and AUC depends on how predictions rank against each other across the whole dataset, not just within each partition.
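
To make the RMSE case concrete, here is a small illustration (the numbers and the NumPy usage are mine, not from the issue) showing that averaging per-partition RMSE values does not equal the RMSE of the pooled data, even when the partitions have equal size:

```python
import numpy as np

# Hypothetical labels/predictions split across two partitions with uneven errors.
y1, p1 = np.array([0.0, 0.0]), np.array([0.0, 0.0])   # partition 1: perfect fit
y2, p2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])   # partition 2: large errors

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

avg_of_rmse = (rmse(y1, p1) + rmse(y2, p2)) / 2        # 1.0
global_rmse = rmse(np.concatenate([y1, y2]),
                   np.concatenate([p1, p2]))           # sqrt(2) ~= 1.414

print(avg_of_rmse, global_rmse)  # the two disagree whenever errors are uneven
```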

Solution: For each metric, we need tailored steps to obtain a robust estimate. For example, RMSE requires aggregating the sum of squared errors and the total weight across partitions before taking the square root, and AUC requires ranking information that spans all partitions; see the sketch below.
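
The issue does not spell out the per-metric steps here; the following is only a hedged sketch of what a tailored step could look like for RMSE. The allreduce_sum helper is a stand-in for a rabit/NCCL AllReduce across workers, not an actual XGBoost API, and the shards are hypothetical:

```python
import numpy as np

def local_rmse_stats(y, pred, weight=None):
    """Per-worker partial statistics: weighted sum of squared errors and total weight."""
    weight = np.ones_like(y) if weight is None else weight
    sse = float(np.sum(weight * (y - pred) ** 2))
    wsum = float(np.sum(weight))
    return np.array([sse, wsum])

def allreduce_sum(partials):
    """Stand-in for an AllReduce sum across workers; here just a local sum over shards."""
    return np.sum(partials, axis=0)

# Hypothetical data shards held by two workers.
shards = [
    (np.array([0.0, 0.0]), np.array([0.0, 0.0])),
    (np.array([0.0, 0.0]), np.array([2.0, 2.0])),
]

totals = allreduce_sum([local_rmse_stats(y, p) for y, p in shards])
global_rmse = np.sqrt(totals[0] / totals[1])   # sqrt(8 / 4) ~= 1.414, matches the pooled data
print(global_rmse)
```

AUC does not reduce to a pair of partial sums in the same way, since it depends on the global ordering of predictions; a robust distributed estimate would need either gathering scores across workers or a more involved (e.g., bucketed/approximate) aggregation scheme.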

cc @yinlou @thvasilo @CodingCat

yinlou commented 5 years ago

Thanks for starting this thread!

hcho3 commented 5 years ago

@RAMitchell @trivialfis This may also affect metric calculation on multiple GPUs.

hcho3 commented 5 years ago

@ericangelokim This may be of interest to you