The current method of computing metrics (error, AUC, RMSE, etc.) is not robust in a distributed setting. Currently, the given metric is first computed locally on each node (over its own data shard), and then the average is taken via AllReduce:
```
[Metric reported] = (1/n) * ( [Metric computed in partition 1]
                            + [Metric computed in partition 2]
                            + [Metric computed in partition 3]
                            + ...
                            + [Metric computed in partition n] )
```
(Weighted average is used in practice, but for now, assume unweighted to simplify discussion.)
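For concreteness, here is a minimal single-process sketch of the current scheme, with plain NumPy standing in for AllReduce (`current_scheme`, `partitions`, and `metric` are hypothetical names for illustration):

```python
import numpy as np

def current_scheme(partitions, metric):
    """Average per-partition metric values, as the AllReduce average would.

    partitions: list of (y_true, y_pred) shards, one per worker.
    metric: a callable such as mean absolute error.
    """
    local_values = [metric(y_true, y_pred) for y_true, y_pred in partitions]
    return np.mean(local_values)  # stands in for the AllReduce average
```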
This approach is reasonable for error, MAE, and log likelihood: these metrics are themselves per-example averages (or sums), so averaging the partition values recovers the global value, exactly so when partitions are weighted by their sizes. However, it is problematic for RMSE and AUC (in the binary classification setting).
In general, the average of the per-partition RMSEs != the global RMSE over all data points, because of the square root applied in RMSE: by Jensen's inequality, the average of square roots understates the square root of the average. It remains to be seen how far off we are from the true RMSE.
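A quick numeric illustration on hypothetical data: when the error scale differs across partitions, the averaged value can be noticeably off.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
# Heteroscedastic errors: small noise on the first half, large on the second.
pred = y + rng.normal(scale=[0.1] * 500 + [2.0] * 500)

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

p1 = rmse(y[:500], pred[:500])   # ~0.1
p2 = rmse(y[500:], pred[500:])   # ~2.0
print((p1 + p2) / 2)             # average of per-partition RMSEs, ~1.05
print(rmse(y, pred))             # global RMSE, ~1.42, noticeably larger
```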
The local estimate of AUC computed in a single partition is potentially very noisy, since a partition might contain an unusually high or low number of positive labels (or none at all, in which case AUC is undefined).
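This is easy to see on hypothetical imbalanced data: with roughly 10 positives per shard in the sketch below, local AUC estimates scatter widely around the global value (requires scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Global data: ~1% positives, mildly informative scores.
y = (rng.random(10_000) < 0.01).astype(int)
scores = y * 0.3 + rng.random(10_000)

# Shuffle into 10 partitions; some may contain almost no positives.
idx = rng.permutation(len(y))
local_aucs = []
for shard in np.array_split(idx, 10):
    if y[shard].min() == y[shard].max():
        continue  # AUC undefined: shard contains only one class
    local_aucs.append(roc_auc_score(y[shard], scores[shard]))

print(np.std(local_aucs))        # wide spread across shards
print(roc_auc_score(y, scores))  # vs. the single global estimate
```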
Solution: For each metric, we need tailored steps to obtain a robust estimate. For example:
RMSE can be computed exactly by performing AllReduce twice: first to compute the global sum of squared deviances, then to compute the total number of data points.
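A minimal sketch of that computation, with a plain Python sum standing in for the AllReduce SUM (`allreduce_sum` and `exact_distributed_rmse` are hypothetical names):

```python
import numpy as np

def allreduce_sum(local_values):
    """Stand-in for an AllReduce SUM across workers (here, a Python sum)."""
    return sum(local_values)

def exact_distributed_rmse(partitions):
    """partitions: list of (y_true, y_pred) shards, one per worker."""
    # AllReduce #1: global sum of squared deviances.
    sq_sum = allreduce_sum([np.sum((yt - yp) ** 2) for yt, yp in partitions])
    # AllReduce #2: global number of data points.
    count = allreduce_sum([len(yt) for yt, _ in partitions])
    return np.sqrt(sq_sum / count)
```

In practice the two sums can be packed into a single length-2 AllReduce, saving one synchronization round.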
It is impractical to compute AUC exactly, since we don't want to sort the training data in a distributed setting. Instead, we can produce a robust, accurate estimate of AUC using approximate quantile sketches (a code sketch of the procedure follows the list):
1. Compute the prediction probabilities in each local partition.
2. Build approximate quantile sketches over the computed probabilities, separately for the positive and the negative class.
3. Use the merged sketches to estimate the True Positive Rate (TPR) and False Positive Rate (FPR) at each threshold candidate (usually grid points equally spaced between 0 and 1), and integrate the resulting ROC curve to get AUC.
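A runnable sketch of this procedure, using fixed-bin histograms of the probabilities as a simple stand-in for the approximate quantile sketches (all names are hypothetical):

```python
import numpy as np

def approx_distributed_auc(partitions, n_bins=256):
    """Estimate AUC from per-class histograms of predicted probabilities.

    Each worker would build pos_hist/neg_hist on its own shard; the
    histograms are then merged with an AllReduce SUM (here, a plain loop).
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)  # threshold candidates
    pos_hist = np.zeros(n_bins)
    neg_hist = np.zeros(n_bins)
    for y_true, proba in partitions:  # each iteration = one worker's shard
        pos_hist += np.histogram(proba[y_true == 1], bins=edges)[0]
        neg_hist += np.histogram(proba[y_true == 0], bins=edges)[0]
    # TPR/FPR at threshold edges[i] = fraction of scores at or above it.
    tpr = np.cumsum(pos_hist[::-1])[::-1] / pos_hist.sum()
    fpr = np.cumsum(neg_hist[::-1])[::-1] / neg_hist.sum()
    # Close the ROC curve at (0, 0) and integrate with the trapezoid rule.
    fpr = np.append(fpr, 0.0)
    tpr = np.append(tpr, 0.0)
    order = np.argsort(fpr)
    f, t = fpr[order], tpr[order]
    return np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2.0)
```

The accuracy of the estimate is controlled by the sketch resolution (`n_bins` here); a real quantile sketch adapts its boundaries to the data rather than using a fixed grid.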
cc @yinlou @thvasilo @CodingCat