dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Distributed training discrepancies with `hist` tree method #8476

Closed dthiagarajan closed 1 year ago

dthiagarajan commented 1 year ago

Hello,

I'm trying to run regression with distributed XGBoost, but I'm noticing that the results are different when I use distributed training vs. when I train with a single process.

I used a custom objective function so I could verify that the gradients and hessians (local to each worker) match what's used in single-process training. However, even after verifying that the gradients and hessians match, I find that the performance is not the same between distributed training and single-process training, and in my experiments the performance of distributed training is consistently worse.
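
For context, my setup looks roughly like the sketch below: a plain squared-error custom objective trained with the native API. The data, parameters, and function names here are placeholders rather than my exact code.

```python
import numpy as np
import xgboost as xgb

# Sketch of the setup: a squared-error custom objective used to inspect the
# gradients/hessians XGBoost sees on each boosting round. Data and parameters
# are placeholders.
def squared_error_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
    y = dtrain.get_label()
    grad = predt - y              # d/dpred of 0.5 * (pred - y)^2
    hess = np.ones_like(predt)
    # In my runs I log grad/hess here and compare the single-process run
    # against each distributed worker's local slice.
    return grad, hess

X = np.random.rand(1000, 10)
y = np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train(
    {"tree_method": "hist", "max_depth": 6},
    dtrain,
    num_boost_round=10,
    obj=squared_error_obj,
)
```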

I'd like to understand why this is the case. Is this something that is particular to the hist algorithm? Specifically:

  1. How are the bounds of each bin computed? Are these based on feature statistics, or gradient statistics? And is it possible that the quantiles on each worker differ, resulting in the consolidated histogram being different for distributed training vs. single-process training?
  2. How are weights incorporated if sample_weight is specified?
trivialfis commented 1 year ago

It's expected. Quantiles between workers are merged based on GK (Greenwald-Khanna) sketching, which is an approximation algorithm rather than an exact quantile computation. Also, due to floating point errors, the histograms might differ.

How are the bounds of each bin computed? Are these based on feature statistics, or gradient statistics?

If you mean the boundaries of each bin, they're based on feature values: the quantiles of each feature.
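
As a rough illustration (this is not XGBoost's actual sketch implementation, just the idea), the bin edges for a feature are quantile cut points of that feature's values, so bins hold roughly equal numbers of samples. The helper name `feature_bin_edges` is made up; `max_bin` mirrors the hist parameter of the same name.

```python
import numpy as np

# Illustrative only: bin boundaries come from quantiles of the feature's
# values, not from gradient statistics.
def feature_bin_edges(feature_values: np.ndarray, max_bin: int = 256) -> np.ndarray:
    interior = np.linspace(0, 1, max_bin + 1)[1:-1]        # interior quantile levels
    return np.unique(np.quantile(feature_values, interior))

x = np.random.lognormal(size=10_000)
edges = feature_bin_edges(x, max_bin=16)
binned = np.digitize(x, edges)   # integer bin index per sample, which is what hist works with
```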

And is it possible that the quantiles on each worker differ,

No, quantiles are the same for all workers. But they can differ if the number of workers changes: with the same data, using one worker might generate different quantiles than using two workers.

How are weights incorporated if sample_weight is specified?

In many places: gradient calculation, quantile computation, and evaluation metrics.
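
For example, for a squared-error objective the per-sample weight simply scales the gradient and hessian. A sketch (illustrative, not the internal implementation; `weighted_squared_error_obj` is a made-up name):

```python
import numpy as np
import xgboost as xgb

# Sketch of how per-sample weights typically enter the gradient/hessian
# computation for squared error (illustrative, not XGBoost's internal code).
def weighted_squared_error_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
    y = dtrain.get_label()
    w = dtrain.get_weight()
    if w.size == 0:                    # no weights supplied: treat samples equally
        w = np.ones_like(predt)
    grad = w * (predt - y)
    hess = w * np.ones_like(predt)
    return grad, hess

# sample_weight passed to the sklearn wrapper ends up as the DMatrix weight:
X, y, w = np.random.rand(100, 5), np.random.rand(100), np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y, weight=w)
```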

dthiagarajan commented 1 year ago

Got it, thanks! Is there anywhere I can read about how the GK sketching you mentioned works?

Also, is it possible to switch a flag on/off to compute the exact quantile?

trivialfis commented 1 year ago

I think you can find references in xgboost's paper along with related proofs.

We don't compute exact quantiles. In a distributed environment, sorting the data is simply not practical for machine learning.

dthiagarajan commented 1 year ago

Sorry, I'm still a bit confused: if I'm using the same dataset, why would the quantiles be different based on the number of workers? I understand that each worker has its own slice of the dataset, but if the slices are being merged to determine the quantiles, shouldn't the number of workers not matter?

trivialfis commented 1 year ago

The merge operation is an approximation algorithm; a different number of workers requires a different number of merges, generating different approximation results.
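
A toy caricature of the effect (this is not the GK/weighted sketch XGBoost actually uses, and `compress`/`merged_sketch` are made-up helpers; it only illustrates why the partitioning matters):

```python
import numpy as np

# Each "worker" compresses its local slice to a fixed number of candidate
# cut points; the candidates are then merged and compressed again. Every
# compression step is an approximation, so the final cut points depend on
# how the data was partitioned.
def compress(values: np.ndarray, k: int = 64) -> np.ndarray:
    return np.quantile(values, np.linspace(0, 1, k))

def merged_sketch(data: np.ndarray, n_workers: int, k: int = 64) -> np.ndarray:
    shards = np.array_split(data, n_workers)
    candidates = np.concatenate([compress(s, k) for s in shards])
    return compress(candidates, k)     # second round of approximation

rng = np.random.default_rng(0)
data = rng.lognormal(size=100_000)

one = merged_sketch(data, n_workers=1)
four = merged_sketch(data, n_workers=4)
print(np.max(np.abs(one - four)))      # small but non-zero: cut points differ
```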