dthiagarajan closed this issue 1 year ago
It's expected. Quantiles between workers are merged based on GK (Greenwald–Khanna) sketching, which is an approximation algorithm rather than an exact quantile computation. Also, due to floating-point errors, the histogram might differ.
How are the bounds of each bin computed? Are these based on feature statistics or gradient statistics?
If you mean the boundary of each bin, it's based on the feature values: the quantiles of each feature.
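As a rough illustration of quantile-based binning (this is not XGBoost's actual code; the `max_bins` value and the evenly spaced quantiles are assumptions made for the sketch):

```python
import numpy as np

# Toy illustration: derive histogram bin boundaries for one feature
# from its quantiles, then map each value to a bin.
rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

max_bins = 8  # hypothetical bin count for the sketch
# Bin edges taken at evenly spaced quantiles of the feature values.
edges = np.quantile(feature, np.linspace(0, 1, max_bins + 1))

# Each value falls into the bin whose edges bracket it.
bin_index = np.searchsorted(edges[1:-1], feature, side="right")
print(len(edges), bin_index.min(), bin_index.max())
```

Because the edges come from quantiles, each bin holds roughly the same number of samples, which is what makes the histogram robust to skewed feature distributions.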
And is it possible that the quantiles on each worker differ?
No, the quantiles are the same for all workers. But they can differ if there is a different number of workers: with the same data, using one worker might generate different quantiles than using two workers.
How are weights incorporated if `sample_weight` is specified?
In many places: gradient calculation, quantile computation, and evaluation metrics.
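For instance, with a squared-error objective, the sample weight typically scales both the gradient and the hessian. A minimal sketch (not XGBoost's internal implementation; the function name is made up):

```python
import numpy as np

# Sketch of how per-sample weights enter gradient/hessian computation
# for a squared-error objective (illustrative, not XGBoost internals).
def weighted_squared_error_grad_hess(preds, labels, sample_weight):
    grad = sample_weight * (preds - labels)     # weight scales the gradient
    hess = sample_weight * np.ones_like(preds)  # and the (constant) hessian
    return grad, hess

preds = np.array([0.5, 2.0, 1.0])
labels = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 2.0, 0.5])
grad, hess = weighted_squared_error_grad_hess(preds, labels, w)
print(grad, hess)  # grad: [-0.5, 2.0, 0.0], hess: [1.0, 2.0, 0.5]
```

The same weights also feed into the weighted quantile sketch (so heavily weighted samples pull the bin boundaries toward themselves) and into weighted averages inside evaluation metrics.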
Got it, thanks! Is there anywhere I can read through how the GK sketching you mentioned works?
Also, is it possible to switch a flag on/off to compute the exact quantile?
I think you can find references in the XGBoost paper, along with the related proofs.
We don't use exact quantiles. In a distributed environment, sorting all the data is simply not practical for machine learning workloads.
Sorry, I'm still a bit confused: if I'm using the same dataset, why would the quantiles differ based on the number of workers? I understand that each worker has its own slice of the dataset, but if the slices are merged to determine the quantiles, shouldn't the number of workers not matter?
The merge operation is an approximation algorithm; a different number of workers requires a different number of merges, generating different approximation results.
Hello,
I'm trying to run regression with distributed XGBoost, but I'm noticing that the results are different when I use distribution vs. when I train with a single process.
I used a custom objective function so I could verify that the gradients and hessians (local to each worker) I use match with what's used in single-process training. However, even after verifying that the gradients and hessians match, I find that the performance is not the same between distributed training and single-process training, and I'm noticing that in my experiments, the performance for distributed training is consistently worse.
I'd like to understand why this is the case. Is this something that is particular to the `hist` algorithm? Specifically: how are weights incorporated if `sample_weight` is specified?