Open blackgold opened 10 months ago
The default algorithm for large sizes is probably the ring algorithm, which can result in significant rounding errors when we add the last elements to the total sum.
SHARP, in comparison, is more like a tree algorithm which sums values of (more) equal magnitude, causing less deviation due to floating-point rounding errors.
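A minimal Python sketch (not NCCL or SHARP code) can illustrate the difference: a ring reduction behaves like one running total, while a tree reduction behaves like pairwise summation. The function names and sizes below are illustrative assumptions.

```python
import math
import random

def sequential_sum(xs):
    # Ring-style accumulation: a single running total, so toward the end
    # each small element is added to a much larger partial sum and its
    # low-order bits are rounded away.
    total = 0.0
    for x in xs:
        total += x
    return total

def pairwise_sum(xs):
    # Tree-style accumulation: operands at each level have comparable
    # magnitude, which keeps the accumulated rounding error smaller.
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

random.seed(0)
xs = [random.random() for _ in range(1 << 18)]
exact = math.fsum(xs)  # correctly rounded reference sum

err_sequential = abs(sequential_sum(xs) - exact)
err_pairwise = abs(pairwise_sum(xs) - exact)
print(err_sequential, err_pairwise)
```

In practice the tree-shaped sum lands much closer to the correctly rounded result, which matches the smaller deviation observed with SHARP.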
What is the minimum value that can indicate there are some H/W or S/W errors involved?
The value depends on many things; basically you should rely on whether the test prints "Errors : 0 OK" at the end or not.
On recent NCCL perf tests, this has been replaced by an error counter instead of an error margin, so this is no longer relevant.
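The distinction between an error margin and an error counter can be sketched as follows. This is an illustrative, hypothetical checker, not the actual nccl-tests implementation; the function name and tolerance are assumptions.

```python
def count_errors(actual, expected, rel_tol=1e-5):
    # Error *counter*: number of elements whose relative deviation exceeds
    # the tolerance. An error *margin*, by contrast, would report the
    # worst-case deviation itself (e.g. 1e-06) regardless of how many
    # elements were affected.
    errors = 0
    for a, e in zip(actual, expected):
        denom = max(abs(e), 1.0)  # guard against division by zero
        if abs(a - e) / denom > rel_tol:
            errors += 1
    return errors

expected = [1.0, 2.0, 3.0]
print(count_errors([1.0, 2.0, 3.0], expected))   # all elements within tolerance
print(count_errors([1.0, 2.5, 3.0], expected))   # one element deviates
```

With a counter, "Errors : 0" means no element exceeded the tolerance, which is easier to interpret than a raw deviation value.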
I ran the NCCL all-reduce test with and without SHARP and with --check=1. I wanted to check whether SHARP introduces any errors. I was expecting a count of the number of times the expected value of the reduction op doesn't match the actual value. However, I see 1e-06 (SHARP) vs 2e-06 (no SHARP). How should I interpret the error column?