Open wujingyue opened 2 months ago
The reverse may happen too. Consider a reduce-sum of a half tensor followed by a cast to float. testValidate will try to use the fp32 threshold even though the accumulation happened in half, potentially causing false positives. Note that a reduce-sum in half can't be codegened (because half addition is not implemented). However, when the reduction is along the device dimension, it can still be lowered to communications like ncclAllReduce or ncclReduceScatter, which accumulate in half according to https://github.com/NVIDIA/nccl/issues/1026.
This can lead to false negatives because the threshold is overly relaxed.
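
To make the half-accumulation case concrete, here is a minimal sketch in plain Python that emulates a half-precision running sum and compares it against a higher-precision reference. The helper names (`to_half`, `half_sum`) and both tolerances are hypothetical and chosen only for illustration. They are not testValidate's actual thresholds, and the loop is not how NCCL orders its reduction.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE binary16 value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def half_sum(values):
    """Sequential sum whose accumulator is rounded to half after every add."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc


def relative_error(actual: float, expected: float) -> float:
    return abs(actual - expected) / abs(expected)


# Hypothetical tolerances for illustration only (assumptions, not the
# values used by testValidate).
FP32_RTOL = 1e-5
HALF_RTOL = 1e-2

# 64 half-representable inputs, all equal to 0.1 rounded to half.
values = [to_half(0.1)] * 64

reference = math.fsum(values)   # accumulated in double precision
result = half_sum(values)       # accumulated in half precision
err = relative_error(result, reference)

print(f"reference={reference!r} half_sum={result!r} rel_err={err:.2e}")
print("passes fp32-style tolerance:", err <= FP32_RTOL)
print("passes half-style tolerance:", err <= HALF_RTOL)
```

With these inputs the half accumulation drifts from the reference by a relative error on the order of 1e-3. That is acceptable under the half-style tolerance, but it would be flagged as a mismatch if the validator picked an fp32-style tolerance based on the float output dtype.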