When working on #2905, I had to manually set the threshold to 0.01, which seemed to be good enough for up to 6 GPUs. testValidate's default thresholds don't work out of the box.
Here are several reasons that I'm aware of:
bfloat led to a larger error than half for the particular MLP_Layer test. This is probably true in general as well, because bfloat keeps fewer mantissa bits than half (8 vs. 11, counting the implicit bit). testValidate, however, uses the same inner product threshold for bfloat as for half.
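To illustrate the precision gap, here is a minimal numpy sketch (not nvFuser code; the bfloat16 rounding helper is a hand-rolled assumption, simulating round-to-nearest-even by truncating a float32 to its top 16 bits). A value that half represents exactly already loses bits under bfloat:

```python
import numpy as np

def to_bfloat16(x):
    """Round a float32 to bfloat16 precision (round-to-nearest-even),
    returning the result widened back to float32."""
    b = np.float32(x).view(np.uint32)
    rounded = (b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000
    return np.uint32(rounded).view(np.float32)

x = np.float32(1.0 + 2.0**-9)        # exactly representable in half's 10 stored mantissa bits
as_half = np.float32(np.float16(x))  # survives the fp16 round trip unchanged
as_bf = to_bfloat16(x)               # bfloat's 7 stored mantissa bits round it down to 1.0

print(float(as_half), float(as_bf))
```

Relative spacing near 1.0 is 2^-10 for half but 2^-7 for bfloat, which is why a single threshold tuned for half is too tight for bfloat.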
ncclAllReduce is found to sum along the device dimension with low-precision accumulation. Due to https://github.com/NVIDIA/Fuser/issues/2904, this is not factored in when the reduction result is later cast to higher precision and used by a fusion output. The problem gets worse with more GPUs: 0.01 wasn't enough when I ran the test with `mpirun -np 8`, even though the reduction size stayed the same. As an incidental note, I tried changing the reference implementation to sum along the device dimension in bfloat. That, however, didn't seem to lower the max abs diff.
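The effect of low-precision accumulation across ranks can be sketched in numpy (again not NCCL code; the per-rank partial sums and the rank count are made-up illustration values, and the helper simulates bfloat16 rounding by truncating a float32). Rounding after every add lets small contributions get absorbed, and the lost mass grows with the number of ranks even though each rank's local reduction is unchanged:

```python
import numpy as np

def round_bf16(x):
    # Round float32 value(s) to bfloat16 precision (nearest-even), keep float32.
    b = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000).astype(np.uint32).view(np.float32)

# One partial sum per "GPU": a large term on rank 0, small terms elsewhere.
n_ranks = 9
partials = np.array([256.0] + [0.5] * (n_ranks - 1), dtype=np.float32)

# Low-precision accumulation: round to bfloat16 after every add, as a
# reduction that accumulates in the element type would.
acc = partials[0]
for v in partials[1:]:
    acc = round_bf16(acc + v)

# Reference: accumulate in float32, round once at the end.
ref = round_bf16(partials.sum(dtype=np.float32))

print(float(acc), float(ref))  # every 0.5 is absorbed: 256.0 vs. 260.0
```

Each `256 + 0.5` step rounds back to 256 because bfloat's spacing at that magnitude is 2, so the low-precision result never moves; doubling the rank count doubles the absorbed error, matching the observation that a threshold tuned at 6 GPUs fails at 8.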