Large error when using matmul in the distributed matmul test.

wujingyue commented 1 week ago

@wujingyue - I added the MLP test with aten matmul. Note that the tolerance is bumped up a bit to pass validation.

Validation error in output 0 (linear1) on line 583 in file /tests/cpp/test_multidevice_matmul.cpp.
  Detected abs error of: 0.122498
    absolute tolerance was set to 0.005
    and relative tolerance set to 5e-05

Validation error in output 2 (linear2) on line 583 in file tests/cpp/test_multidevice_matmul.cpp.
  Detected abs error of: 4.08847
    absolute tolerance was set to 2
    and relative tolerance set to 0.02

Originally posted by @cowanmeg in https://github.com/NVIDIA/Fuser/issues/2360#issuecomment-2189981596

wujingyue commented 1 week ago

To reproduce the error, check out wjy/error (see 7eb2f435cfb152da29199ab5ba6eee20d73135f7 for the change) and run:

  _bn && mpirun -np 2 bin/test_multidevice --gtest_filter=DistributedMatmulTest.MLP_Layer*

You'll see that use_aten_matmul==true leads to the following error, while use_aten_matmul==false passes with an absolute tolerance of 5e-3.

Validation error in output 0 on line 583 in file /opt/pytorch/nvfuser/tests/cpp/test_multidevice_matmul.cpp.
  Detected abs error of: 0.122498
    absolute tolerance was set to 0.005
    and relative tolerance set to 5e-05

Note that the reported "Detected abs error of: 0.122498" is not the max absolute error; the max is at least 4. This motivates a side feature request: print the max absolute error instead of the first (?) one detected.
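
To make the feature request concrete, here is a minimal sketch of a comparison that records both the first out-of-tolerance element and the largest absolute error. It is not nvFuser's validator: ErrorReport and compareOutputs are hypothetical names, the inputs are plain vectors rather than tensors, and the allclose-style check |actual - expected| <= atol + rtol * |expected| is an assumption about how the tolerances are combined.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of one output comparison (not an nvFuser type).
struct ErrorReport {
  bool has_failure = false;             // true if any element is out of tolerance
  std::size_t first_failing_index = 0;  // valid only when has_failure is true
  std::size_t max_error_index = 0;      // element with the largest abs error
  double max_abs_error = 0.0;           // largest |actual - expected| seen
};

// Assumes both vectors have the same length and an allclose-style check:
//   |actual - expected| <= atol + rtol * |expected|
ErrorReport compareOutputs(
    const std::vector<double>& actual,
    const std::vector<double>& expected,
    double atol,
    double rtol) {
  ErrorReport report;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    const double abs_error = std::abs(actual[i] - expected[i]);
    const double allowed = atol + rtol * std::abs(expected[i]);
    // Remember only the first violation, as the current message appears to do.
    if (abs_error > allowed && !report.has_failure) {
      report.has_failure = true;
      report.first_failing_index = i;
    }
    // Keep scanning so the worst element is reported as well.
    if (abs_error > report.max_abs_error) {
      report.max_abs_error = abs_error;
      report.max_error_index = i;
    }
  }
  return report;
}
```

With a report like this, the validation message could print max_abs_error and its index alongside the first failure, which would have surfaced the error of at least 4 directly.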

wujingyue commented 1 week ago

cc @Priya2698

Large error when using matmul in the distributed matmul test. #2460