Open wujingyue opened 1 week ago
To reproduce the error, check out `wjy/error` (see 7eb2f435cfb152da29199ab5ba6eee20d73135f7 for the change) and run `_bn && mpirun -np 2 bin/test_multidevice --gtest_filter=DistributedMatmulTest.MLP_Layer*`.
You'll see that `use_aten_matmul==true` leads to the following error, while `use_aten_matmul==false` passes within 5e-3.
Validation error in output 0 on line 583 in file /opt/pytorch/nvfuser/tests/cpp/test_multidevice_matmul.cpp.
Detected abs error of: 0.122498
absolute tolerance was set to 0.005
and relative tolerance set to 5e-05
Note that the reported `Detected abs error of: 0.122498` is not the max absolute error; the max is at least 4. This motivates a side feature request: print the max absolute error instead of the first (?) one detected.
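To make the feature request concrete, here is a minimal sketch of a comparison that records both the first violating element and the largest absolute error. It is not nvFuser's validator: `ErrorReport` and `compareWithTolerance` are hypothetical names, and the acceptance rule `|actual - expected| <= atol + rtol * |expected|` is an assumption that may differ from what the real check does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical report type, not part of nvFuser.
struct ErrorReport {
  bool passed = true;
  std::size_t first_index = 0;   // first element that violates the tolerance
  double first_abs_error = 0.0;  // its absolute error
  std::size_t max_index = 0;     // element with the largest absolute error
  double max_abs_error = 0.0;    // largest absolute error over all elements
};

// Assumed acceptance rule: |actual - expected| <= atol + rtol * |expected|.
ErrorReport compareWithTolerance(
    const std::vector<double>& actual,
    const std::vector<double>& expected,
    double atol,
    double rtol) {
  assert(actual.size() == expected.size());
  ErrorReport report;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    const double abs_error = std::abs(actual[i] - expected[i]);
    const bool violates = abs_error > atol + rtol * std::abs(expected[i]);
    // Remember only the first violation, which is what an early-exit
    // check would report.
    if (violates && report.passed) {
      report.passed = false;
      report.first_index = i;
      report.first_abs_error = abs_error;
    }
    // Keep scanning so the worst element is also known.
    if (abs_error > report.max_abs_error) {
      report.max_abs_error = abs_error;
      report.max_index = i;
    }
  }
  return report;
}
```

With `expected = {1, 2, 3, 4}`, `actual = {1, 2.125, 3, 8}`, `atol = 0.005` and `rtol = 5e-05`, the first violation is at index 1 with an error of 0.125, while the max error of 4 sits at index 3. Printing both would show at a glance how far off the output really is.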
cc @Priya2698
Validation error in output 0 (linear1) on line 583 in file /tests/cpp/test_multidevice_matmul.cpp. Detected abs error of: 0.122498 absolute tolerance was set to 0.005 and relative tolerance set to 5e-05
Validation error in output 2 (linear2) on line 583 in file tests/cpp/test_multidevice_matmul.cpp. Detected abs error of: 4.08847 absolute tolerance was set to 2 and relative tolerance set to 0.02
Originally posted by @cowanmeg in https://github.com/NVIDIA/Fuser/issues/2360#issuecomment-2189981596