NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

`FusionRepro2094_CUDA` is failing #374

Open zasdfgbnm opened 1 year ago

zasdfgbnm commented 1 year ago
[ RUN      ] NVFuserTest.FusionRepro2094_CUDA
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first, true) INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser2/test/validator.h":399, please report a bug to PyTorch. 

Validation error in output 1 on line 6213 in file /home/gaoxiang/Fuser2/test/test_gpu3.cpp.
  Detected abs error of: 1.52588e-05
    absolute tolerance was set to 1.68222e-06
    and relative tolerance set to 2.23704e-06
Exception raised from testValidate at /home/gaoxiang/Fuser2/test/validator.h:399 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2e69cb25b7 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2e69c70719 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3f (0x7f2e69cb053f in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x2febd9 (0x55d37ba68bd9 in ./build/nvfuser_tests)
frame #4: <unknown function> + 0x318d2a (0x55d37ba82d2a in ./build/nvfuser_tests)
frame #5: <unknown function> + 0x55de57 (0x55d37bcc7e57 in ./build/nvfuser_tests)
frame #6: <unknown function> + 0x55249d (0x55d37bcbc49d in ./build/nvfuser_tests)
frame #7: <unknown function> + 0x55266d (0x55d37bcbc66d in ./build/nvfuser_tests)
frame #8: <unknown function> + 0x552db7 (0x55d37bcbcdb7 in ./build/nvfuser_tests)
frame #9: <unknown function> + 0x553764 (0x55d37bcbd764 in ./build/nvfuser_tests)
frame #10: <unknown function> + 0x55e3c7 (0x55d37bcc83c7 in ./build/nvfuser_tests)
frame #11: <unknown function> + 0x552774 (0x55d37bcbc774 in ./build/nvfuser_tests)
frame #12: <unknown function> + 0x134f4a (0x55d37b89ef4a in ./build/nvfuser_tests)
frame #13: <unknown function> + 0x23850 (0x7f2e08239850 in /usr/lib/libc.so.6)
frame #14: __libc_start_main + 0x8a (0x7f2e0823990a in /usr/lib/libc.so.6)
frame #15: <unknown function> + 0x167d35 (0x55d37b8d1d35 in ./build/nvfuser_tests)
" thrown in the test body.
[  FAILED  ] NVFuserTest.FusionRepro2094_CUDA (3374 ms)

naoyam commented 1 year ago

Might this be a rounding issue? Does it repro deterministically?
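
One way to frame the rounding question is to look at the tolerance check itself. The sketch below assumes the validator's `allclose` call follows the elementwise criterion documented for `torch.allclose`, `|input - other| <= atol + rtol * |other|`, and that the reported abs error belongs to the element that failed. Neither assumption is confirmed in this thread, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Elementwise criterion documented for torch.allclose (assumed to apply here):
//   |input - other| <= atol + rtol * |other|
bool withinTolerance(double input, double other, double rtol, double atol) {
  assert(rtol >= 0.0 && atol >= 0.0);
  return std::fabs(input - other) <= atol + rtol * std::fabs(other);
}

// Smallest |other| for which an absolute error of absErr still passes the check.
// Hypothetical helper, obtained by solving the inequality above for |other|.
double minPassingMagnitude(double absErr, double rtol, double atol) {
  assert(rtol > 0.0 && atol >= 0.0);
  return absErr <= atol ? 0.0 : (absErr - atol) / rtol;
}
```

With the tolerances from the log, `minPassingMagnitude(1.52588e-05, 2.23704e-06, 1.68222e-06)` comes out near 6.07, so under these assumptions the failing element would have a magnitude below that.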

jacobhinkle commented 1 year ago

We should probably set the seed in this test. If other changes alter the execution order, this might pop up again.
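
A minimal sketch of why a fixed seed helps, using only the C++ standard library rather than the real test code. `makeInputs` is a hypothetical stand-in for however the test builds its random inputs. In the actual test the equivalent step would presumably be a call such as `at::manual_seed(0)` before the inputs are created, but that placement is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the test's input generation: the values depend
// only on the seed, not on what else consumed random numbers beforehand.
std::vector<float> makeInputs(std::uint64_t seed, std::size_t n) {
  std::mt19937_64 gen(seed);  // generator seeded explicitly
  std::normal_distribution<float> dist(0.0f, 1.0f);
  std::vector<float> values(n);
  for (float& v : values) {
    v = dist(gen);
  }
  return values;
}
```

Calling `makeInputs` twice with the same seed yields identical vectors, so a tolerance failure tied to particular input values would either always appear or never appear, regardless of the order in which tests run.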

zasdfgbnm commented 1 year ago

Yeah, it does repro deterministically. I ran it multiple times, and I always get a 1.52588e-05 error.
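
For what it's worth, the reported error matches 2^-16 (1.52587890625e-05) to the printed precision, which is the kind of value a difference of a few float32 rounding steps would produce. Assuming the outputs are float32 (not stated in this thread), the sketch below expresses an absolute error as a multiple of the float32 spacing at a given magnitude. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between |reference| and the next larger float32 value (one ULP).
float ulpAt(float reference) {
  assert(std::isfinite(reference));
  float mag = std::fabs(reference);
  return std::nextafter(mag, std::numeric_limits<float>::infinity()) - mag;
}

// Absolute error expressed in units of the float32 spacing at `reference`.
double errorInUlps(double absErr, float reference) {
  return absErr / static_cast<double>(ulpAt(reference));
}
```

An error of 2^-16 is exactly one spacing step for float32 values in [128, 256) and 32 steps for values in [4, 8), so knowing the magnitude of the mismatching element would help tell a plain rounding difference from something larger.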