NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

printing real error messages with `parallel_compile` #1871

Open jjsjann123 opened 6 months ago

jjsjann123 commented 6 months ago

Currently, when parallel_compile is enabled (which is also our default behavior), FusionKernelRuntime hides all compilation error messages. See here: https://github.com/NVIDIA/Fuser/blob/8226e61a2a843e8e4ad8c6c2801b019f654828d1/csrc/kernel_cache.cpp#L1310 https://github.com/NVIDIA/Fuser/blob/8226e61a2a843e8e4ad8c6c2801b019f654828d1/csrc/kernel_cache.cpp#L1290

This makes initial debugging in our CI tricky, and the same goes for users trying to report an error with a repro, because nobody gets to look at the real error message.

@rdspring1 mentioned that the reason we masked the error messages was to hide false-positive / cryptic ones. Think about how one fusion segment failing to compile causes all downstream segments to fail automatically and throw lots of errors.

I think the concern is real. Any filtering we apply to the aggregated error message could lead to false positives/negatives. But filtered error messages are still better than nothing.

We should keep the existing warning about turning off parallel_compile to get an accurate error message. Meanwhile, we should also plumb through the aggregated error message and filter out the redundant entries, so the CI log is of more help in root-causing.
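To make the proposal concrete, here is a minimal Python sketch of "aggregate, then filter". It is not nvFuser code: `compile_all`, `dedupe`, `format_report`, and the `compile_fn` callback are hypothetical names, and the only filtering shown is collapsing identical messages, which is an assumption about what "redundant" would mean in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_all(segments, compile_fn, max_workers=4):
    """Run compile_fn on every segment id in parallel.

    Returns a list of (segment_id, message) for the segments that raised,
    in the order the segments were submitted. Nothing is swallowed.
    """
    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(seg_id, pool.submit(compile_fn, seg_id)) for seg_id in segments]
        for seg_id, fut in futures:
            exc = fut.exception()  # blocks until this task has finished
            if exc is not None:
                errors.append((seg_id, str(exc)))
    return errors


def dedupe(errors):
    """Group identical messages, keeping first-seen order of the messages."""
    grouped = {}
    for seg_id, msg in errors:
        grouped.setdefault(msg, []).append(seg_id)
    return grouped


def format_report(errors):
    """Render one line per distinct message, listing the segments that hit it."""
    return "\n".join(
        f"segments {seg_ids}: {msg}" for msg, seg_ids in dedupe(errors).items()
    )
```

With a report like this, the existing "rerun without parallel_compile" warning can still be printed first, followed by the condensed list.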

naoyam commented 6 months ago

I chatted with @xwang233 about this before: it would be nice if a failure could automatically trigger a second CI trial with parallel compilation disabled.

jjsjann123 commented 6 months ago

@xwang233 also mentioned to me the idea of tracing the dependencies of each fusion segment during parallel compilation. If we figure out the root segments that hit an exception, those are likely the issues that will show up during a rerun without parallel compile.

Sounds like an interesting approach that is worth some quick investigation.
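A rough Python sketch of the root-segment idea, under the assumption that we can get a map from each segment to the segments producing its inputs. `root_failures` and the `producers` map are hypothetical and do not correspond to existing nvFuser APIs. A failed segment is treated as a root if none of its transitive producers also failed.

```python
def root_failures(producers, failed):
    """Return the failed segments that have no failed ancestor.

    producers: dict mapping segment id -> iterable of upstream segment ids
               (hypothetical representation of the segment dependency graph).
    failed:    iterable of segment ids whose compilation raised.
    """
    failed = set(failed)

    def has_failed_ancestor(seg):
        # Iterative walk over all transitive producers of `seg`.
        stack = list(producers.get(seg, ()))
        seen = set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            if cur in failed:
                return True
            stack.extend(producers.get(cur, ()))
        return False

    return sorted(s for s in failed if not has_failed_ancestor(s))
```

Only the errors from these root segments would be surfaced; errors from segments downstream of a failure are the likely cascades.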

xwang233 commented 6 months ago

A rerun of the nvfuser python tests with NVFUSER_DISABLE=parallel_compile was added to CI, and the new stacktrace will be reported.

Keeping this issue open in case @jjsjann123 wants to implement the exception reporting in the default setup.
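For reference, the rerun-on-failure pattern described above could look roughly like the following. This is an illustrative sketch, not the actual CI script: `run_with_retry` is a hypothetical helper, and the only detail taken from this thread is the `NVFUSER_DISABLE=parallel_compile` setting.

```python
import os
import subprocess


def run_with_retry(cmd, env=None):
    """Run cmd once; on failure, rerun it with parallel compilation disabled.

    Returns (returncode, retried), where returncode comes from the last run.
    """
    base_env = dict(os.environ if env is None else env)
    first = subprocess.run(cmd, env=base_env)
    if first.returncode == 0:
        return first.returncode, False

    retry_env = dict(base_env)
    # Note: this overwrites any NVFUSER_DISABLE value that was already set.
    retry_env["NVFUSER_DISABLE"] = "parallel_compile"
    second = subprocess.run(cmd, env=retry_env)
    return second.returncode, True
```

The second run is only for diagnostics: its stack trace is what gets reported, while the first failure still marks the job as failed.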