ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

Device side enqueue causes a crash in bizarre circumstances/possible compiler bug #132

Open 20k opened 3 years ago

20k commented 3 years ago

Hi there! I recently bought a 6700 XT and have been running into a few issues with its OpenCL support. The main one is that device-side enqueues via enqueue_kernel cause a crash in some circumstances. Someone pointed me over here and said I should file a bug. If this isn't the appropriate place, I apologise! :)

Unfortunately, while producing a minimal repro, I discovered that the crash depends on surrounding unused code. I had to chop this example down from a much larger source, and chopping it down further is difficult.

https://pastebin.com/0rkny8g5

Built with mingw64's gcc, linking with -lopencl.

The output I get is this:

NAME __relauncher_generic_block_invoke_kernel
NAME get_geodesic_path
NAME relauncher_generic
Result 0
0x00007FFBCC2FCF69 (0x0000000000000006 0x000000CB517FF9D0 0x000002172B3C31A0 0x0000000000000000), clGetPipeInfo() + 0xD46B9 bytes(s)
0x00007FFBCC303C2D (0x000000CB517FFC00 0x0000000000000001 0x000002172A028D20 0x000002172A029AB0), clGetPipeInfo() + 0xDB37D bytes(s)
0x00007FFBCC303266 (0x000002172A5F6950 0x0000000000000000 0x000002172B3C31B8 0x0000000000000000), clGetPipeInfo() + 0xDA9B6 bytes(s)
0x00007FFBCC238EEE (0x000002172A5F6950 0x000002172B3C31A0 0x000002172A5F6950 0x000002172A5F6CC0), clGetPipeInfo() + 0x1063E bytes(s)
0x00007FFBCC238FE1 (0x000002172A4EDE00 0x0000000000000000 0x000002172A4EDE00 0x0000000000000000), clGetPipeInfo() + 0x10731 bytes(s)
0x00007FFBCC2294DA (0x000002172A4EDE00 0x0000000000000000 0x0000000000000000 0x0000000000000000), clGetPipeInfo() + 0xC2A bytes(s)
0x00007FFBCC248CD9 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), clGetPipeInfo() + 0x20429 bytes(s)
0x00007FFC0DAC7034 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), BaseThreadInitThunk() + 0x14 bytes(s)
0x00007FFC0F962651 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)

The specific crash line appears to be clFinish(cqueue);

Removing the device-side enqueue appears to fix the crash. Removing several other unused pieces of code (e.g. the other unused kernel, or the function marked as completely unused) also appears to fix the crash. Reworking some of the code in various miscellaneous ways also appears to fix the crash. Removing either of the first two build flags also seems to fix the crash.

The only kernel that is actually run there is relauncher_generic, which does essentially nothing other than enqueue an empty block.

Specs: Windows 10 Pro 21H1, 5800X, 6700 XT with driver 21.3.2, 16GB DDR4. It's a brand new PC on a brand new install of Windows, so there's not much else going on here. The code from the larger project this is derived from worked without issue on an R9 390 as of a few weeks ago, though unfortunately I no longer have that GPU to test with, as it has melted.

If you need any more info or anything else, please let me know!

vsytch commented 3 years ago

Thanks for the report, we've reproduced the issue internally.

vsytch commented 3 years ago

We've submitted the fix internally, but we missed backporting the change to the 21.10 driver (which was released 2 days ago). We'll update this issue once a public driver with this fix is available.

20k commented 3 years ago

Thanks very much for the update, it's nice to see that this has been reproduced and fixed so quickly!