Open jjennychen opened 4 months ago
Yep, this is an unfortunate difference in printf() between CUDA and OpenCL, and it is not trivial to fix. I don't think string format specifiers are required to trigger it: any output ordering between the threads/work-items (WIs) is possible. What happens is down to the OpenCL driver's printf implementation. Some drivers could flush at newline boundaries, while others just push chars to a shared "stdout ring buffer" (like PoCL does).
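To make the difference between the two driver strategies concrete, here is a small host-only C++ model. It is only a sketch under stated assumptions: `push_per_char` and `push_per_line` are hypothetical names, the round-robin schedule merely stands in for whatever interleaving the hardware happens to produce, and none of this is taken from PoCL or any other driver.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not driver code: each string in `msgs` is the full printf
// output of one work-item. A round-robin schedule stands in for an
// arbitrary hardware interleaving.

// Strategy 1: work-items push one character at a time into a shared buffer.
std::string push_per_char(const std::vector<std::string>& msgs) {
    std::string shared;
    std::size_t longest = 0;
    for (const auto& m : msgs) longest = std::max(longest, m.size());
    for (std::size_t i = 0; i < longest; ++i)
        for (const auto& m : msgs)
            if (i < m.size()) shared.push_back(m[i]);
    return shared;
}

// Strategy 2: work-items push whole newline-terminated chunks, so a line
// from one work-item is never split by another work-item's output.
std::string push_per_line(const std::vector<std::string>& msgs) {
    std::string shared;
    std::vector<std::size_t> pos(msgs.size(), 0);
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t w = 0; w < msgs.size(); ++w) {
            if (pos[w] >= msgs[w].size()) continue;
            const std::size_t nl = msgs[w].find('\n', pos[w]);
            const std::size_t end =
                (nl == std::string::npos) ? msgs[w].size() : nl + 1;
            shared.append(msgs[w], pos[w], end - pos[w]);
            pos[w] = end;
            progressed = true;
        }
    }
    return shared;
}
```

With the two messages `"AB\n"` and `"cd\n"`, the per-character model yields `"AcBd\n\n"`, while the per-line model yields `"AB\ncd\n"`. The line order can still vary between runs on real hardware, but each line stays intact.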
CUDA does the actual printing on the host by transferring the format string and the args, whereas OpenCL printf can perform it (more) on the device, which means the output is not guaranteed to be flushed at string boundaries. We thought about doing a similar implementation (borrowing one from AMD ROCm, for instance), but it needs a non-trivial amount of both compiler and runtime work to get it right and portable.
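As a rough illustration of the "transfer the format string and the args, then print on the host" approach, here is a host-only C++ sketch. Everything in it is an assumption made for illustration: `PrintRecord` and `RecordBuffer` are hypothetical names, the record is fixed to one string and one integer argument, and a mutex-protected vector stands in for the device-to-host buffer. It is not how CUDA, ROCm, or chipStar actually implement this.

```cpp
#include <cassert>
#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical record: one printf call captured as a format string plus
// its arguments (fixed here to one %s and one %d for simplicity).
struct PrintRecord {
    const char* fmt;
    const char* str_arg;
    int int_arg;
};

// Stand-in for the buffer shared between "device" producers and the host.
class RecordBuffer {
public:
    // "Device" side: store the call unformatted, as one atomic record.
    void push(const PrintRecord& r) {
        std::lock_guard<std::mutex> lock(mu_);
        records_.push_back(r);
    }

    // Host side: format each record in one go, so the text of a single
    // printf call can never be interleaved with another call's text.
    std::vector<std::string> drain() {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<std::string> out;
        for (const auto& r : records_) {
            char line[256];
            const int n =
                std::snprintf(line, sizeof line, r.fmt, r.str_arg, r.int_arg);
            assert(n >= 0);  // a negative value means a formatting error
            out.emplace_back(line);
        }
        records_.clear();
        return out;
    }

private:
    std::mutex mu_;
    std::vector<PrintRecord> records_;
};
```

In this model, pushing `{"%s: block %d\n", "kernel.cu", 3}` and then calling `drain()` gives back the single complete string `"kernel.cu: block 3\n"`. The hard part mentioned above, having the compiler lower device-side printf() calls into such records and having the runtime ship them back portably, is exactly what the sketch leaves out.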
A clean way to fix this would be to propose an OpenCL extension that forces printf() strings to be flushed at \n, which would make unsynchronized multi-work-item output more readable. Meanwhile, we are relying on OpenCL driver-specific behavior.
When using `printf` with string format specifiers (`%s`), the output of the specified strings appears to be unsynchronized. Below are the outputs from the chipStar assertion and the CUDA assertion when running the `assert-cuda` benchmark in HeCBench:

CUDA:

chipStar:
[Reproducer] Compile and run the following code in a `.cu` file:

printf.cu

The output of the program will be something similar to this: