printf with string format specifier is not synchronized

When using printf with string format specifiers (%s), the output of the specified strings appears to be unsynchronized. Below is the outputs from chipStar assertion and CUDA assertion when running assert-cuda benchmark in HeCBench:

CUDA:

main.cu:21: void testKernel(int): block: [1,0,0], thread: [28,0,0] Assertion `gid < N` failed.
main.cu:21: void testKernel(int): block: [1,0,0], thread: [29,0,0] Assertion `gid < N` failed.
main.cu:21: void testKernel(int): block: [1,0,0], thread: [30,0,0] Assertion `gid < N` failed.
main.cu:21: void testKernel(int): block: [1,0,0], thread: [31,0,0] Assertion `gid < N` failed.

chipStar:

mmmmaaaaiiiinnnn....ccccuuuu:21: :21: :21: :21: vvvvooooiiiidddd    tttteeeessssttttKKKKeeeerrrrnnnneeeellll((((iiiinnnntttt)))): Device-side assertion `: Device-side assertion `: Device-side assertion `: Device-side assertion `ggggiiiidddd    <<<<    NNNN' failed.
' failed.
' failed.
' failed.

[Reproducer] Compile and run the following code in a .cu file: printf.cu

#include <iostream>

__global__ void print() {
  const char* file = "testing.c";
  const char* function = "function";
  const char* message = "assertion message";
  unsigned int line = 3;
  printf("%s:%u: %s: Device-side assertion `%s' failed.\n", file, line, function, message);
}

int main(int argc, char** argv) {
  print<<<1, 3>>>();
}

The output of the program will be something similar to this:

testing.ctesting.ctesting.c:3: :3: :3: functionfunctionfunction: Device-side assertion `: Device-side assertion `: Device-side assertion `assertion messageassertion messageassertion message' failed.
' failed.
' failed.

Yep, this is an unfortunate difference in printf() between CUDA and OpenCL which is not trivial to fix. I don't think it requires format strings to get any possible output ordering between the threads/WIs. It's down to the OpenCL driver's printf implementation what happens. Some could flush at newline boundaries, some just push chars to a shared "stdout ring buffer" (like PoCL does).

CUDA does the actual printing on host by transferring the fmtstr and the args whereas OpenCL printf can perform it (more) on the device, which means it's not guaranteed to be flushed at string boundaries. We thought about doing a similar implementation (borrow one from AMD ROCm for instance) but it needs non-trivial amount of both compiler and runtime work to get it right and portable.

A clean way to fix this would be to propose an OpenCL extension that can be used to enforce printf() strings to get flushed at \n to make unsynchronized multi-work-item output more readable. Meanwhile we are relying on the OpenCL driver-specific behavior.

CHIP-SPV / chipStar

printf with string format specifier is not synchronized #883