KhronosGroup / OpenCL-CTS

The OpenCL Conformance Tests
Apache License 2.0
185 stars 199 forks source link

question about what guarnatees the correctness of the c11_atomics atomic_flag test #2052

Open karolherbst opened 2 months ago

karolherbst commented 2 months ago

This is the kernel being executed in the first atomic_flag test, but I'm actually wondering what in the OpenCL spec guarnatees that this executes as expected by the CTS.

More specifically, what prevents a thread to race with a thread from a different subgroup on destMemory[cnt]. One thread could enter the iteration for e.g. cnt == 5, while another thread just reached atomic_flag_clear_explicit(&destMemory[cnt], for the same cnt, which means that two threads will enter the critical section, just the second one won't do anything.

I'm seeing this behavior with rusticl on zink on radv and at the moment it's not clear to me if that's my or the CTS' bug. Removing the atomic_flag_clear_explicit(&destMemory[cnt], makes only one thread execute the "criticial section" for each value of cnt, but that fails the test later.

So what's actually guaranteeing the correctness of this test here? Or would this test need to be rewritten? I'm also mildly wondering what clvk did in order to pass this test or if that was never a problem in the first place. I can probably come up with a fix to guarantee that behavior, I'm just wondering if there is some undefined behavior at play here.

Program source:
__kernel void test_atomic_kernel(uint threadCount, uint numDestItems, __global int *finalDest, __global int *oldValues, volatile __local atomic_flag *destMemory)
{
  uint  tid = get_global_id(0);

              // initialize atomics not reachable from host (first thread
              // is doing this, other threads are waiting on barrier)
              if(get_local_id(0) == 0)
                for(uint dstItemIdx = 0; dstItemIdx < numDestItems; dstItemIdx++)
                {
                  if(finalDest[dstItemIdx])
                    atomic_flag_test_and_set_explicit(destMemory+dstItemIdx,
                                                      memory_order_relaxed,
                                                      memory_scope_work_group);
                  else
                    atomic_flag_clear_explicit(destMemory+dstItemIdx,
                                               memory_order_relaxed,
                                               memory_scope_work_group);    }
  barrier(CLK_LOCAL_MEM_FENCE);

  uint cnt, stop = 0;
  for(cnt = 0; !stop && cnt < threadCount; cnt++) // each thread must find critical section where it is the first visitor
  {
    bool set = atomic_flag_test_and_set_explicit(&destMemory[cnt], memory_order_relaxed, memory_scope_work_group);
    atomic_work_item_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE, memory_order_acquire,memory_scope_work_group);
    if (!set)
    {
      uint csIndex = get_enqueued_local_size(0)*get_group_id(0)+cnt;
      // verify that thread is the first visitor
      if(oldValues[csIndex] == 1000000000)
      {
        oldValues[csIndex] = tid; // set the winner id for this critical section
        stop = 1;
      }
      atomic_work_item_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE, memory_order_release,memory_scope_work_group);
      atomic_flag_clear_explicit(&destMemory[cnt], memory_order_relaxed, memory_scope_work_group);
    }
  }

  // Copy final values to host reachable buffer
  barrier(CLK_LOCAL_MEM_FENCE);
  if(get_local_id(0) == 0) // first thread in workgroup
    for(uint dstItemIdx = 0; dstItemIdx < numDestItems; dstItemIdx++)

                finalDest[dstItemIdx] =
                    atomic_flag_test_and_set_explicit(destMemory+dstItemIdx,
                                                      memory_order_relaxed,
                                                      memory_scope_work_group);}
karolherbst commented 2 months ago

I dug a bit deeper and it seems I have threads racing on the if(oldValues[csIndex] == 1000000000). Should that be an atomic operation instead?

karolherbst commented 2 months ago

mhh, no, that would mean that something with atomic_flag_test_and_set_explicit is going wrong, but that's kinda of weird.. maybe there is a coherency issue somewhere, anyway, it's probably not a CTS bug then and I initially thought something else is going on.