codeplaysoftware / cutlass-fork

CUDA Templates for Linear Algebra Subroutines
Other
8 stars 20 forks source link

atomic add #123

Closed jiyang1011 closed 2 months ago

jiyang1011 commented 3 months ago

This patch is for atomic add feature which is a necessary component for split-k and stream-k algorithm.

  1. limitation: the data type must be float / double / int [https://super-funicular-b7914353.pages.github.io/public/atomic-datatypes.html#data-types-in-atomic-operations]
  2. the memory space is global
  3. Its copy_traits is same with UniversalCopy
aacostadiaz commented 3 months ago

Thanks for the PR. For the Split-k/StreamK, we need to have the flag memory and read/write from/to flag memory, and not loading/storing the actual data, similar to what Nvidia did. I think it would be better to use SYCLCompat Atomic_compare_and_exchange_strong similar to atomicCAS enabled for Nvida in the link I sent above Would you be able to look into the link and implement it like that? This is the function that the barrier manager in Nvidia implementation for Split/StreamK uses to update the flag memory.

For reference, there is a SYCL implementation of atomicCAS in gpu_generic.h. It just calls the syclcompat function @mehdi-goli mentioned.

rolandschulz commented 3 months ago

just to mention that we dont need atomic add for Splitk/Streamk

What's the plan to do the reduction for each?

Why would we implement barrier with CAS and not use the existing barrier?

mehdi-goli commented 3 months ago

just to mention that we dont need atomic add for Splitk/Streamk

What's the plan to do the reduction for each?

Why would we implement barrier with CAS and not use the existing barrier?

The Flag memory is the integer memory used to sync across the workgroup, hence we can avoid having separate kernel for reduction. With atomicCAS, you only update the boolean value for informing other workgroup that I am done with writing intermediate output. This usually get updated by workitem0 when all the threads within the workgroup to perform gemm once local barrier condition satisfied. There is one memory flag per tile. For example, for the 2splitk tile algorithm (sharing the last 2 wave) across 4 workgroup based on the problem size provided in the paper (figure 3c) you need only one boolean flag memory the streamK tile. So the total would be 5 (NumStreamKTiles*NumSplits= 5x1).

For the reduction part, it is partial reduction and only the consumer workgroup need to check the flag, if the data is ready it can read it from global, sum it with its output which is currently in register and write it back to the final memory.

Using atomic add for reducing the data help us to reduce the cost of temporary global memory usage. In this case we can use atomic add to reduce the global memory usage, but we don't need to implement it in the copy function. Nvidia put the implementation of atomic add here and it uses for reduction inside the algorithm If we go for copy approach we will create a deviation from Nvidia approaches which we should properly document it and say why we need to put it in the copy.

jiyang1011 commented 3 months ago

Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.

Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that atomicCAS is under cutlass 2.0, whether a customized CollectiveMMA should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?

jiyang1011 commented 3 months ago

Thanks for the PR. For the Split-k/StreamK, we need to have the flag memory and read/write from/to flag memory, and not loading/storing the actual data, similar to what Nvidia did. I think it would be better to use SYCLCompat Atomic_compare_and_exchange_strong similar to atomicCAS enabled for Nvida in the link I sent above Would you be able to look into the link and implement it like that? This is the function that the barrier manager in Nvidia implementation for Split/StreamK uses to update the flag memory.

For reference, there is a SYCL implementation of atomicCAS in gpu_generic.h. It just calls the syclcompat function @mehdi-goli mentioned. (https://github.com/codeplaysoftware/cutlass-fork/blob/64acac8f1043375d11dfbe93dca04a841f68e773/include/cutlass/gpu_generics.h#L327). seems missingreturn ?

jiyang1011 commented 3 months ago

Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.

Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that atomicCAS is under cutlass 2.0, whether a customized CollectiveMMA should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?

tile_store this store action used atomic add for reduction.

rolandschulz commented 3 months ago

I don't think it is missing a return. The return is outside the #endif at https://github.com/codeplaysoftware/cutlass-fork/blob/64acac8f1043375d11dfbe93dca04a841f68e773/include/cutlass/gpu_generics.h#L329 .

But I do think the return is incorrect. atomicCAS() should return the old value. Whereas it currently always returns 0. I don't think it is possible to correctly implement atomicCAS with atomic_ref (sycl or std) because it doesn't have an operation which compares and returns the old value atomicly. All usage (https://github.com/search?q=repo%3ANVIDIA%2Fcutlass+atomiccas&type=code) of the return value just compare it to desired. I suggest we replace atomicCAS with atomic_compare_exchange which directly maps to std/sycl. And for CUDA it maps to return atomicCAS(address, expected, desired)==expected.

aacostadiaz commented 3 months ago

Yes, it is missing the return. Good catch! It should be something like:

CUTLASS_DEVICE int atomicCAS(int *address, int compare, int val) {
#if defined(__SYCL_DEVICE_ONLY__)
  return syclcompat::atomic_compare_exchange_strong(address, compare, val);
#else
  CUTLASS_NOT_IMPLEMENTED()
  return 0;
#endif
}

syclcompat::atomic_compare_exchange_strong returns the value in address before the call.

aacostadiaz commented 3 months ago

Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.

Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that atomicCAS is under cutlass 2.0, whether a customized CollectiveMMA should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?

atomicCAS is a Cuda function used in CUTLASS. It is used in both CUTLASS 2 and 3.

Regarding atomic_add, could you please check if the CUTLASS implementation of struct atomic_add in functional.h covers the use case you are trying to address with this PR? Nvidia uses that struct as a base component to implement other functionalities (such as reduction). If possible, we would like to follow the same path to avoid reimplementing everything for Intel.

muhammad-tanvir-1211 commented 3 months ago

Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.

Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that atomicCAS is under cutlass 2.0, whether a customized CollectiveMMA should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?

Hi @jiyang1011, I believe we can reuse the existing the CollectiveMma atom (with minor modifications) for splitK/streamK GEMM as well. We would definitely need to implement a specialized pipeline for splitK/streamK similar to what NVIDIA does (e.g. sm90 streamK pipeline)

rolandschulz commented 3 months ago

syclcompat::atomic_compare_exchange_strong returns the value in address before the call.

It doesn't: https://github.com/intel/llvm/issues/15155 . Even if the bug is fixed, using this API isn't very efficient. It requires and extra unnecessary load and comparison. All usage of the call just care about whether old==expected. Which is the boolean return value of (std/sycl::)atomic_ref::atomic_compare_exchange. But because the API returns the old value rather than the boolean, the calling code has to do an extra comparison and the implementation of syclcompat::atomic_compare_exchange_strong needs to do an extra load and extra comparison.

aacostadiaz commented 3 months ago

syclcompat::atomic_compare_exchange_strong returns the value in address before the call.

It doesn't: intel/llvm#15155 . Even if the bug is fixed, using this API isn't very efficient. It requires and extra unnecessary load and comparison. All usage of the call just care about whether old==expected. Which is the boolean return value of (std/sycl::)atomic_ref::atomic_compare_exchange. But because the API returns the old value rather than the boolean, the calling code has to do an extra comparison and the implementation of syclcompat::atomic_compare_exchange_strong needs to do an extra load and extra comparison.

I think there is no bug. atm.compare_exchange_strong(expected, ...) will handle the comparison and assign the value in address to expected if needed. So, after calling atm.compare_exchange_strong, expected will contain either the expected value or the old value, depending on the comparison.

https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.atomics.functions image

Regarding the API being inefficient, I think you are right. We can look into bool syclcompat::compare_exchange_strong or just use the SYCL function directly.