Closed jiyang1011 closed 2 months ago
Thanks for the PR. For the Split-k/StreamK, we need to have the flag memory and read/write from/to flag memory, and not loading/storing the actual data, similar to what Nvidia did. I think it would be better to use SYCLCompat
Atomic_compare_and_exchange_strong
similar toatomicCAS
enabled for Nvida in the link I sent above Would you be able to look into the link and implement it like that? This is the function that the barrier manager in Nvidia implementation for Split/StreamK uses to update the flag memory.
For reference, there is a SYCL implementation of atomicCAS
in gpu_generic.h. It just calls the syclcompat function @mehdi-goli mentioned.
just to mention that we dont need atomic add for Splitk/Streamk
What's the plan to do the reduction for each?
Why would we implement barrier with CAS and not use the existing barrier?
just to mention that we dont need atomic add for Splitk/Streamk
What's the plan to do the reduction for each?
Why would we implement barrier with CAS and not use the existing barrier?
The Flag memory is the integer memory used to sync across the workgroup, hence we can avoid having separate kernel for reduction. With atomicCAS, you only update the boolean value for informing other workgroup that I am done with writing intermediate output. This usually get updated by workitem0 when all the threads within the workgroup to perform gemm once local barrier condition satisfied. There is one memory flag per tile. For example, for the 2splitk tile algorithm (sharing the last 2 wave) across 4 workgroup based on the problem size provided in the paper (figure 3c) you need only one boolean flag memory the streamK tile. So the total would be 5 (NumStreamKTiles*NumSplits= 5x1).
For the reduction part, it is partial reduction and only the consumer workgroup need to check the flag, if the data is ready it can read it from global, sum it with its output which is currently in register and write it back to the final memory.
Using atomic add for reducing the data help us to reduce the cost of temporary global memory usage. In this case we can use atomic add to reduce the global memory usage, but we don't need to implement it in the copy function. Nvidia put the implementation of atomic add here and it uses for reduction inside the algorithm If we go for copy approach we will create a deviation from Nvidia approaches which we should properly document it and say why we need to put it in the copy.
Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.
Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that atomicCAS
is under cutlass 2.0, whether a customized CollectiveMMA
should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?
Thanks for the PR. For the Split-k/StreamK, we need to have the flag memory and read/write from/to flag memory, and not loading/storing the actual data, similar to what Nvidia did. I think it would be better to use SYCLCompat
Atomic_compare_and_exchange_strong
similar toatomicCAS
enabled for Nvida in the link I sent above Would you be able to look into the link and implement it like that? This is the function that the barrier manager in Nvidia implementation for Split/StreamK uses to update the flag memory.For reference, there is a SYCL implementation of
atomicCAS
in gpu_generic.h. It just calls the syclcompat function @mehdi-goli mentioned. (https://github.com/codeplaysoftware/cutlass-fork/blob/64acac8f1043375d11dfbe93dca04a841f68e773/include/cutlass/gpu_generics.h#L327). seems missingreturn
?
Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.
Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that
atomicCAS
is under cutlass 2.0, whether a customizedCollectiveMMA
should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?
tile_store this store action used atomic add for reduction.
I don't think it is missing a return
. The return is outside the #endif
at https://github.com/codeplaysoftware/cutlass-fork/blob/64acac8f1043375d11dfbe93dca04a841f68e773/include/cutlass/gpu_generics.h#L329 .
But I do think the return is incorrect. atomicCAS() should return the old value. Whereas it currently always returns 0. I don't think it is possible to correctly implement atomicCAS
with atomic_ref
(sycl or std) because it doesn't have an operation which compares and returns the old value atomicly. All usage (https://github.com/search?q=repo%3ANVIDIA%2Fcutlass+atomiccas&type=code) of the return value just compare it to desired. I suggest we replace atomicCAS
with atomic_compare_exchange
which directly maps to std/sycl. And for CUDA it maps to return atomicCAS(address, expected, desired)==expected
.
Yes, it is missing the return. Good catch! It should be something like:
CUTLASS_DEVICE int atomicCAS(int *address, int compare, int val) {
#if defined(__SYCL_DEVICE_ONLY__)
return syclcompat::atomic_compare_exchange_strong(address, compare, val);
#else
CUTLASS_NOT_IMPLEMENTED()
return 0;
#endif
}
syclcompat::atomic_compare_exchange_strong
returns the value in address
before the call.
Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.
Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that
atomicCAS
is under cutlass 2.0, whether a customizedCollectiveMMA
should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?
atomicCAS
is a Cuda function used in CUTLASS. It is used in both CUTLASS 2 and 3.
Regarding atomic_add
, could you please check if the CUTLASS implementation of struct atomic_add
in functional.h covers the use case you are trying to address with this PR? Nvidia uses that struct as a base component to implement other functionalities (such as reduction). If possible, we would like to follow the same path to avoid reimplementing everything for Intel.
Also, since the title of your PR says atomic add (just to mention that we dont need atomic add for Splitk/Streamk), Nvidia implemented Atomic add in the functional.h file. Incase we need it for any other operations, it is better to follow that pattern and put it there.
Hi mehdi, I checked the code of Xetla. I think you are right, atomic add is used for sync signal. But I have a question that
atomicCAS
is under cutlass 2.0, whether a customizedCollectiveMMA
should be re-written for stream-k algorithm and some necessary components such as memory fence, signal sync as utils?
Hi @jiyang1011, I believe we can reuse the existing the CollectiveMma
atom (with minor modifications) for splitK/streamK GEMM as well. We would definitely need to implement a specialized pipeline for splitK/streamK similar to what NVIDIA does (e.g. sm90 streamK pipeline)
syclcompat::atomic_compare_exchange_strong returns the value in address before the call.
It doesn't: https://github.com/intel/llvm/issues/15155 . Even if the bug is fixed, using this API isn't very efficient. It requires and extra unnecessary load and comparison. All usage of the call just care about whether old==expected
. Which is the boolean return value of (std/sycl::)atomic_ref::atomic_compare_exchange
. But because the API returns the old value rather than the boolean, the calling code has to do an extra comparison and the implementation of syclcompat::atomic_compare_exchange_strong
needs to do an extra load and extra comparison.
syclcompat::atomic_compare_exchange_strong returns the value in address before the call.
It doesn't: intel/llvm#15155 . Even if the bug is fixed, using this API isn't very efficient. It requires and extra unnecessary load and comparison. All usage of the call just care about whether
old==expected
. Which is the boolean return value of (std/sycl::)atomic_ref::atomic_compare_exchange
. But because the API returns the old value rather than the boolean, the calling code has to do an extra comparison and the implementation ofsyclcompat::atomic_compare_exchange_strong
needs to do an extra load and extra comparison.
I think there is no bug. atm.compare_exchange_strong(expected, ...)
will handle the comparison and assign the value in address
to expected
if needed. So, after calling atm.compare_exchange_strong
, expected
will contain either the expected value or the old value, depending on the comparison.
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.atomics.functions
Regarding the API being inefficient, I think you are right. We can look into bool syclcompat::compare_exchange_strong
or just use the SYCL function directly.
This patch is for atomic add feature which is a necessary component for split-k and stream-k algorithm.
copy_traits
is same withUniversalCopy