NVIDIA / cuCollections

Apache License 2.0

[ENHANCEMENT]: Get rid of custom atomic operations once CCCL 2.4 is ready #469

Closed: PointKernel closed this issue 4 months ago

PointKernel commented 7 months ago

Is your feature request related to a problem? Please describe.

The current cuco implementation uses custom atomic functions, e.g. https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L904-L936, due to a performance regression with cuda::atomic_ref (https://github.com/NVIDIA/cccl/issues/1008). With the fix merged into the CCCL main branch, we can get rid of those custom functions once CCCL 2.4 is fetched by rapids-cmake.

Describe the solution you'd like

Replace the custom atomic functions at

- https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L905
- https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L947
- https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/hyperloglog/hyperloglog_ref.cuh#L525

with the corresponding cuda::atomic_ref operations.
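For illustration, a minimal sketch of the kind of change being proposed, assuming the custom helper is essentially a hand-rolled CAS wrapper; the function name and slot type below are hypothetical and not taken from the cuco sources:

```cpp
#include <cuda/atomic>

// Hypothetical stand-in for one of cuco's custom atomic helpers.
// Before: a hand-rolled wrapper around atomicCAS, kept only to dodge the
// cuda::atomic_ref performance regression tracked in NVIDIA/cccl#1008.
// After (sketched here): use cuda::atomic_ref directly once CCCL 2.4 lands.
template <typename T>
__device__ bool try_claim_slot(T* slot, T expected, T desired)
{
  cuda::atomic_ref<T, cuda::thread_scope_device> ref{*slot};
  return ref.compare_exchange_strong(
    expected, desired, cuda::std::memory_order_relaxed);
}
```

The hyperloglog_ref.cuh call site would presumably map to a max-style update rather than a CAS, but the shape of the change is the same.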

bdice commented 5 months ago

I think #502 unblocks this issue.

PointKernel commented 5 months ago

> I think #502 unblocks this issue.

Right. But #502 is blocked by some CCCL issues.

mfbalin commented 5 months ago

#502 is merged now. I had to go through a lot of trouble to use cuco in my library: I had to make a separate target and constrain its CUDA architecture list to Volta+ only, while the rest of the sources are compiled for all given archs.

If this moves forward, I can get rid of that workaround in my work.

PointKernel commented 5 months ago

> #502 is merged now. I had to go through a lot of trouble to use cuco in my library: I had to make a separate target and constrain its CUDA architecture list to Volta+ only, while the rest of the sources are compiled for all given archs.
>
> If this moves forward, I can get rid of that workaround in my work.

This work is on my radar but I'm not following the issue you brought up here. The current cuco should work fine on Pascal or newer GPUs.

mfbalin commented 5 months ago

> #502 is merged now. I had to go through a lot of trouble to use cuco in my library: I had to make a separate target and constrain its CUDA architecture list to Volta+ only, while the rest of the sources are compiled for all given archs.
>
> If this moves forward, I can get rid of that workaround in my work.

> This work is on my radar but I'm not following the issue you brought up here. The current cuco should work fine on Pascal or newer GPUs.

We compile our library for older GPUs as well, such as sm_52; that is what I meant by filtering out these older architectures and creating a separate target for the code that uses cuco: https://github.com/dmlc/dgl/blob/713ffb5714dbd0bd92b26d6412f5040092179ee3/graphbolt/CMakeLists.txt#L61-L69

Thank you for your work!

mfbalin commented 5 months ago

Or can I already start using cuco when compiling for sm_52, sm_60, sm_70, sm_80, etc., in particular including sm_52, which does not have the required support? Previously, I was getting a compilation error because the cuda/atomic header was not importable when the minimum architecture is older than Pascal.

Note: we ensure that the cuco code path is never taken when running on an old GPU; the problem is purely with the compilation flow.
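For context, a minimal sketch of the kind of runtime guard meant here, assuming dispatch on the device's compute capability; the function name and the Volta threshold are illustrative, not taken from the project:

```cpp
#include <cuda_runtime.h>

// Hypothetical runtime check: only take the cuco-based code path on GPUs
// that the separately compiled (Volta+) target actually supports.
bool can_use_cuco_path(int device)
{
  int major = 0;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  return major >= 7;  // illustrative threshold: sm_70 (Volta) and newer
}
```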

PointKernel commented 5 months ago

> Previously, I was getting a compilation error because the cuda/atomic header was not importable when the minimum architecture is older than Pascal.

Yeah, I remember this, but I'm not sure whether the new cuda/atomic with the CCCL upgrade can solve the issue. Will get back to you.

PointKernel commented 4 months ago

@mfbalin FYI, the corresponding PR has been merged, but I think you still need a separate target for old GPUs.

mfbalin commented 4 months ago

> @mfbalin FYI, the corresponding PR has been merged, but I think you still need a separate target for old GPUs.

Is it because https://github.com/NVIDIA/cccl/issues/1083 is still open?

PointKernel commented 4 months ago

> @mfbalin FYI, the corresponding PR has been merged, but I think you still need a separate target for old GPUs.

> Is it because NVIDIA/cccl#1083 is still open?

Yes