NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

`DeviceHistogram` doesn't support `CounterT=int64_t` due to missing `atomicAdd` overload #909

Open Nyrio opened 1 year ago

Nyrio commented 1 year ago

I am trying to use `cub::DeviceHistogram::HistogramEven` with `CounterT=int64_t` and I get the following error:

```
cub/agent/agent_histogram.cuh(370): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (int64_t *, int64_t)
          detected during:
            instantiation of "void cub::CUB_101702_800_NS::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::AccumulatePixels(cub::CUB_101702_800_NS::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::SampleT (*)[NUM_CHANNELS], __nv_bool *, CounterT **, cub::CUB_101702_800_NS::Int2Type<1>) [with AgentHistogramPolicyT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=0, NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=const uint32_t *, CounterT=int64_t, PrivatizedDecodeOpT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::ScaleTransform, OutputDecodeOpT=cub::CUB_101702_800_NS::DispatchHistogram<1, 1, const uint32_t *, int64_t, int64_t, int>::PassThruTransform, OffsetT=int, PTX_ARCH=800]"
```
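For context, a minimal reproducer along these lines triggers the error (buffer names, level count, and sample range are illustrative, not from the original report):

```cuda
#include <cub/cub.cuh>
#include <cstdint>

void build_histogram(const uint32_t* d_samples, int num_samples,
                     int64_t* d_histogram, int num_levels)
{
  // CounterT is deduced from d_histogram as int64_t. Compilation fails
  // because agent_histogram.cuh ends up calling atomicAdd(int64_t*, int64_t),
  // for which no overload exists.
  void* d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram, num_levels,
                                      0u, 1024u, num_samples);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram, num_levels,
                                      0u, 1024u, num_samples);
  cudaFree(d_temp_storage);
}
```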
gevtushenko commented 1 year ago

Hello @Nyrio and thank you for reporting the issue! I can reproduce it on my side.

We require `OffsetT` to be a signed integer type, but `CounterT` only needs to be an integer type. I don't think we currently test anything except `CounterT=int`, so after fixing the algorithm for `int64_t` we might need to add tests for other integer types as well.

As a workaround, you should be able to use `unsigned long long int` instead of `int64_t`.
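The workaround amounts to changing only the counter buffer's type, since `atomicAdd(unsigned long long int*, unsigned long long int)` is one of the overloads CUDA does provide (sketch with illustrative names):

```cuda
#include <cub/cub.cuh>

void build_histogram(const uint32_t* d_samples, int num_samples,
                     unsigned long long int* d_histogram, int num_levels)
{
  // CounterT is now deduced as unsigned long long int, which matches an
  // existing device atomicAdd overload, so this compiles.
  void* d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram, num_levels,
                                      0u, 1024u, num_samples);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram, num_levels,
                                      0u, 1024u, num_samples);
  cudaFree(d_temp_storage);
}
```

For counts that never exceed the representable range, the resulting bits can be reinterpreted as `int64_t` afterwards.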

Nyrio commented 1 year ago

Regarding this limitation of `atomicAdd`: raft has a header with generic overloads. For `long long int` it simply `reinterpret_cast`s to `unsigned long long int`, since the underlying binary operation is the same in two's-complement arithmetic.
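Such an overload can be sketched as follows (a hypothetical helper in the spirit of raft's header, not its exact code):

```cuda
// Signed 64-bit atomicAdd in terms of the unsigned overload CUDA provides.
// Two's-complement addition is bit-identical for signed and unsigned
// operands, so reinterpreting the pointer and value is well-defined here.
__device__ inline long long int atomicAdd(long long int* address,
                                          long long int val)
{
  return static_cast<long long int>(
      atomicAdd(reinterpret_cast<unsigned long long int*>(address),
                static_cast<unsigned long long int>(val)));
}
```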

But as I understand it, the more modern approach would be to use `cuda::atomic_ref`.
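With `cuda::atomic_ref` from libcu++, no per-type `atomicAdd` overload is needed at all; a device-side sketch (function name is illustrative):

```cuda
#include <cuda/atomic>
#include <cstdint>

__device__ void increment_bin(int64_t* bin)
{
  // atomic_ref wraps an existing object and works for any integer type
  // with lock-free device atomics, including signed 64-bit counters.
  cuda::atomic_ref<int64_t, cuda::thread_scope_device> ref(*bin);
  ref.fetch_add(1, cuda::memory_order_relaxed);
}
```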