Closed mehmetyusufoglu closed 6 months ago
Hi @mehmetyusufoglu, thanks for looking into this.
For the CUDA backend, I think you can leave the warp size hard coded to 32, since all CUDA GPUs have that warp size.
For the HIP backend, you can use hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device)
.
Hi @mehmetyusufoglu, thanks for looking into this.
For the CUDA backend, I think you can leave the warp size hard coded to 32, since all CUDA GPUs have that warp size.
For the HIP backend, you can use
hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device)
.
Thank you. Sure, I am fixing.
This compilation error happens only at clang-14_cuda-11.2_debug_analysis compilation setup. I could not get the reason.
This compilation error happens only at clang-14_cuda-11.2_debug_analysis compilation setup. I could not get the reason.
When you are compiling for CUDA the other backends are not enabled, therefore backend-specific objects (e.g. ApiHipRt
) are not defined (the Tag
s are the only exception).
That's why the specialization for CUDA was wrapped in an #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
.
The correct way of writing getPreferredWarpSize()
should be something like:
ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::deviceGetAttribute(
&warpSize,
TApi::deviceAttributeWarpSize,
dev.getNativeHandle()));
with the generic TApi
instead of if
/else
. TApi
will then be specialized to either ApiCudaRt
or ApiHipRt
depending on the target you are compiling for.
However, as Andrea suggested, you should keep the specialization for CUDA with the 32 hard-coded, since it's the fastest way of obtaining the warp size.
Thanks, i am fixing.
6 Mart 2024 Çarşamba tarihinde Aurora Perego @.***> yazdı:
This compilation error happens only at clang-14_cuda-11.2_debug_analysis compilation setup. I could not get the reason.
When you are compiling for CUDA the other backends are not enabled, therefore backend-specific objects (e.g. ApiHipRt) are not defined (the Tags are the only exception). That's why the specialization for CUDA was wrapped in an #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED. The correct way of writing getPreferredWarpSize() should be something like:
ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::deviceGetAttribute( &warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle()));
with the generic TApi instead of if/else. TApi will then be specialized to either ApiCudaRt or ApiHipRt depending on the target you are compiling for.
However, as Andrea suggested, you should keep the specialization for CUDA with the 32 hard-coded, since it's the fastest way of obtaining the warp size.
— Reply to this email directly, view it on GitHub https://github.com/alpaka-group/alpaka/pull/2246#issuecomment-1981893075, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUYXIJLMMHOEN2ISSL6YMMLYW6H4ZAVCNFSM6AAAAABEICDQH2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBRHA4TGMBXGU . You are receiving this because you were mentioned.Message ID: @.***>
Why do you use a different approach for the CUDA and HIP cases ?
Either define them both in include/alpaka/dev/DevUniformCudaHipRt.hpp
, or define each of them in include/alpaka/dev/DevCudaRt.hpp
and include/alpaka/dev/DevHipRt.hpp
.
For example, the minimal changes would have been to change just
typename TApi::DeviceProp_t devProp;
ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::getDeviceProperties(&devProp, dev.getNativeHandle()));
to
int warpSize = 0;
ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(
TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle()));
return static_cast<std::size_t>(warpSize);
TApi::deviceGetAttribute
Thanks. Let me finish, I don't want to return 32 from the deviceGetAttribute of Cuda, i will specialize GetPreferredWarpSize only. As you said in 2 different files.
For example, the minimal changes would have been to change just
typename TApi::DeviceProp_t devProp; ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::getDeviceProperties(&devProp, dev.getNativeHandle()));
to
int warpSize = 0; ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK( TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle())); return static_cast<std::size_t>(warpSize);
@fwyzard Do you think changing deviceGetAttibute is better
int warpSize = 0; ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK( TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle())); return static_cast<std::size_t>(warpSize);
@fwyzard What is your opinion, in my opinion, changing deviceGetAttribute
function of ApiCudaTr
if the argument (the attribute) is deviceAttributeWarpSize
, is not a clean solution. In the current solution I proposed; there is no if-else or nested #ifs anymore. Your suggestion needs less changes for sure.
@fwyzard Do you think changing deviceGetAttibute is better
no, TApi::deviceGetAttibute
is a wrapper around the underlying API function, and it should keep the same behaviour and interface.
However, we don't need to call TApi::deviceGetAttibute
for the CUDA architecture, because we know that the warp size is always 32.
There is one potential problem with this approach: a file that includes alpaka/dev/DevUniformCudaHipRt.hpp
but does not include explicitly alpaka/dev/DevHipRt.hpp
or alpaka/dev/DevCudaRt.hpp
will miss the getPreferredWarpSize()
specialisations.
Basically this change makes DevUniformCudaHipRt.hpp
an internal implementation detail, that other code shuld not use directly.
@psychocoderHPC do you think this is a problem ?
There is one potential problem with this approach: a file that includes
alpaka/dev/DevUniformCudaHipRt.hpp
but does not include explicitlyalpaka/dev/DevHipRt.hpp
oralpaka/dev/DevCudaRt.hpp
will miss thegetPreferredWarpSize()
specialisations.Basically this change makes
DevUniformCudaHipRt.hpp
an internal implementation detail, that other code shuld not use directly.@psychocoderHPC do you think this is a problem ?
Ok, I selected the shorter solution suggested by Andrea, @fwyzard . Specialisations in 2 different files is cleaner but needed many changes.
It was noticed that the use of the
alpaka::getWarpSizes(device)
function incurs a noticeable overhead. (The Issue #2192 )Solution:
Cuda: The fastest option is returning 32, it is not changed. But specialisation is taken to the DevCudaRt.hpp file, which is much cleaner than using nested preprocessor statements.
Hip: Rather than getting the whole device properties and selecting the attribute from that struct; getting directly the attribute seems a faster solution.
hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device)
(credits to @fwyzard )Notes: deviceAttributeWarpSize is added to ApiCudaRt.hpp but is not used.