alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Fix slow getWarpSize problem #2246

Closed mehmetyusufoglu closed 6 months ago

mehmetyusufoglu commented 7 months ago

It was noticed that the alpaka::getWarpSizes(device) function incurs a noticeable overhead (see issue #2192).

Solution:

Note: deviceAttributeWarpSize is added to ApiCudaRt.hpp but is not used.

fwyzard commented 7 months ago

Hi @mehmetyusufoglu, thanks for looking into this.

For the CUDA backend, I think you can leave the warp size hard coded to 32, since all CUDA GPUs have that warp size.

For the HIP backend, you can use hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device).
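
For reference, a minimal standalone sketch of that query (plain HIP runtime code, not alpaka code, and not part of this PR) could look like this:

#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    int device = 0; // first HIP device, for illustration
    int size = 0;
    // hipDeviceAttributeWarpSize reports the warp (wavefront) size of the device,
    // e.g. 32 or 64 depending on the AMD architecture.
    if(hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device) == hipSuccess)
        std::printf("warp size: %d\n", size);
    return 0;
}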

mehmetyusufoglu commented 7 months ago

Hi @mehmetyusufoglu, thanks for looking into this.

For the CUDA backend, I think you can leave the warp size hard coded to 32, since all CUDA GPUs have that warp size.

For the HIP backend, you can use hipDeviceGetAttribute(&size, hipDeviceAttributeWarpSize, device).

Thank you. Sure, I am fixing it.

mehmetyusufoglu commented 7 months ago

This compilation error happens only with the clang-14_cuda-11.2_debug_analysis setup. I could not figure out the reason.

[Screenshot from 2024-03-06 14-33-09: compilation error output]

AuroraPerego commented 7 months ago

This compilation error happens only with the clang-14_cuda-11.2_debug_analysis setup. I could not figure out the reason.

When you are compiling for CUDA, the other backends are not enabled, therefore backend-specific objects (e.g. ApiHipRt) are not defined (the Tags are the only exception). That's why the specialization for CUDA was wrapped in an #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED. The correct way of writing getPreferredWarpSize() should be something like:

ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::deviceGetAttribute(
    &warpSize,
    TApi::deviceAttributeWarpSize,
    dev.getNativeHandle()));

with the generic TApi instead of if/else. TApi will then be specialized to either ApiCudaRt or ApiHipRt depending on the target you are compiling for.

However, as Andrea suggested, you should keep the specialization for CUDA with the 32 hard-coded, since it's the fastest way of obtaining the warp size.
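
For illustration, the CUDA-only part could be as small as the following sketch (not the exact alpaka code; the free-function form and the DevCudaRt parameter are simplifications of the real trait machinery):

#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
// Sketch only: all CUDA GPUs use a warp size of 32, so the CUDA-specific
// version can return the constant without any runtime query.
inline auto getPreferredWarpSize(DevCudaRt const& /* dev */) -> std::size_t
{
    return std::size_t{32};
}
#endif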

mehmetyusufoglu commented 7 months ago

Thanks, I am fixing it.


fwyzard commented 6 months ago

Why do you use a different approach for the CUDA and HIP cases?

Either define them both in include/alpaka/dev/DevUniformCudaHipRt.hpp, or define each of them in include/alpaka/dev/DevCudaRt.hpp and include/alpaka/dev/DevHipRt.hpp.
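
As a rough sketch of the second option (illustrative only; the free-function form follows the snippets quoted in this thread, and an ApiHipRt::deviceAttributeWarpSize member analogous to the one added to ApiCudaRt.hpp is assumed):

// include/alpaka/dev/DevCudaRt.hpp (sketch): hard-coded warp size for CUDA.
inline auto getPreferredWarpSize(DevCudaRt const& /* dev */) -> std::size_t
{
    return std::size_t{32}; // the warp size is 32 on all CUDA GPUs
}

// include/alpaka/dev/DevHipRt.hpp (sketch): query the HIP runtime attribute.
inline auto getPreferredWarpSize(DevHipRt const& dev) -> std::size_t
{
    int warpSize = 0;
    ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(ApiHipRt::deviceGetAttribute(
        &warpSize, ApiHipRt::deviceAttributeWarpSize, dev.getNativeHandle()));
    return static_cast<std::size_t>(warpSize);
}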

fwyzard commented 6 months ago

For example, the minimal changes would have been to change just

                typename TApi::DeviceProp_t devProp;
                ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::getDeviceProperties(&devProp, dev.getNativeHandle()));

to

                int warpSize = 0;
                ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(
                    TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle()));
                return static_cast<std::size_t>(warpSize);
mehmetyusufoglu commented 6 months ago
Thanks. Let me finish: I don't want to return 32 from the CUDA deviceGetAttribute; I will specialize GetPreferredWarpSize only, in two different files as you said.

mehmetyusufoglu commented 6 months ago

For example, the minimal changes would have been to change just

                typename TApi::DeviceProp_t devProp;
                ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(TApi::getDeviceProperties(&devProp, dev.getNativeHandle()));

to

                int warpSize = 0;
                ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(
                    TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle()));
                return static_cast<std::size_t>(warpSize);

@fwyzard Do you think changing deviceGetAttribute is better?

mehmetyusufoglu commented 6 months ago
int warpSize = 0;
ALPAKA_UNIFORM_CUDA_HIP_RT_CHECK(
    TApi::deviceGetAttribute(&warpSize, TApi::deviceAttributeWarpSize, dev.getNativeHandle()));
return static_cast<std::size_t>(warpSize);

@fwyzard What is your opinion? In my opinion, changing the deviceGetAttribute function of ApiCudaRt when the argument (the attribute) is deviceAttributeWarpSize is not a clean solution. In the solution I proposed there is no if-else or nested #ifs anymore. Your suggestion needs fewer changes, for sure.

fwyzard commented 6 months ago

@fwyzard Do you think changing deviceGetAttribute is better?

No, TApi::deviceGetAttribute is a wrapper around the underlying API function, and it should keep the same behaviour and interface.

However, we don't need to call TApi::deviceGetAttribute for the CUDA architecture, because we know that the warp size is always 32.

fwyzard commented 6 months ago

There is one potential problem with this approach: a file that includes alpaka/dev/DevUniformCudaHipRt.hpp but does not include explicitly alpaka/dev/DevHipRt.hpp or alpaka/dev/DevCudaRt.hpp will miss the getPreferredWarpSize() specialisations.

Basically, this change makes DevUniformCudaHipRt.hpp an internal implementation detail that other code should not use directly.

@psychocoderHPC do you think this is a problem?

mehmetyusufoglu commented 6 months ago

There is one potential problem with this approach: a file that includes alpaka/dev/DevUniformCudaHipRt.hpp but does not include explicitly alpaka/dev/DevHipRt.hpp or alpaka/dev/DevCudaRt.hpp will miss the getPreferredWarpSize() specialisations.

Basically, this change makes DevUniformCudaHipRt.hpp an internal implementation detail that other code should not use directly.

@psychocoderHPC do you think this is a problem?

OK, I selected the shorter solution suggested by Andrea (@fwyzard). Specialisations in two different files would be cleaner but would need many changes.