alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Host/Device kernel issues with accelerator-specific intrinsics and two-stage compiler (nvcc, ..) #457


tdd11235813 commented 6 years ago

Hi,

we discovered the following problem: a kernel that uses accelerator-specific intrinsics (e.g. __builtin_popcount on the CPU, __popc on CUDA devices) cannot simply be declared for both host and device when a two-stage compiler such as nvcc is used, because both compilation passes see the call to the intrinsic of the other side.

So the main problem is: alpaka currently offers no accelerator-agnostic way to restrict such code to the compilation stage where it is valid.

Possible workarounds (ALPAKA_FN_HOST plus preprocessor selection, ALPAKA_ACC_GPU_CUDA_ONLY_MODE, or per-accelerator template specializations) are discussed below.
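
For illustration, a minimal sketch of the failing pattern (not the original snippet; it assumes an ALPAKA_FN_ACC kernel functor compiled with nvcc):

// Sketch: a kernel functor shared by the CPU and CUDA back ends.
struct PopcountKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC auto operator()(TAcc const & acc, unsigned int v) const -> unsigned int
    {
        // With nvcc, ALPAKA_FN_ACC expands to __host__ __device__, so the device
        // compilation pass also sees this host-only builtin and rejects (or at
        // least warns about) the call.
        return __builtin_popcount(v);
    }
};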

Edit: style+better text

BenjaminW3 commented 6 years ago

I am not sure if I understand the problem correctly, but if you want to call a host-only method (__builtin_popcount) within a kernel, you cannot declare the kernel/function __host__ __device__ with nvcc. Clang is much more intelligent and allows this as long as the kernel/function is never called within a __device__ context. There is an example for the other direction: calling a device-only function within a kernel. This is only possible when declaring the kernel method as ALPAKA_FN_ACC. In your case you will have to declare the method as ALPAKA_FN_HOST or use the preprocessor to select the correct version.
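
For illustration, the preprocessor variant could look roughly like this (a sketch only, using nvcc's __CUDA_ARCH__ as the stage indicator):

// Sketch: one function body, with the intrinsic selected per compilation pass.
ALPAKA_FN_ACC auto popcount(unsigned int v) -> unsigned int
{
#ifdef __CUDA_ARCH__
    return __popc(v);              // nvcc device pass: CUDA intrinsic
#else
    return __builtin_popcount(v);  // host pass: GCC/Clang builtin
#endif
}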

theZiz commented 6 years ago

When using ALPAKA_FN_HOST for defining a CPU-only intrinsic, we make assumptions about the behavior of the CPU back end: we assume that the compute hardware used is in fact the same as the host. Furthermore, there is no opposite of ALPAKA_FN_HOST for defining a dummy version for GPU code generation. ALPAKA_FN_ACC also defines __host__ on mixed accelerator code, and ALPAKA_FN_ACC_CUDA_ONLY does not work for non-CUDA GPU accelerators.

Using ALPAKA_ACC_GPU_CUDA_ONLY_MODE is not really a solution if I want to use the same kernel on CPU and GPU at the same time with different specializations, e.g. for bit counting. In that case I need to implicitly define a dummy __device__ function like this:

template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
        #if __CUDA_ARCH__ != 0 // non-zero only in the device compilation pass
            return 0; // dummy, never used thanks to the template specialization
        #else
            return __builtin_popcount(v);
        #endif
    }
};

However, the CPU accelerator should not have to check for CUDA: this CPU code would now fail when compiled e.g. for an AMD GPU back end, which does not make much sense. So I suggest a new alpaka macro like

#define ALPAKA_STAGE_1_VERSION __CUDA_ARCH__

(Of course with a different implementation for AMD GPU compilers, and set to 0 for CPU-only compilations.) Then the CPU code could check for non-CPU stages of the compilation and disable intrinsics and/or create dummy code, which will never be used anyway because of the template specialization.
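
A rough sketch of how that could be defined (illustrative only; the branch for AMD GPU compilers is left out):

// Sketch: non-zero while compiling for a device, 0 during host/CPU-only compilation.
#if defined(__CUDA_ARCH__)
    #define ALPAKA_STAGE_1_VERSION __CUDA_ARCH__
#else
    // host pass of nvcc, or a pure CPU compilation
    // (an analogous branch would be needed for AMD GPU compilers)
    #define ALPAKA_STAGE_1_VERSION 0
#endif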

tdd11235813 commented 6 years ago

I think ALPAKA_STAGE_1_VERSION is still not very self-contained (how would I know whether stage 1 is the device or the host compiler?):

#define ALPAKA_STAGE_1 (__CUDA_ARCH__!=0)
//...
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> > {
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_1 // <-- don't see that I have to place stuff into the else branch
      return 0;
#else
      return __builtin_popcount(v);
#endif
    }
};

An average user like me only knows whether their code runs on the CPU/host or not. Thus, I would like to see something like this:

// --- how macro def could look like ---
#ifdef __CUDA_ARCH__
#define ALPAKA_STAGE_HOST (__CUDA_ARCH__==0)
#elif defined(__HIP_ARCH__)
// see https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_porting_guide.md#hip_arch-defines
#define ALPAKA_STAGE_HOST (__HIP_ARCH__==0)
#else
#define ALPAKA_STAGE_HOST (1)
#endif
#define ALPAKA_STAGE_DEVICE (!ALPAKA_STAGE_HOST)
// --- Host specific part ----
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> > {

    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_HOST
        return __builtin_popcount(v);
#else
        return 0;
#endif
    }
};
// --- Device-specific part ---
template<>
struct Bitcount<alpaka::acc::AccGpuCudaRt<alpaka::dim::DimInt<1u>, size_t> >{

    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_DEVICE
        return __popc(v);
#else
        return 0;
#endif
    }
};

I know the last part is also achievable with less code by using ALPAKA_ACC_GPU_CUDA_ONLY_MODE, but this way might help to establish a common code pattern for using platform-specific functions?
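
For comparison, a rough sketch of that shorter variant (assuming ALPAKA_ACC_GPU_CUDA_ONLY_MODE is defined, so only the CUDA back end is compiled):

// Sketch: with only the CUDA back end enabled, the primary template can call
// the CUDA intrinsic directly, without per-accelerator specializations.
template<typename TAcc>
struct Bitcount
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
        return __popc(v);
    }
};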

Btw, on my other test platform (taurus) nvcc only prints a warning when calling a host function within a device function; it compiles just fine. I have loaded the same modules, but maybe the paths were different, as the test platforms run different operating systems.

ax3l commented 6 years ago

Sounds reasonable if more offloading architectures than CUDA are supported. Just the whole "stage" naming is confusing and unclear.

Why not, instead of the CUDA-only macro ALPAKA_FN_ACC_CUDA_ONLY, define an additional ALPAKA_FN_ACC_OFFLOADING_ONLY that triggers for any kind of "GPU/offloading" device instead of CUDA only?

Or call it ALPAKA_IS_HOST or ALPAKA_IS_ACC.
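
Mapped onto the macro sketch above, that naming could look roughly like this (illustrative only, none of it exists in alpaka yet):

// Sketch: same idea as ALPAKA_STAGE_HOST/ALPAKA_STAGE_DEVICE, just renamed.
#if defined(__CUDA_ARCH__)
    #define ALPAKA_IS_ACC 1
#else
    // analogous branches for other offloading compilers would go here
    #define ALPAKA_IS_ACC 0
#endif
#define ALPAKA_IS_HOST (!ALPAKA_IS_ACC)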

Btw, if you are looking for the full list of what is allowed, what warns, and what errors for device->host calls and vice versa, please see: https://github.com/ComputationalRadiationPhysics/picongpu/pull/1707#issuecomment-267571760 (note: also see René's comment below mine!)

tdd11235813 commented 6 years ago

Yeah, either of the two options you proposed would do it. At the moment, the following code actually works on my current test system (taurus, gcc 5.3, boost 1.64.0, CUDA 8):

template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> > {
    ALPAKA_FN_HOST
    auto operator()(uint_t v) -> uint_t
    {
        return __builtin_popcount(v);
    }
};
template<>
struct Bitcount<alpaka::acc::AccGpuCudaRt<alpaka::dim::DimInt<1u>, size_t> >{
    ALPAKA_FN_ACC_CUDA_ONLY // <-- yup, could be replaced by something more generic like ...ACC_ONLY
    auto operator()(uint_t v) -> uint_t // edit: ^ only if the instruction is available on all device platforms
    {
        return __popc(v);
    }
};

IIRC, this failed on my other test system, hence this issue. Alas, I don't have access to that system at the moment :/ Curiously, clang 3.7 + CUDA 8 now fails to compile (missing atomic include; even when provided manually, it still fails at templates ...). I need more time and platforms to dig around.

Thanks for the link to the list of warnings/errors for host|device traps, a helpful overview.