tdd11235813 opened this issue 6 years ago
I am not sure if I understand the problem correctly, but if you want to call a host-only method (`__builtin_popcount`) within a kernel, you cannot declare the kernel/function `__host__ __device__` with nvcc. Clang is much more intelligent and allows this as long as the kernel/function is never called in a `__device__` context.
There is an example for the other direction, calling a device-only function within a kernel. This is only possible when declaring the kernel method as `ALPAKA_FN_ACC`. In your case you will have to declare the method as `ALPAKA_FN_HOST` or use the preprocessor to select the correct version.
When using `ALPAKA_FN_HOST` for defining a CPU-only intrinsic, we make assumptions about the behavior of the CPU back end: we assume that the hardware doing the calculation is in fact the same as the host. Furthermore, there is no opposite of `ALPAKA_FN_HOST` for defining a dummy version for GPU code generation. `ALPAKA_FN_ACC` also defines `__host__` on mixed accelerator code, and `ALPAKA_FN_ACC_CUDA_ONLY` does not work for non-CUDA GPU accelerators.
Using `ALPAKA_ACC_GPU_CUDA_ONLY_MODE` is not really a solution if I want to use the same kernel on CPU and GPU at the same time with different specializations, e.g. of bitcount. In that case I need to implicitly define a dummy `__device__` function like this:
```cpp
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if __CUDA_ARCH__ != 0
        return 0;
#else
        return __builtin_popcount(v);
#endif
    }
};
```
However, the CPU accelerator should not check for CUDA, as this CPU code will then fail e.g. for AMD GPU code, which does not make much sense. So I suggest a new alpaka macro like

```cpp
#define ALPAKA_STAGE_1_VERSION __CUDA_ARCH__
```

(of course with a different implementation for AMD GPU compilers, and set to 0 for CPU-only compilations). Then the CPU code could check for non-CPU stages of the compilation and disable intrinsics and/or create dummy code, which will never be used anyway because of the template specialization.
I think `ALPAKA_STAGE_1_VERSION` is still not very self-contained (how do I know whether stage 1 is the device or the host compiler?):
```cpp
#define ALPAKA_STAGE_1 (__CUDA_ARCH__ != 0)
// ...
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_1 // <-- not obvious that the host code has to go into the #else branch
        return 0;
#else
        return __builtin_popcount(v);
#endif
    }
};
```
The average client like me only knows whether their code runs on the CPU/host or not. Thus, I would like to see something like this:
```cpp
// --- how the macro definitions could look ---
#ifdef __CUDA_ARCH__
#  define ALPAKA_STAGE_HOST (__CUDA_ARCH__ == 0)
#elif defined(__HIP_ARCH__)
// see https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_porting_guide.md#hip_arch-defines
#  define ALPAKA_STAGE_HOST (__HIP_ARCH__ == 0)
#else
#  define ALPAKA_STAGE_HOST (1)
#endif
#define ALPAKA_STAGE_DEVICE (!ALPAKA_STAGE_HOST)

// --- host-specific part ---
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_HOST
        return __builtin_popcount(v);
#else
        return 0;
#endif
    }
};

// --- device-specific part ---
template<>
struct Bitcount<alpaka::acc::AccGpuCudaRt<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC
    auto operator()(uint_t v) -> uint_t
    {
#if ALPAKA_STAGE_DEVICE
        return __popc(v);
#else
        return 0;
#endif
    }
};
```
I know the last part is also achievable with less code by using `ALPAKA_ACC_GPU_CUDA_ONLY_MODE`, but this way might help to establish a common code pattern when using platform-specific functions.
Btw, on my other test platform (taurus) nvcc only prints a warning when calling a host function within a device function; it compiles just fine. I have loaded the same modules, but maybe paths were different, as the test platforms run different operating systems.
Sounds reasonable if more offloading architectures than CUDA are supported. Just the whole "stage" naming is confusing and unclear.
Why not, instead of the CUDA-only macro `ALPAKA_FN_ACC_CUDA_ONLY`, define an additional `ALPAKA_FN_ACC_OFFLOADING_ONLY` that triggers for any kind of GPU/offloading device instead of CUDA only? Or call it `ALPAKA_IS_HOST` or `ALPAKA_IS_ACC`.
Btw, if you are looking for the full list of what is allowed, warning and erroring on device->host and vice versa, please see: https://github.com/ComputationalRadiationPhysics/picongpu/pull/1707#issuecomment-267571760 (note: also see René's comment below mine!)
Yeah, one of the two options you provided will do it. At the moment the following code actually works on my current test system (taurus, gcc 5.3, boost 1.64.0, CUDA 8):
```cpp
template<>
struct Bitcount<alpaka::acc::AccCpuSerial<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_HOST
    auto operator()(uint_t v) -> uint_t
    {
        return __builtin_popcount(v);
    }
};

template<>
struct Bitcount<alpaka::acc::AccGpuCudaRt<alpaka::dim::DimInt<1u>, size_t> >
{
    ALPAKA_FN_ACC_CUDA_ONLY // <-- yup, could be replaced by something more generic
                            //     like ...ACC_ONLY, but only if the intrinsic is
                            //     available on all device platforms
    auto operator()(uint_t v) -> uint_t
    {
        return __popc(v);
    }
};
```
IIRC, this failed on my other test system, hence this issue. Alas, I don't have access to that system at the moment :/ Curiously, clang 3.7 + CUDA 8 now fails to compile (missing atomic include; even when provided manually, it still fails at templates ...). I need more time and platforms to dig around.
Thanks for the link to the list of warnings/errors of host|device traps, a helpful overview.
Hi,
we discovered the following problem:

- the `__host__` and `__device__` attribute for the `operator()`
- `__popc()` etc.

So the main problem is:

- `__host__ __device__` kernel -> `__host__` sub-function -> compiler error from above
- `__host__ __device__` kernel -> `__host__ __device__` sub-function -> intrinsic unknown (it's obvious, as I wanted acc-specific paths at the template level)

Possible workarounds:
Edit: style+better text