alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0
356 stars 74 forks

fix and update SYCL targets #2390

Closed AuroraPerego closed 2 months ago

AuroraPerego commented 2 months ago

Now all device targets are always defined: the targets we are not compiling for are defined as `0`, while the one(s) we are compiling for are defined as `1`. Therefore `#if defined(...)` can no longer be used to detect the active target; the checks must test the macro's value instead.

I've added some new targets as well.

Note that there is probably a bug with NVIDIA targets, but we are not using them for now.

fwyzard commented 2 months ago

@psychocoderHPC the error in mathTest (link, attached log) should be unrelated to this PR:

Randomness seeded to: 2485158815
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mathTest is a Catch2 v3.5.2 host application.
Run with -? for options
-------------------------------------------------------------------------------
mathOpsComplexFloat - TestAccFunctorTuplesComplex - 49
-------------------------------------------------------------------------------
/builds/hzdr/crp/alpaka/test/unit/math/src/mathComplexFloat.cpp:26
...............................................................................
/builds/hzdr/crp/alpaka/test/unit/math/src/TestTemplate.hpp:145: FAILED:
  REQUIRE( isApproxEqual(results(i), std_result) )
with expansion:
  false
with messages:
  testing acc:alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::
  integral_constant<unsigned long, 1ul>, unsigned long> data type:alpaka::
  internal::Complex<float> functor:mathtest::OpPowComplex seed:4140714306
  Operator: OpPowComplex
  Type: alpaka::internal::Complex<float>
  The args buffer: 
  capacity: 1000
  0: [ (1e+01,1e+01), (0,0) ]
  ...
  999: [ (2,2), (-0.8,-0.3) ]

  Idx i: 1 computed : (506.109,-1.62104e+06) vs expected: (509.202,-1.62104e+
  06)
===============================================================================
test cases:    278 |    277 passed | 1 failed
assertions: 267086 | 267085 passed | 1 failed

Any ideas what may have caused it, or how to reproduce it?

psychocoderHPC commented 2 months ago

I retriggered the failing test

psychocoderHPC commented 2 months ago

After reviewing the behaviour of HIP with @AuroraPerego, we think that all AMD GPUs starting from gfx 10.0 should support only a subgroup size of 32.

Yes, I checked it against https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html#accelerator-and-gpu-hardware-specifications (the table has different tabs),

and AMD writes in https://rocm.docs.amd.com/projects/HIP/en/latest/understand/hardware_implementation.html#rdna-architecture:

RDNA makes a fundamental change to CU design, by changing the size of a warp to 32 threads. 
This is done by effectively combining two GCN5 SIMDs, creating a VALU of width 32, 
so that a whole warp can be issued in one cycle. 
The CU is also replaced by the work group processor (WGP), 
which encompasses two CUs. For backwards compatibility the WGP can also run in wave64 mode, 
in which it issues a warp of size 64 in two cycles.

fwyzard commented 2 months ago

Yes... the RDNA1, RDNA2 and RDNA3 architecture whitepapers suggest that both 32 and 64 can be used.

And clang++ does have an -mwavefrontsize64 option to enable wavefronts (warps) with 64 elements.

But HIP does not like it:

/opt/rocm/include/hip/hip_runtime.h
#if __HIP_DEVICE_COMPILE__ && !__GFX7__ && !__GFX8__ && !__GFX9__ && __AMDGCN_WAVEFRONT_SIZE == 64
#error HIP is not supported on the specified GPU ARCH with wavefront size 64
#endif

I assume that the ROCm backend for SYCL/oneAPI is built with HIP, so on gfx 10 and later we should use only a wavefront size of 32.