Unit tests do not work on Intel GPUs

mirenradia commented 9 months ago

Summary

The unit tests do not work on Intel GPUs. I think there seems to be some incompatibility between Catch2 and the Intel DPC++ compiler's SYCL implementation.

Steps to reproduce

Here are some steps to reproduce on Dawn at its state on 2024-02-26 (likely to change soon):

SSH to Dawn

Clone AMReX:

git clone https://github.com/AMReX-Codes/amrex.git

Clone this repository:

git clone https://github.com/GRTLCollaboration/GRTeclyn.git

Start an interactive job (it is not currently possible to build on the login nodes).

Load the required Intel modules:

module load intel-oneapi-compilers/2024.0.0/gcc/znjudqsi intel-oneapi-mkl/2024.0.0/oneapi/4n7ruz44

Change into the Tests directory:
```
cd GRTeclyn/Tests
```
Build with USE_SYCL=TRUE:
```
make -j <num jobs>  USE_SYCL=TRUE
```
Run the tests
```
./Tests3d.sycl.ex
```

Observed outcome

The tests abort with the following error:

amrex::Abort::0::ParallelFor: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)!!!!! !!!
SIGABRT
See Backtrace.0 file for details

Expected outcome

The tests should work without issues.

Additional information

I think there are several compounding issues here:

The Intel DPC++ compiler has trouble linking unnamed device kernels in Catch2 test cases. See intel/llvm#10659.

Even if I remove all but one of the test cases with amrex::ParallelFors such as the "CCZ4 RHS" test case by modifying test_dirs in Tests/GNUMakefile to just
```
test_dirs = $(GRTECLYN_TESTS_HOME)/Common \
           $(GRTECLYN_TESTS_HOME)/CCZ4RHSTest
```
(this test contains multiple amrex::ParallelFors), I get the following error
```
amrex::Abort::0::ParallelFor: Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME)!!!!! !!!
SIGABRT
See Backtrace.0 file for details
```
Interestingly, building with DEBUG=TRUE does allow this single test to pass.
Intel's Level Zero runtime uses SIGSEGV (i.e. a segfault) to trigger migration of memory between host and device when using USM shared allocations. We need to disable Catch2's POSIX signal handling to get around this (note that this is automatically done for AMReX's SIGSEGV handling when running on Intel GPU's - see here). It should be sufficient to simply define the macro CATCH_CONFIG_NO_POSIX_SIGNALS.

mirenradia commented 9 months ago

Unfortunately switching to doctest (#47) does not seem to completely solve this. On b340cd1edaa5ba345b2100a4e3384e8a705e458b, the tests still prematurely abort with the same error code. However, they pass with DEBUG=TRUE.

Playing around with the optimization flags set in amrex/Tools/GNUMake/comps/dpcpp.make:L35 gives the change from -O1 to -O2 as the culprit (i.e. tests pass with -O1 and prematurely abort with -O2). I believe -O2 also implies -fvectorize -fslp-vectorize -fsycl-dead-args-optimization (using this stackoverflow answer) but the tests pass with -O1 -fvectorize -fslp-vectorize -fsycl-dead-args-optimization and fail with -O2 suggesting it must be something else enabled with -O2...

mirenradia commented 9 months ago

Running this through ze_tracer -c --demangle on b340cd1edaa5ba345b2100a4e3384e8a705e458b gives the following Level Zero call that fails:

>>>> [17719443688] zeKernelCreate: hModule = 0x7353da0 desc = 0x7ffee80ad960 {ZE_STRUCTURE_TYPE_KERNEL_DESC(0x1d) 0 0 "_ZTSZZN5amrex11ParallelForILi256EZ17run_ccz4_rhs_testvEUliiiE0_EEvRKNS_3Gpu10KernelInfoERKNS_3BoxEOT0_ENKUlRN4sycl3_V17handlerEE0_clESE_EUlNSC_7nd_itemILi1EEEE_" (amrex::ParallelFor<256, run_ccz4_rhs_test()::{lambda(int, int, int)#2}>(amrex::Gpu::KernelInfo const&, amrex::Box const&, run_ccz4_rhs_test()::{lambda(int, int, int)#2}&&)::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::nd_item<1>)#1})} phKernel = 0x7ffee80ad958 (hKernel = 0x1545e1c8f0d0)
<<<< [17719467271] zeKernelCreate [2270 ns] -> ZE_RESULT_ERROR_INVALID_KERNEL_NAME(0x2013265937)

mirenradia commented 8 months ago

Using the -save-temps option and inspecting the integration headers and preprocessed *.ii files shows no difference between -O1 and -O2 other than the short hash.

Enabling shader dumps shows that 10 different *.spv files (SPIR-V binaries) are created with -O1 but only 3 are created for -O2 so it seems some device kernels are not being compiled in the latter case.

mirenradia commented 7 months ago

Since there is a new software stack on Dawn (dawn-env/2024-04-15) with oneAPI 2024.1, I thought I would try this again but I still run into the same problem.

GRTLCollaboration / GRTeclyn