GRTLCollaboration / GRTeclyn

Port of GRChombo to AMReX - under development!
BSD 3-Clause "New" or "Revised" License

Unit tests do not work on AMD GPUs #48

Closed: mirenradia closed 6 months ago

mirenradia commented 7 months ago

Summary

Like #46 for Intel GPUs, there seem to be problems with our unit tests on AMD GPUs, at least with the version of ROCm (5.1.0) and the specific GPU that I tried (I think MI210).

Steps to reproduce

Here are some steps to reproduce on the AMD GPU nodes on COSMA.

  1. SSH to COSMA8
  2. Clone AMReX:
    git clone https://github.com/AMReX-Codes/amrex.git
  3. Clone this repo:
    git clone https://github.com/GRTLCollaboration/GRTeclyn.git
  4. SSH to ga005 (which is running ROCm v5.1.0 at the time of writing)
  5. ROCm 5.1.0 is already in the system path, so there is no need to load any modules for it. However, it is necessary to load a newer version of GCC and tell hipcc to use it. First load the module:
    module load gnu_comp/11.1.0
  6. Add the following lines to ~/amrex/Tools/GNUMake/Make.local (create this file if it doesn't exist):
    ifeq ($(USE_HIP),TRUE)
     GCC_PATH := $(shell realpath -m $(shell which gcc)/../..)
     GCC_VERSION := $(notdir $(GCC_PATH))
     CXXFLAGS += --gcc-toolchain=$(GCC_PATH)
     SYSTEM_INCLUDE_LOCATIONS += $(GCC_PATH)/include/c++/$(GCC_VERSION)
     AMREX_AMD_ARCH = gfx90a
    endif
  7. Change into the Tests directory:
    cd ~/GRTeclyn/Tests
  8. Build with USE_HIP=TRUE:
    make -j 128 USE_HIP=TRUE
  9. Run the tests:
    make run USE_HIP=TRUE

Observed outcome

The tests abort with the following error:

:0:rocdevice.cpp            :2614: 1823894239190 us: 124592: [tid:0x2b2d6af70700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
SIGABRT
See Backtrace.0 file for details

Additional information

Passing -DCATCH_CONFIG_NO_COUNTER to the preprocessor allows the tests to work (although the CCZ4 RHS test currently fails as its tolerances are too tight). This flag changes the way Catch2 internally generates unique names for test cases: it uses the __LINE__ predefined macro instead of __COUNTER__.

The test cases are currently in different translation units (i.e. different .cpp files), so __COUNTER__ ends up being 0 for each of them. As a result, the "unique" names are not unique and the ROCm linker has difficulty figuring out which device code to link to each kernel in the test cases. Since TEST_CASE is currently on a different line for each test (although we can't guarantee that going forward), __LINE__ ends up being different and the test cases are uniquely named.

mirenradia commented 7 months ago

The tests work on b340cd1edaa5ba345b2100a4e3384e8a705e458b but the CCZ4 RHS one fails as the tolerances are too tight.