GRTLCollaboration / GRTeclyn

Port of GRChombo to AMReX - under development!

Intel GPUs: Performance and Optimization on Dawn #67


mirenradia commented 2 weeks ago

Summary

I have tested running the BinaryBH example on Dawn, which contains Intel Data Center GPU Max 1550 GPUs (codename: 'Ponte Vecchio', often abbreviated to 'PVC').

Resources

Useful information can be found in the following sources:

GPU Architecture

The PVCs on Dawn consist of 2 stacks (previously referred to as "tiles", a term which is still common in Intel documentation). Each stack is effectively a separate GPU, except that the 2 stacks can share GPU memory and can communicate with each other with relatively high bandwidth (16 Xe links at 26.5 GB/s in each direction - see here for further details).

Additional information can be found in the Xe Architecture section of the oneAPI GPU Optimization Guide.

At the time of writing, the PVCs on Dawn are using Level Zero driver version 1.3.26516 (as reported by sycl-ls --verbose).

Empirically, the best performance is achieved when there is 1 MPI process (or rank) per stack (i.e. 2 MPI ranks in the case of 1 GPU).
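As a concrete illustration (the program and inputs file names here are just placeholders), a run on a single GPU would therefore use 2 MPI ranks:

mpiexec -n 2 <other mpi args> ./<program> <inputs_file>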

Software Environment

All the tests below use the following modules:

 1) dawn-env/2024-04-15   6) gcc-runtime/13.2.0/gcc/ayevhr77               11) intel-oneapi-tbb/2021.12.0/oneapi/pvsbvzxn
 2) dot                   7) zlib-ng/2.1.6/gcc/thn3ikgx                    12) intel-oneapi-mkl/2024.1.0/oneapi/xps7uyz6
 3) rhel8/global          8) zstd/1.5.5/gcc/7o3rooli                       13) intel-oneapi-mpi/2021.12.0/oneapi/nbxgtwyb
 4) rhel8/slurm           9) binutils/2.42/gcc/s65uixqt                    14) intel-oneapi-dpl/2022.5.0/oneapi/hbaogwjd
 5) default-dawn         10) intel-oneapi-compilers/2024.1.0/gcc/wadpqv2p  15) intel-oneapi-inspector/2024.1.0/oneapi/7aorc5hi
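For reference, a minimal sketch of loading this environment is shown below. The Spack hash suffixes are specific to the Dawn deployment, and I am assuming that loading dawn-env/2024-04-15 makes these module versions visible and that their dependencies (gcc-runtime, zlib-ng, etc.) are pulled in automatically:

module load dawn-env/2024-04-15
module load intel-oneapi-compilers/2024.1.0/gcc/wadpqv2p \
            intel-oneapi-mpi/2021.12.0/oneapi/nbxgtwyb \
            intel-oneapi-mkl/2024.1.0/oneapi/xps7uyz6 \
            intel-oneapi-tbb/2021.12.0/oneapi/pvsbvzxn \
            intel-oneapi-dpl/2022.5.0/oneapi/hbaogwjd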

GPU-aware MPI

To enable passing GPU buffers directly to MPI calls it is necessary to set the following Intel MPI environment variable:

export I_MPI_OFFLOAD=1

Furthermore, it is necessary to enable it at the AMReX level by setting the following parameter

amrex.use_gpu_aware_mpi = 1

either as an argument on the command line or in the parameter file.
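Putting these together, a minimal launch sketch might look like the following (the program and inputs file names are placeholders; AMReX accepts parameter overrides after the inputs file on the command line):

export I_MPI_OFFLOAD=1
mpiexec -n 2 <other mpi args> ./<program> <inputs_file> amrex.use_gpu_aware_mpi=1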

GPU Pinning

Currently, SLURM interferes with the GPU pinning support provided in the Intel MPI Library as it sets the ZE_AFFINITY_MASK environment variable automatically. We will need to unset this:

unset ZE_AFFINITY_MASK

For now, we will restrict ourselves to a single node so that we can change the MPI bootstrap server from slurm to fork (which only works intra-node) by passing -bootstrap fork to mpiexec, e.g.:

mpiexec -bootstrap fork <other mpi args> <program> <program args>

We can also set the ZE_FLAT_DEVICE_HIERARCHY environment variable to FLAT which will expose the stacks as separate devices to programs. With the current GPU drivers, the default is COMPOSITE but this should change to FLAT with later driver versions. In any case, we can set it explicitly:

export ZE_FLAT_DEVICE_HIERARCHY=FLAT
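As a quick sanity check (the exact listing format will depend on the driver and sycl-ls version), one can confirm that each stack now appears as its own device:

# with FLAT hierarchy, each stack should be listed as a separate Level Zero device
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
sycl-ls --verbose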

We can also set the interface used by Intel MPI for GPU topology recognition as follows (this should happen automatically when I_MPI_OFFLOAD=1):

export I_MPI_OFFLOAD_TOPOLIB=level_zero

GPU pinning should work automatically if the above advice has been followed but, if not, it can be explicitly enabled with

export I_MPI_OFFLOAD_PIN=1

One can get Intel MPI to print the pinning topology by either setting

export I_MPI_DEBUG=3

or

export I_MPI_OFFLOAD_PRINT_TOPOLOGY=1

The aim is to see a single tile (or stack) pinned to each rank, e.g.

[0] MPI startup(): ===== GPU pinning on pvc-s-20 =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0    {0}
[0] MPI startup(): 1    {1}

I have yet to explore how best to do GPU pinning in the case of multi-node jobs, but it may require some kind of wrapper script.
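One possibility (untested, so very much a sketch) would be a small wrapper that selects a single device per rank via ZE_AFFINITY_MASK before launching the program. The wrapper name is hypothetical, and I am assuming the FLAT device hierarchy above and that the local rank ID is available as MPI_LOCALRANKID (Intel MPI) or SLURM_LOCALID (under srun):

#!/bin/bash
# gpu_wrapper.sh (hypothetical): pin each local rank to one stack/device
# by restricting it to a single Level Zero device.
export ZE_AFFINITY_MASK=${MPI_LOCALRANKID:-${SLURM_LOCALID}}
exec "$@"

which would then be invoked as mpiexec <other mpi args> ./gpu_wrapper.sh <program> <program args>.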

Ahead-Of-Time compilation

By default, building with Intel GPU support via SYCL (USE_SYCL = TRUE) uses Just-in-Time (JIT) compilation: code for a specific device is not produced at build time but rather the SYCL code is compiled down to an intermediate representation (IR), which is then compiled on-the-fly at runtime when the specific device is known. Whilst this saves the user from having to figure out how to target a specific device, it increases runtimes and makes performance comparisons (particularly with other GPU backends) inaccurate. Since we know we will be targeting PVCs, we can instead do Ahead-of-Time (AOT) compilation. This can be enabled by adding the following code to your Make.local-pre file under /path/to/amrex/Tools/GNUMake:

ifeq ($(USE_SYCL),TRUE)
  SYCL_AOT = TRUE
  AMREX_INTEL_ARCH = pvc
  SYCL_PARALLEL_LINK_JOBS = 24
  <any other SYCL options>
endif

Note that since AOT compilation happens at link time, it can be quite slow, hence the SYCL_PARALLEL_LINK_JOBS option, which allows this device code compilation to occur in parallel. This should not interfere with make's parallel build jobs option (-j 24) because linking must occur after all other files are compiled. Since it is currently only possible to use the Dawn software stack on the PVC compute nodes, I have assumed you have requested at least 24 cores (corresponding to $1/4$ of the available cores on a node, assuming you request at least 1 GPU), e.g.

salloc --ntasks=2 --cpus-per-task=12 --partition=pvc --gres=gpu:1 ...
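With those cores allocated, a build might then look something like the following (the example directory path is an assumption on my part, and any make variables other than those documented above are just the standard AMReX GNU Make ones):

cd /path/to/GRTeclyn/Examples/BinaryBH
make -j 24 USE_SYCL=TRUE USE_MPI=TRUE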

Register Pressure

There is a useful section Registers and Performance in the Intel oneAPI GPU Optimization Guide.

Empirically, it is found that the BinaryBH example (and most likely the CCZ4 RHS) exhibits kernels with high register pressure; this can be observed at AOT device compilation time. With the current AMReX defaults, you will see lots of warnings about register spills like the following:

warning: kernel _ZTSZZN5amrex6launchILi256EZNS_12experimental6detail11ParallelForILi256ENS_8MultiFabEZN13BinaryBHLevel15specificEvalRHSERS4_S6_dEUliiiiE0_EENSt9enable_ifIXsr10IsFabArrayIT0_EE5valueEvE4typeERKS9_RKNS_9IntVectNDILi3EEEiSH_bRKT1_EUlRKN4sycl3_V17nd_itemILi1EEEE_EEviNS_11gpuStream_tESD_ENKUlRNSM_7handlerEE_clESU_EUlSO_E_  compiled SIMD32 allocated 128 regs and spilled around 64

Register spills severely hurt performance and should be avoided as far as possible. The register space available to each work-item can be maximised by setting the following AMReX makefile options^1 (again in the Make.local-pre file, in the <any other SYCL options> block above):

  SYCL_SUB_GROUP_SIZE = 16
  SYCL_AOT_GRF_MODE = Large
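After rebuilding with these options, one way to check whether any spills remain is simply to search the build output for the spill warnings again, e.g. (build.log is just an illustrative name):

make -j 24 USE_SYCL=TRUE USE_MPI=TRUE 2>&1 | tee build.log   # same build command as above
grep -i "spilled" build.log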
mirenradia commented 2 weeks ago

Unfortunately, performance on more than 1 GPU is severely degraded with v2024.1 of the Intel oneAPI software stack. For now, I advise downgrading to the following modules:

 1) dot                            5) rhel8/global                                   9) intel-oneapi-mkl/2024.0.0/oneapi/4n7ruz44
 2) rhel8/slurm                    6) default-dawn                                  10) intel-oneapi-mpi/2021.11.0/oneapi/h7nq7sah
 3) dawn-env/2023-12-22(default)   7) intel-oneapi-compilers/2024.0.0/gcc/znjudqsi  11) intel-oneapi-dpl/2022.3.0/oneapi/p4oxz76c
 4) dawn-env/2024-04-15            8) intel-oneapi-tbb/2021.11.0/oneapi/xtkj6nyp    12) intel-oneapi-inspector/2024.0.0/oneapi/a6zqe3ll
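A minimal sketch of switching to this older environment is given below (the hash suffixes are site-specific and taken from the module list above; I am assuming the module system will swap the already-loaded newer versions for these automatically):

module load dawn-env/2023-12-22
module load intel-oneapi-compilers/2024.0.0/gcc/znjudqsi \
            intel-oneapi-mpi/2021.11.0/oneapi/h7nq7sah \
            intel-oneapi-mkl/2024.0.0/oneapi/4n7ruz44 \
            intel-oneapi-dpl/2022.3.0/oneapi/p4oxz76c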