mirenradia opened 2 weeks ago
Unfortunately, performance on >1 GPUs is severely degraded with v2024.1 of the Intel oneAPI software stack. For now, I advise downgrading to the following modules:

```
 1) dot
 2) rhel8/slurm
 3) dawn-env/2023-12-22 (default)
 4) dawn-env/2024-04-15
 5) rhel8/global
 6) default-dawn
 7) intel-oneapi-compilers/2024.0.0/gcc/znjudqsi
 8) intel-oneapi-tbb/2021.11.0/oneapi/xtkj6nyp
 9) intel-oneapi-mkl/2024.0.0/oneapi/4n7ruz44
10) intel-oneapi-mpi/2021.11.0/oneapi/h7nq7sah
11) intel-oneapi-dpl/2022.3.0/oneapi/p4oxz76c
12) intel-oneapi-inspector/2024.0.0/oneapi/a6zqe3ll
```
Summary
I have tested running the BinaryBH example on Dawn, which contains Intel Data Center GPU Max 1550s (codename: 'Ponte Vecchio', often abbreviated to 'PVC').
Resources
Useful information can be found in the following sources:
GPU Architecture
The PVCs on Dawn are composed of 2 stacks (previously referred to as "tiles", a term still common in Intel documentation). Each stack is effectively a separate GPU, except that the 2 stacks can share GPU memory and can communicate with each other at relatively high bandwidth (16 Xe links at 26.5 GB/s in each direction - see here for further details).
Additional information can be found in the Xe Architecture section of the oneAPI GPU Optimization Guide.
At the time of writing, the PVCs on Dawn are using Level Zero driver version 1.3.26516 (as reported by `sycl-ls --verbose`).

Empirically, the best performance is achieved when there is 1 MPI process (or rank) per stack (i.e. 2 MPI ranks per GPU).
Software Environment
All the tests below use the following modules:
GPU-aware MPI
To enable passing GPU buffers directly to MPI calls it is necessary to set the following Intel MPI environment variable:
Furthermore, it is necessary to enable it at the AMReX level by setting the following parameter
either as an argument on the command line or in the parameter file.
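Concretely, the two settings above can be sketched as follows; `I_MPI_OFFLOAD` matches the variable referenced later in this post, while `amrex.use_gpu_aware_mpi` is taken from AMReX's runtime parameters (verify both against your software versions):

```shell
# Intel MPI: enable GPU support so device buffers can be passed straight
# to MPI calls
export I_MPI_OFFLOAD=1

# AMReX: enable GPU-aware MPI, either on the command line as below or by
# putting the same line in the parameter file (executable/inputs names
# are placeholders)
# ./<executable> <inputs-file> amrex.use_gpu_aware_mpi=1
```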
GPU Pinning
Currently, SLURM interferes with the GPU pinning support provided in the Intel MPI Library because it sets the `ZE_AFFINITY_MASK` environment variable automatically, so we will need to unset this variable.

For now, we will restrict ourselves to a single node so that we can change the MPI bootstrap server from `slurm` to `fork` (which only works intra-node) by passing `-bootstrap fork` to `mpiexec`.

We can also set the `ZE_FLAT_DEVICE_HIERARCHY` environment variable to `FLAT`, which exposes the stacks as separate devices to programs. With the current GPU drivers, the default is `COMPOSITE`, but this should change to `FLAT` in later driver versions. In any case, we can set it explicitly.

We can also set the interface Intel MPI uses for GPU topology recognition (this should be done automatically if `I_MPI_OFFLOAD=1`).

GPU pinning should work automatically if the above advice has been followed but, if not, it can be enabled explicitly.
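Put together, the steps above might look like the following in a single-node job script. `ZE_AFFINITY_MASK` and `ZE_FLAT_DEVICE_HIERARCHY` appear earlier in this post; the topology-recognition variable name (`I_MPI_OFFLOAD_TOPOLIB`) is an assumption from the Intel MPI developer reference, and the executable and inputs file are placeholders:

```shell
# Undo SLURM's automatic pinning so Intel MPI can manage it itself
unset ZE_AFFINITY_MASK

# Expose each PVC stack as a separate device (expected to become the
# default in later drivers)
export ZE_FLAT_DEVICE_HIERARCHY=FLAT

# GPU topology recognition via Level Zero (should be set automatically
# when I_MPI_OFFLOAD=1; name assumed from the Intel MPI docs)
export I_MPI_OFFLOAD_TOPOLIB=level_zero

# Single-node launch with the fork bootstrap; 8 ranks corresponds to
# 1 rank per stack on a 4-GPU node (placeholder names)
# mpiexec -bootstrap fork -n 8 ./<executable> <inputs-file>
```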
One can get Intel MPI to print the pinning topology by setting either of two environment variables. The aim is to see that a single tile (or stack) is pinned to each rank.
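The two usual candidates for printing the pinning topology are Intel MPI's general debug output and its dedicated topology-print switch; both names below are taken from the Intel MPI developer reference, so verify them for your version:

```shell
# Debug level 3 includes the rank-to-GPU pin mapping in the output
export I_MPI_DEBUG=3
# ...or ask specifically for the GPU pinning topology
export I_MPI_OFFLOAD_PRINT_TOPOLOGY=1
```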
I have yet to explore how best to do GPU pinning in the case of multi-node jobs but it may require some kind of wrapper script.
Ahead-Of-Time compilation
By default, building with Intel GPU support via SYCL (`USE_SYCL = TRUE`) uses Just-in-Time (JIT) compilation: code for a specific device is not produced at build time; rather, the SYCL code is compiled down to an intermediate representation (IR) and then compiled on-the-fly at runtime, when the specific device is known. Whilst this saves the user from having to figure out how to target a specific device, it increases runtimes and makes performance comparisons (particularly with other GPU backends) inaccurate. Since we know we will be targeting PVCs, we can instead do Ahead-of-Time (AOT) compilation. This can be enabled by adding the following code to your `Make.local-pre` file under `/path/to/amrex/Tools/GNUMake`:
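A sketch of the kind of `Make.local-pre` additions this describes, assuming AMReX's SYCL make options (`SYCL_AOT`, `AMREX_INTEL_ARCH`, and `SYCL_PARALLEL_LINK_JOBS` are names to verify against your AMReX version):

```make
ifeq ($(USE_SYCL),TRUE)
  # Compile device code for PVC at build time rather than JIT at runtime
  SYCL_AOT = TRUE
  AMREX_INTEL_ARCH = pvc
  # Run the AOT device compilation at link time with parallel jobs
  SYCL_PARALLEL_LINK_JOBS = 24
  <any other SYCL options>
endif
```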
Note that since AOT compilation happens at link time, it can be quite slow, hence the `SYCL_PARALLEL_LINK_JOBS` option, which allows this device code compilation to occur in parallel. This should not interfere with `make`'s parallel build jobs option (`-j 24`) because linking must occur after all other files are compiled. Since it is currently only possible to use the Dawn software stack on the PVC compute nodes, I have assumed you have requested at least 24 cores (corresponding to $1/4$ of the available cores on a node, assuming you request at least 1 GPU).
Register Pressure
There is a useful section Registers and Performance in the Intel oneAPI GPU Optimization Guide.
Empirically, it is found that the BinaryBH example (and most likely the CCZ4 RHS) exhibits kernels with high register pressure and this can be observed at AOT device compilation time. With the current AMReX defaults, you will see lots of warnings about register spills like the following:
Register spills severely hurt performance and should be avoided as far as possible. The size of the register file can be maximised by setting the following AMReX makefile options^1 (also in the `Make.local-pre` file, in the `<any other SYCL options>` block above).