OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io

Does OP2 support the ARMv8 architecture? #232

Closed: Qiyu8 closed this issue 1 year ago

Qiyu8 commented 2 years ago

I want to use OP2 in my HPC apps, which are mainly deployed on ARM-based machines. Is OP2 sufficiently ARM-friendly?

gihanmudalige commented 2 years ago

@Qiyu8 we have compiled and run OP2 on ARM systems before and have worked with the University of Bristol, which hosts the UK's ARM-based Isambard HPC system. If you run into any issues on your side, we will be happy to help; simply flag them in an issue here.

Qiyu8 commented 2 years ago

@gihanmudalige Thanks for your quick reply. My HPC app compiles successfully with OP2, and all of the binaries (such as xxx_cuda, xxx_mpi, xxx_mpi_openmp) work perfectly on x86. However, when I run xxx_mpi on an ARM-based machine, the performance stays the same as the sequential code no matter how many cores are used.

gihanmudalige commented 2 years ago

I am assuming you are running your code with mpirun -np <number of procs>. The MPI version needs careful handling, as there is I/O (i.e. mesh generation or reading a mesh) to be considered. Can you give us a bit more detail on how this is done?

Qiyu8 commented 2 years ago

The MPI version is OpenMPI 4.1.2. The mpirun commands are:
x86: mpirun --allow-run-as-root -np 8 ../bin/xxx_mpi -i input.dat
ARM: mpirun --allow-run-as-root -np 8 -mca pml ucx -mca btl ^vader,tcp,openib,uct ../bin/xxx_mpi -i input.dat

gihanmudalige commented 2 years ago

Ok, so how is the application reading input.dat, and how is the data distributed to each MPI proc? Have you looked at https://op2-dsl.readthedocs.io/en/latest/devapp.html, the tutorial that details the issues around distributed-memory execution?
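For reference, the overall shape of an OP2 MPI application looks roughly like the sketch below (placeholder mesh and set/map/dat names, not your application's code):

```cpp
// Hedged sketch of a typical OP2 MPI application skeleton. The set/map/dat
// names and the tiny placeholder mesh are illustrative only; a real app reads
// its block of the mesh from input.dat (or via OP2's HDF5 routines).
#include "op_seq.h"

int main(int argc, char **argv) {
  op_init(&argc, &argv, 2);   // also initialises MPI in the *_mpi builds

  // Placeholder local mesh block for this rank (normally read from file):
  int ncell = 4, nedge = 4;
  int e2c[8] = {0, 1, 1, 2, 2, 3, 3, 0};
  double x[12] = {0.0}, q[20] = {0.0};

  op_set cells = op_decl_set(ncell, "cells");
  op_set edges = op_decl_set(nedge, "edges");
  op_map edge2cell = op_decl_map(edges, cells, 2, e2c, "edge2cell");
  op_dat p_x = op_decl_dat(cells, 3, "double", x, "p_x");
  op_dat p_q = op_decl_dat(cells, 5, "double", q, "p_q");

  // Repartition the initially block-distributed mesh before any op_par_loop;
  // the partitioner library and routine are chosen here.
  op_partition("PARMETIS", "KWAY", cells, edge2cell, p_x);

  // ... op_par_loop calls over edges and cells ...

  op_timing_output();         // per-loop "time" / "MPI time" summary
  op_exit();
  return 0;
}
```

Broadly, each rank declares only its own block of each set, and op_partition is called after the declarations and before the first op_par_loop.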

--update-- Sorry, forgot to ask: did you get the speedups you wanted with the x86 runs?

Qiyu8 commented 2 years ago

Yes, the speedup on x86 is about 6x, which is perfect.

gihanmudalige commented 2 years ago

Interesting. So it seems like you have the distribution of work to MPI procs done OK. I wonder if there is some ARM-specific issue with OpenMPI. I assume you have already run our own Airfoil application and that it shows the same issue on this ARM system?

@reguly we had some of these runs on the Bristol machines; do you know what compiler and setup you used?

reguly commented 2 years ago

I think if the x86 version is working as expected, then it's unlikely to be an issue with OP2 itself (or the application), as we don't have any separate code paths for x86 and ARM; it's more likely a configuration issue. Does this ARM CPU have multiple hardware threads per core (SMT8, for example)? It's possible all the ranks get assigned to the same physical core, so you don't see a speedup.
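One quick way to check (a generic MPI sketch, not OP2 code) is to have every rank print the core it is currently scheduled on:

```cpp
// Minimal placement check (generic MPI/C++, not OP2-specific): every rank
// prints the host and core it is currently running on. Duplicate core IDs
// across ranks on the same host would point to a pinning problem.
#include <mpi.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char host[MPI_MAX_PROCESSOR_NAME];
  int len = 0;
  MPI_Get_processor_name(host, &len);
  std::printf("rank %d on %s -> core %d\n", rank, host, sched_getcpu());
  MPI_Finalize();
  return 0;
}
```

OpenMPI's mpirun --report-bindings option gives similar information without modifying any code.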

lj-cug commented 2 years ago

Today I ran the Airfoil case on the ARM machine, and it is fine: both airfoil_openmp and airfoil_mpi speed up the calculation. The wall-clock times are 370.34 s (sequential), 7.03 s (OpenMP) and 27.16 s (MPI, using mpirun -np 16). The problem is with the MG-CFD case, where the MPI parallelism does not work well. The difference between Airfoil and MG-CFD is the multigrid algorithm. When I run mpirun -n $ntask ../bin/mgcfd_mpi -i input.dat on the ARM machine, no matter how many tasks I set, the wall-clock time is the same as in sequential mode, so I have to check the MG-CFD-OP2 code. The MPI version of MG-CFD works fine on my x86 machine.

lj-cug commented 2 years ago

The most time-consuming kernels are compute_flux_edge_kernel and also the simple kernel get_min_dt_kernel (mgcfd_mpi, node=4, N=32).

Qiyu8 commented 2 years ago

@reguly I checked all of the ranks; they are running on different cores.

reguly commented 2 years ago

Looking at the performance summary, note how "MPI time" is very large for a few loops, while "time" is only a bit larger. This means that these loops spend most of their time waiting on MPI communications and very little time computing. So the issue is with MPI, or with how it is configured. I suspect that if you look at the Airfoil summary, it will show similar (maybe not as extreme) numbers.

lj-cug commented 2 years ago

After trying several run options, I found that the MPI version does speed up when I use "mpirun -np 32 ../bin/mgcfd_mpi -i input.dat -m parmetis -r kway". The domain-decomposition algorithm in ParMETIS/PT-Scotch affects MPI efficiency on the ARM machine: only the kway option gives a speedup, while the other two options, geom (the default) and geomkway, do not speed up the calculation on the ARM machine. All three options speed up the MPI app on the x86 machine.
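For context, these -m/-r options presumably map onto the partitioner library and routine passed to OP2's op_partition call, along the lines of the hedged sketch below (the set/map/dat names are placeholders, not the actual MG-CFD identifiers):

```cpp
// Hedged sketch: selecting the OP2 partitioner routine from a runtime flag,
// mirroring what MG-CFD's -m/-r options appear to do. "cells", "edge2cell"
// and "p_x" are placeholder names.
#include <string>
#include "op_seq.h"

void partition_mesh(const std::string &routine,
                    op_set cells, op_map edge2cell, op_dat p_x) {
  if (routine == "geom")
    op_partition("PARMETIS", "GEOM", cells, edge2cell, p_x);      // geometric
  else if (routine == "geomkway")
    op_partition("PARMETIS", "GEOMKWAY", cells, edge2cell, p_x);  // geometric + k-way
  else
    op_partition("PARMETIS", "KWAY", cells, edge2cell, p_x);      // pure k-way
}
```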

reguly commented 2 years ago

Interesting - I have seen ParMETIS + geom underperform, but it wasn't nearly this bad. What mesh are you using? Can you share timing results for kway? I will do a pull request to make kway the default for MG-CFD.

reguly commented 2 years ago

Using kway is now the default in MG-CFD: https://github.com/warwick-hpsc/MG-CFD-app-OP2/pull/45

lj-cug commented 2 years ago

I have used the Rotor 1M case to test the MG-CFD-OP2 app on a Huawei Kunpeng 920 ARM machine, and here is an initial report (max total runtime in seconds):

mgcfd_seq: 116.44
mgcfd_openmp: 6.89 (it seems the app only uses the 32 cores in one NUMA domain)
mgcfd_cuda: 4.23 (using one NVIDIA A100 GPU on the node)
mgcfd_mpi: 3.05 (-np 64 ../bin/mgcfd_mpi -i input.dat -m parmetis -r kway)
mgcfd_mpi: 109.60 (-np 64 ../bin/mgcfd_mpi -i input.dat -m parmetis -r geom)
mgcfd_mpi: 108.78 (-np 64 ../bin/mgcfd_mpi -i input.dat -m parmetis -r geomkway)
mgcfd_mpi_openmp: 158.42 (-np 1, OMP_NUM_THREADS=64)
mgcfd_mpi_openmp: 144.99 (-np 1, OMP_NUM_THREADS=32)
mgcfd_mpi_openmp: 3.28 (-np 64, OMP_NUM_THREADS=1; now using the 64 cores across the 2 NUMA domains)
mgcfd_mpi_openmp: 5.59 (-np 32, OMP_NUM_THREADS=1)

We can see that the domain-decomposition algorithm clearly affects MPI parallel efficiency, but we do not see this effect in the Airfoil case in OP2 with any of the three algorithms. The geom and geomkway decompositions cannot cope with the multigrid algorithm, even when I use only one mesh level in the Rotor 1M case. In mpi_openmp mode, we only obtain a speedup when we use multiple ranks with OMP_NUM_THREADS=1; it seems that the MPI run only uses the cores corresponding to the ranks on the Kunpeng 920 CPU. I also noticed the Mont-Blanc project report, in which many benchmarks were tested on a ThunderX2 ARM HPC cluster. The HPGMG app also uses a finite-volume multigrid algorithm, and its efficiency was poor compared with runs on another machine.

reguly commented 2 years ago

Interesting stuff! This is a dual-socket machine, right? For MPI+OpenMP, you should try using two processes and bind each to its own socket, e.g. with OpenMPI: OMP_NUM_THREADS=32 mpirun --bind-to socket -np 2 ../bin/mgcfd_mpi_openmp. Generally speaking, if you use a larger Rotor 37 mesh, these differences should shrink.
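To double-check the placement (a generic MPI+OpenMP sketch, not part of MG-CFD), each rank can report where its threads land; with --bind-to socket and two ranks, every thread of a rank should report cores from a single socket:

```cpp
// Generic MPI+OpenMP placement check (not part of MG-CFD): each OpenMP
// thread of every rank prints the core it is running on. Under
// "--bind-to socket -np 2" with OMP_NUM_THREADS=32, each rank's threads
// should report core IDs belonging to one socket only.
#include <mpi.h>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc
#include <cstdio>

int main(int argc, char **argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  #pragma omp parallel
  {
    std::printf("rank %d, thread %d -> core %d\n",
                rank, omp_get_thread_num(), sched_getcpu());
  }
  MPI_Finalize();
  return 0;
}
```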

lj-cug commented 2 years ago

Yes! We get a speedup when we run "mpirun -np 2 --bind-to socket -x OMP_NUM_THREADS=32 ../bin/mgcfd_mpi_openmp -i input.dat -m parmetis -r kway", but we fail to get a speedup when we use "--bind-to core" instead of "--bind-to socket". I should study how the domain decomposition affects MPI communication on the ARM machine when we use multigrid CFD. It's strange; there is no such problem with Airfoil, which uses a simple spatial discretisation scheme.