Nek5000 / nekRS

our next generation fast and scalable CFD code
https://nek5000.mcs.anl.gov/

Segmentation fault trying to run nekrs with mpi #464

Closed joneuhauser closed 2 years ago

joneuhauser commented 2 years ago

Describe the bug

Running the example program (turbPipePeriodic) with nekRS works only without MPI; launching it under mpirun crashes with a segmentation fault.

To Reproduce

```
mpirun -np 2 nekrs --setup turbPipe.par
```

The last output printed to the console is:

```
meshParallelGatherScatterSetup N=7
timing gs modes: 7.42e-05s 1.60e-04s 1.95e-04s 1.97e-04s
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 324416 RUNNING AT mypc
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
```

and attaching GDB tells me

Thread 1 "nekrs" received signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:436
Backtrace stopped: Cannot access memory at address 0x7ffd41510fb8
(gdb) 

The example works fine if I just run `nekrs --setup turbPipe.par`.

Expected behavior

The example also runs correctly under `mpirun`.

Version information:

```
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
mpich 4.0.2 built with GCC
nvcc: Cuda compilation tools, release 11.8, V11.8.89
nvidia driver: 520.61.05
nekrs: 8ee9c381 (current master)
cmake version 3.16.3
```

stgeke commented 2 years ago

I am afraid this is not a nekRS-specific issue.

stgeke commented 2 years ago

I think it crashes because no GPU-aware MPI is available. You can turn it off (e.g. `export NEKRS_GPU_MPI=0`), but this will introduce a performance regression.
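
A quick way to confirm whether the MPI library itself was built with CUDA support (independent of nekRS) is a tiny probe program. This is only a sketch: the `MPIX_CUDA_AWARE_SUPPORT` macro and `MPIX_Query_cuda_support()` call below are Open MPI extensions from `mpi-ext.h`; MPICH and MVAPICH2 expose different (or no) query interfaces, so a missing macro here does not by itself prove a build is not GPU-aware. The file name is illustrative.

```c
/* cuda_aware_probe.c - sketch: query Open MPI's CUDA-awareness.
 * Assumption: an Open MPI build; other MPI implementations use
 * different mechanisms and will hit the final #else branch. */
#include <mpi.h>
#include <stdio.h>

#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* provides MPIX_CUDA_AWARE_SUPPORT / MPIX_Query_cuda_support() */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    /* compiled with CUDA support; also ask the runtime */
    printf("compile-time CUDA-aware support: yes, runtime reports: %d\n",
           MPIX_Query_cuda_support());
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("compile-time CUDA-aware support: no\n");
#else
    printf("this MPI does not expose the Open MPI CUDA-awareness query\n");
#endif

    MPI_Finalize();
    return 0;
}
```

Compile with the same `mpicc` used for nekRS and run it with a single rank; if it reports no CUDA-aware support, `NEKRS_GPU_MPI=0` is the expected workaround.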

joneuhauser commented 2 years ago

Thanks for the hint! I got it working with CUDA-enabled MVAPICH2 2.3.7. With Open MPI 4.1.4 (also CUDA-enabled), nekRS doesn't seem to notice that there are multiple ranks; the output looks something like this:

Output:

```
:~/.local/nekrs/examples/turbPipePeriodic$ mpirun -np 2 nekrs --setup turbPipe.par

[nekRS ASCII banner]
v22.0.0 (8ee9c381)
COPYRIGHT (c) 2019-2022 UCHICAGO ARGONNE, LLC

MPI tasks: 1

[nekRS ASCII banner]
v22.0.0 (8ee9c381)
COPYRIGHT (c) 2019-2022 UCHICAGO ARGONNE, LLC

MPI tasks: 1

reading par file ...
reading par file ...
```

but this issue also exists with Nek5000 (even though a simple MPI hello world shows the correct number of ranks, and I haven't had this problem with other numerical codes), which is why I was using MPICH.
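
For reference, two banners each reporting `MPI tasks: 1` is often the symptom of launching a binary that was linked against one MPI implementation with another implementation's `mpirun`: each process then initializes as an independent singleton. A minimal rank check along the lines mentioned above can show whether the launcher and the library agree on the world size (the file name is illustrative, not part of nekRS):

```c
/* mpi_hello.c - minimal rank/size check */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* With "mpirun -np 2" every rank should report size 2; if each rank
     * reports size 1, the launcher and the MPI library do not match. */
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Building this with the same MPI wrapper that built nekRS (MPICH's or Open MPI's `mpicc`) and launching it with the matching `mpirun -np 2` makes it easy to see which pairing actually spawns two ranks.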