ExtremeFLOW / neko

/ᐠ. 。.ᐟ\ᵐᵉᵒʷˎˊ˗
https://neko.cfd/

Mpirun Error #1158

Open BeiNing-Z opened 4 months ago

BeiNing-Z commented 4 months ago

When I run the case with mpirun, the following error occurs:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

0 0x7f5655178860 in ???
1 0x7f5655177a05 in ???
2 0x7f5654e5ed8f in ???
3 0x7f5654ed46c0 in ???
4 0x7f5654c883be in ???
5 0x7f5654001538 in ???
6 0x7f5655470b82 in ???
7 0x4516ab in ???
8 0x449b1b in __device_math_MOD_device_glsum
    at math/bcknd/device/device_math.F90:1286
9 0x4877e5 in coef_generate_mass
    at sem/coef.f90:953
10 0x48e207 in __coefs_MOD_coef_init_all
    at sem/coef.f90:319
11 0x42e45b in __fluid_scheme_MOD_fluid_scheme_init_common
    at fluid/fluid_scheme.f90:205
12 0x42f8ae in __fluid_scheme_MOD_fluid_scheme_init_all
    at fluid/fluid_scheme.f90:362
13 0x4f204a in __fluid_pnpn_MOD_fluid_pnpn_init
    at fluid/fluid_pnpn.f90:129
14 0x4df232 in __case_MOD_case_init
    at /home/neko/GPU/neko-0.6.1/src/case.f90:185
15 0x44e3f2 in __neko_MOD_neko_init
    at /home/neko/GPU/neko-0.6.1/src/neko.f90:207
16 0x408636 in usrneko
    at /home/neko/GPU/tgv_Re1600/usr_driver.f90:7
17 0x40497c in main
    at /home/neko/GPU/tgv_Re1600/usr_driver.f90:2

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node node4 exited on signal 11 (Segmentation fault).

My mpirun version is mpirun (Open MPI) 4.1.6 and the Neko version is 0.6.1. Could you help me solve this?

timfelle commented 4 months ago

Hi @BeiNing-Z, could you provide a bit more information on what you are running?

BeiNing-Z commented 4 months ago

Hi, could you provide a bit more information on what you are running?

  • Which case file
  • Which mesh
  • How many MPI ranks
  • Are you using a device, and if so, which?

My case file is:

&NEKO_CASE
mesh_file= '32768.nmsh'
fluid_scheme='pnpn'
lx = 8
source_term = 'noforce'
initial_condition = 'user'
/
&NEKO_PARAMETERS
dt = 1d-3
T_end = 2
nsamples = 1
write_at_end = .false.
output_chkp = .false.
uinf= 0.0,0.0,0.0
Re = 1600
pc_vel = 'jacobi'
pc_prs = 'hsmg'
proj_prs_dim = 8
dealias = .true.
/

and I am running with 32 MPI ranks.

timfelle commented 4 months ago

Is Neko compiled with GPU support? The backtrace suggests the code fails while attempting to access a GPU.

BeiNing-Z commented 4 months ago

Is Neko compiled with GPU support? It seems the code is attempting to access a GPU and failing.

I followed the approach described in https://github.com/ExtremeFLOW/neko/discussions/540#discussion-4113611 and configured Neko as follows:

./configure --prefix=/home/neko/GPU/neko-install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/cuda/ CUDA_CFLAGS=-O3 CUDA_ARCH=-arch=sm_70 NVCC=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc --enable-device-mpi

timfelle commented 4 months ago

Alright, so the problem is that you are trying to run the code with 32 GPUs. Try adjusting the number of MPI ranks to match the number of GPUs available on your system.
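As a rough sketch of how to match the rank count to the GPU count, assuming an NVIDIA system where `nvidia-smi` is on the PATH: the snippet below counts the devices listed by `nvidia-smi -L` and prints a launch line with that many ranks. The helper name `count_gpus`, the `./neko` executable name, and the `<case file>` placeholder are illustrative assumptions, not taken from the report above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GPUs from `nvidia-smi -L`-style output on stdin,
# which lists one "GPU <index>: ..." line per visible device.
count_gpus() {
    # grep -c prints 0 and returns non-zero when nothing matches; keep going.
    grep -c '^GPU [0-9]' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
    ngpus=$(nvidia-smi -L | count_gpus)
    echo "Visible GPUs on this node: ${ngpus}"
    # Only print the launch line; executable and case file are placeholders.
    echo "Suggested launch: mpirun -np ${ngpus} ./neko <case file>"
else
    echo "nvidia-smi not found; cannot count GPUs on this node"
fi
```

On a multi-node job the count applies per node, so the total number of ranks would be the per-node GPU count times the number of nodes.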

njansson commented 4 months ago

Did the issue go away after adjusting the number of MPI ranks?

Otherwise, another thing to check is MPI. Since you built Neko for device-aware MPI, you might, depending on your MPI provider, have to set an environment variable.
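A minimal sketch of how the MPI side could be checked, assuming Open MPI (the provider reported above): `ompi_info` reports whether the library was built with CUDA support through its `mpi_built_with_cuda_support` parameter. The helper name `ompi_cuda_support` is made up for illustration, and the variables in the trailing comments apply to other MPI providers, not necessarily to this setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ompi_info --parsable --all` output on stdin and
# report whether Open MPI says it was built with CUDA support.
ompi_cuda_support() {
    if grep -q 'mpi_built_with_cuda_support:value:true'; then
        echo "yes"
    else
        echo "no"
    fi
}

if command -v ompi_info >/dev/null 2>&1; then
    echo "Open MPI built with CUDA support: $(ompi_info --parsable --all | ompi_cuda_support)"
else
    echo "ompi_info not found; this check only applies to Open MPI"
fi

# Some other providers enable GPU-aware transfers through an environment
# variable set before launching, for example:
#   export MPICH_GPU_SUPPORT_ENABLED=1   # Cray MPICH
#   export MV2_USE_CUDA=1                # MVAPICH2
```

If the check prints "no", a build configured with `--enable-device-mpi` would be handing device pointers to an MPI library that cannot handle them, which is one possible cause of a segmentation fault inside an MPI call.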

timofeymukha commented 2 months ago

@BeiNing-Z what is the status on this one?