BeiNing-Z opened this issue 4 months ago
Hi @BeiNing-Z, could you provide a bit more information on what you are running.
- which case file
- which mesh
- how many MPI ranks
- are you running on a device (GPU), and if so, which one?
My case file is:
```
&NEKO_CASE
  mesh_file = '32768.nmsh'
  fluid_scheme = 'pnpn'
  lx = 8
  source_term = 'noforce'
  initial_condition = 'user'
/
&NEKO_PARAMETERS
  dt = 1d-3
  T_end = 2
  nsamples = 1
  write_at_end = .false.
  output_chkp = .false.
  uinf = 0.0, 0.0, 0.0
  Re = 1600
  pc_vel = 'jacobi'
  pc_prs = 'hsmg'
  proj_prs_dim = 8
  dealias = .true.
/
```
and 32 MPI ranks.
Is Neko compiled with support for a GPU? Because it seems the code is attempting to access a GPU where it fails.
I followed the approach in https://github.com/ExtremeFLOW/neko/discussions/540#discussion-4113611 and configured as follows:

```shell
./configure --prefix=/home/neko/GPU/neko-install \
    --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/cuda/ \
    CUDA_CFLAGS=-O3 \
    CUDA_ARCH=-arch=sm_70 \
    NVCC=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/compilers/bin/nvcc \
    --enable-device-mpi
```
Alright, so the problem is that you are trying to run the code with 32 GPUs. Try adjusting the number of MPI ranks to match the number of GPUs available on your system.
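To make the "one rank per GPU" advice concrete, here is a minimal sketch that counts the visible NVIDIA GPUs and prints the corresponding launch command. The binary name `./neko` and the case file `tgv.case` are placeholders for your own executable and case file, and the script only echoes the command rather than running it:

```shell
# Count visible NVIDIA GPUs; fall back to 1 if nvidia-smi is unavailable.
NGPUS=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
[ "$NGPUS" -gt 0 ] || NGPUS=1

# Print the launch command instead of executing it (sketch only);
# the binary and case file names are hypothetical.
echo "mpirun -np $NGPUS ./neko tgv.case"
```

On a node with, say, 4 GPUs this would suggest `mpirun -np 4`, rather than oversubscribing with 32 ranks.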
Did the issue go away when using more MPI ranks?
Otherwise, another thing to check is MPI. Since you built Neko with device-aware MPI, you might, depending on your MPI provider, have to set an environment variable.
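As a hedged example for Open MPI (the provider used here): any MCA parameter can be set through an `OMPI_MCA_`-prefixed environment variable, and `opal_cuda_support` is the parameter controlling CUDA-aware support in the 4.x series. Whether you actually need to set it depends on how your Open MPI was built (with UCX it is usually handled automatically), so treat this as a sketch rather than a required setting:

```shell
# Example: enable CUDA-aware support in Open MPI 4.x at runtime.
# MCA parameters can be set via OMPI_MCA_-prefixed environment variables;
# whether this is needed depends on how Open MPI was built.
export OMPI_MCA_opal_cuda_support=1
echo "OMPI_MCA_opal_cuda_support=$OMPI_MCA_opal_cuda_support"
```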
@BeiNing-Z what is the status on this one?
When I mpirun the case, the following error occurs:
```
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f5655178860 in ???
#1  0x7f5655177a05 in ???
#2  0x7f5654e5ed8f in ???
#3  0x7f5654ed46c0 in ???
#4  0x7f5654c883be in ???
#5  0x7f5654001538 in ???
#6  0x7f5655470b82 in ???
#7  0x4516ab in ???
#8  0x449b1b in __device_math_MOD_device_glsum
#9  0x4877e5 in coef_generate_mass
#10 0x48e207 in __coefs_MOD_coef_init_all
#11 0x42e45b in __fluid_scheme_MOD_fluid_scheme_init_common
#12 0x42f8ae in __fluid_scheme_MOD_fluid_scheme_init_all
#13 0x4f204a in __fluid_pnpn_MOD_fluid_pnpn_init
#14 0x4df232 in __case_MOD_case_init
#15 0x44e3f2 in __neko_MOD_neko_init
#16 0x408636 in usrneko
#17 0x40497c in main

Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node node4 exited on
signal 11 (Segmentation fault).
```
My mpirun version is mpirun (Open MPI) 4.1.6 and the Neko version is 0.6.1. Could you help me solve it?