Open Masterwater-y opened 4 months ago
`ncmpi_create` internally calls `MPI_File_open`.

Can you try this small MPI-IO program to see whether the hanging is because of MPI-IO or PnetCDF?
https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c
I'd also like to know which MPI implementation/version and which file system. Definitely strange that things work ok with 256 processes but not 300+.
It's mpich-4.1.2 @roblatham00
> `ncmpi_create` internally calls `MPI_File_open`. Can you try this small MPI-IO program to see whether the hanging is because of MPI-IO or PnetCDF? https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c
I tried it and it still hangs, so maybe it is because of MPI-IO. @wkliao
What file system are you writing to? Are the MPICH versions the same on all the hosts, i.e. controller1,compute1,compute2storage,compute3storage?
Can you try adding `ufs:` as a prefix to your output file name, i.e. `ufs:./output.nc`?
@wkliao it works, thank you so much! The MPICH versions are the same on all the hosts.
@wkliao Unfortunately, the issue with the stuck creation of output.nc has reappeared. Strangely enough, when I use

    mpirun -n 256 -hosts controller1,compute1,compute2storage,compute3storage ./test ufs:output.nc

to run the program, everything works as expected. However, when I try to use a hostfile instead,

    mpirun -n 256 -f hostfile ./test ufs:output.nc

the problem arises. The hostfile looks like this:

    controller1:64
    compute1:64
    compute2storage:64
    compute3storage:64
The problem may be the file system you are using. What file system are you using to store file 'output.nc'?
I'm encountering an issue where the ncmpi_create function appears to stall when running my application with a high number of MPI processes. Specifically, the program hangs at the ncmpi_create call when attempting to create a new NetCDF file.
My PnetCDF version is 1.12.1, FLAGS as below.

I executed the command below and it stalls at ncmpi_create. There are 4 nodes and each node has 96 cores:

    mpirun -n 384 -hosts controller1,compute1,compute2storage,compute3storage ./test ./output.nc

If I reduce the number of ranks, e.g. mpirun -n 256, it works. I want to know what might be causing this: a network bottleneck, a disk bottleneck, or an OS setting.
My code