Parallel-NetCDF / PnetCDF

Source code repository of PnetCDF library and utilities
https://parallel-netcdf.github.io

ncmpi_create Stalls When Using High MPI Rank Counts #142

Open Masterwater-y opened 2 months ago

Masterwater-y commented 2 months ago

I'm encountering an issue where the ncmpi_create function appears to stall when running my application with a high number of MPI processes. Specifically, the program hangs at the ncmpi_create call when attempting to create a new NetCDF file.

My PnetCDF version is 1.12.1; the build flags are below:

grep "CFLAGS" /home/yhl/green_suite/install/files/pnetcdf-1.12.1/Makefile

CFLAGS = -g -O2 -fPIC
CONFIGURE_ARGS_CLEAN = --prefix=/home/cluster-opt/pnetcdf --enable-shared --enable-fortran --enable-large-file-test CFLAGS="-g -O2 -fPIC" CXXFLAGS="-g -O2 -fPIC" FFLAGS="-g -fPIC" FCFLAGS="-g -fPIC" F90LDFLAGS="-fPIC" FLDFLAGS="-fPIC" LDFLAGS="-fPIC"
FCFLAGS = -g -fPIC
FCFLAGS_F = 
FCFLAGS_F90 = 
FCFLAGS_f = 
FCFLAGS_f90 =

I executed the command below and it stalls at ncmpi_create. There are 4 nodes, each with 96 cores:

mpirun -n 384 -hosts controller1,compute1,compute2storage,compute3storage ./test ./output.nc

If I reduce the number of ranks, e.g. mpirun -n 256, it works. I want to know what might be causing this: a network bottleneck, a disk bottleneck, or some OS setting?

My code

#include <stdlib.h>
#include <mpi.h>
#include <pnetcdf.h>
#include <stdio.h>

static void handle_error(int status, int lineno)
{
    fprintf(stderr, "Error at line %d: %s\n", lineno, ncmpi_strerror(status));
    MPI_Abort(MPI_COMM_WORLD, 1);
}

int main(int argc, char **argv) {

    int ret, ncfile, nprocs, rank, dimid1, dimid2, varid1, varid2, ndims;
    MPI_Offset start, count=1;
    int t, i;
    int v1_dimid[2];
    MPI_Offset v1_start[2], v1_count[2];
    int v1_data[4];
    char buf[13] = "Hello World\n";
    int data;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (argc != 2) {
        if (rank == 0) printf("Usage: %s filename\n", argv[0]);
        MPI_Finalize();
        exit(-1);
    }

    ret = ncmpi_create(MPI_COMM_WORLD, argv[1],
                       NC_CLOBBER, MPI_INFO_NULL, &ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_def_dim(ncfile, "d1", nprocs, &dimid1);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_def_dim(ncfile, "time", NC_UNLIMITED, &dimid2);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    v1_dimid[0] = dimid2;
    v1_dimid[1] = dimid1;
    ndims = 2;

    ret = ncmpi_def_var(ncfile, "v1", NC_INT, ndims, v1_dimid, &varid1);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ndims = 1;

    ret = ncmpi_def_var(ncfile, "v2", NC_INT, ndims, &dimid1, &varid2);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    ret = ncmpi_put_att_text(ncfile, NC_GLOBAL, "string", 13, buf);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    /* all processors defined the dimensions, attributes, and variables,
     * but here in ncmpi_enddef is the one place where metadata I/O
     * happens.  Behind the scenes, rank 0 takes the information and writes
     * the netcdf header.  All processes communicate to ensure they have
     * the same (cached) view of the dataset */

    ret = ncmpi_enddef(ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    start=rank, count=1, data=rank;

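    /* collective write: each rank writes its own value (data = rank)
     * into element "rank" of the fixed-size 1D variable v2 */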
    ret = ncmpi_put_vara_int_all(ncfile, varid2, &start, &count, &data);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    for (t = 0; t<2; t++){

        v1_start[0] = t, v1_start[1] = rank;
        v1_count[0] = 1, v1_count[1] = 1;
        for (i = 0; i<4; i++){
            v1_data[i] = rank+t;
        }

        /* each process writes a single value (rank+t) into element (t, rank)
         * of the 2D record variable v1; only v1_data[0] is used because
         * v1_count is {1, 1} */
        ret = ncmpi_put_vara_int_all(ncfile, varid1, v1_start, v1_count, v1_data);
        if (ret != NC_NOERR) handle_error(ret, __LINE__);

    }

    ret = ncmpi_close(ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);

    MPI_Finalize();

    return 0;
}
wkliao commented 2 months ago

ncmpi_create internally calls MPI_File_open. Can you try this small MPI-IO program to see whether the hang is caused by MPI-IO or by PnetCDF? https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c
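For reference, a minimal sketch of such a test (an illustration only, not the exact contents of the linked mpi_file_open.c): every rank collectively creates and opens a file with MPI_File_open, then closes it. If this also hangs at 384 ranks, the problem is in the MPI-IO layer rather than in PnetCDF.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int err, rank;
    MPI_File fh;
    char *filename = (argc > 1) ? argv[1] : "testfile";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* collective create + open, similar to what ncmpi_create does internally */
    err = MPI_File_open(MPI_COMM_WORLD, filename,
                        MPI_MODE_CREATE | MPI_MODE_RDWR,
                        MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: MPI_File_open failed: %s\n", rank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_File_close(&fh);
    if (rank == 0) printf("MPI_File_open and MPI_File_close succeeded\n");

    MPI_Finalize();
    return 0;
}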

roblatham00 commented 2 months ago

I'd also like to know which MPI implementation/version and which file system. Definitely strange that things work ok with 256 processes but not 300+.

Masterwater-y commented 1 month ago

I'd also like to know which MPI implementation/version and which file system. Definitely strange that things work ok with 256 processes but not 300+.

It's mpich-4.1.2 @roblatham00

Masterwater-y commented 1 month ago

ncmpi_create internally calls MPI_File_open. Can you try this small MPI-IO program to see whether the hang is caused by MPI-IO or by PnetCDF? https://github.com/wkliao/mpi-io-examples/blob/master/mpi_file_open.c

I tried it and it still hangs, so maybe it is an MPI-IO problem @wkliao

wkliao commented 1 month ago

What file system are you writing to? Are the MPICH versions the same on all the hosts, i.e. controller1,compute1,compute2storage,compute3storage?

Can you try adding "ufs:" as a prefix to your output file name, i.e. ufs:./output.nc?
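For reference, with a ROMIO-based MPI-IO implementation (such as MPICH's), the "ufs:" prefix forces ROMIO's generic Unix file-system driver instead of the driver it auto-detects for that path. The prefix can also be hard-coded in the create call; a sketch, assuming the same output path as above:

    /* example only: force ROMIO's generic UFS driver via the file-name prefix */
    ret = ncmpi_create(MPI_COMM_WORLD, "ufs:./output.nc",
                       NC_CLOBBER, MPI_INFO_NULL, &ncfile);
    if (ret != NC_NOERR) handle_error(ret, __LINE__);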

Masterwater-y commented 1 month ago

@wkliao it works, thank you so much! The MPICH versions are the same on all the hosts.

Masterwater-y commented 3 weeks ago

@wkliao Unfortunately, the issue with ncmpi_create stalling while creating output.nc has reappeared. Strangely enough, when I run the program with mpirun -n 256 -hosts controller1,compute1,compute2storage,compute3storage ./test ufs:output.nc, everything works as expected. However, when I use a hostfile instead, mpirun -n 256 -f hostfile ./test ufs:output.nc, the problem arises.

The hostfile is:

controller1:64
compute1:64
compute2storage:64
compute3storage:64

wkliao commented 3 weeks ago

The problem may be the file system you are using. What file system are you using to store file 'output.nc'?