NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0

MPT/2.25 error when using PIO_NETCDF4P format. #1955

Open jedwards4b opened 1 year ago

jedwards4b commented 1 year ago

This is a bug in the MPI library mpt/2.25.

Using mpt/2.25, intel/19.1.1, and hdf5 1.12.2, with netcdf-c built from https://github.com/jedwards4b/netcdf-c/tree/jedwards/add_udf2, which is a recent branch of netcdf-c main.

I am doing a parallel write of a netcdf4-format file with 1 task and running into a write error in the HDF5 layer. Debugging shows that the error is raised at line 1620 of H5FDmpio.c. The code there, which confirms the number of bytes written, is purely diagnostic, so I was able to comment out lines 1600 to 1620. With that section disabled I have confirmed that the file is written correctly, so it is the diagnostic that generates the error which is incorrect, not the write.

I've confirmed that the bug is in the MPI layer by building and running with openmpi/4.1.4, where the same write succeeds.

When using mpt, the third call to MPI_Get_elements_x returns a count of 9. I added a print statement at line 1620.

For the successful openmpi case I see:
39: bytes_written 6270 io_size 6270
39: bytes_written 18874368 io_size 18874368
39: bytes_written 18874368 io_size 18874368
39: bytes_written 3548 io_size 3548
39: bytes_written 48 io_size 48

For the failing mpt case I see:
39: bytes_written 6270 io_size 6270
39: bytes_written 18874368 io_size 18874368
39: bytes_written 9 io_size 18874368
39: ERROR: 0 NetCDF: HDF error err_num = -101 fname = /glade/u/home/jedwards/sandboxes/pio/src/clib/pio_darray_int.c line = 463
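
For anyone comparing longer runs, here is a minimal sketch of a script that scans this kind of debug output for writes where the reported count differs from the requested size. It assumes the output keeps the `bytes_written N io_size M` shape shown above. The function name `find_short_writes` and the sample string are made up for illustration and are not part of PIO or HDF5, and the leading `39:` is carried through as an opaque prefix without being interpreted.

```python
import re

# Matches lines such as "39: bytes_written 9 io_size 18874368".
# The format is assumed from the debug output quoted in this issue.
LINE_RE = re.compile(r"^\s*(\d+):\s+bytes_written\s+(\d+)\s+io_size\s+(\d+)\s*$")


def find_short_writes(log_text):
    """Return (prefix, bytes_written, io_size) for every mismatched line."""
    mismatches = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip ERROR lines and anything else that does not match
        prefix, written, requested = (int(g) for g in m.groups())
        if written != requested:
            mismatches.append((prefix, written, requested))
    return mismatches


# Sample input copied from the failing mpt case above.
MPT_LOG = """\
39: bytes_written 6270 io_size 6270
39: bytes_written 18874368 io_size 18874368
39: bytes_written 9 io_size 18874368
"""

if __name__ == "__main__":
    for prefix, written, requested in find_short_writes(MPT_LOG):
        print(f"{prefix}: reported {written} of {requested} bytes")
```

Run on the mpt output, this flags only the third write. The openmpi output produces no matches.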