E3SM-Project / scorpio

A high-level Parallel I/O Library for structured grid applications

Potential hanging issue with ncmpi_enddef() when opening large files for writing on Lustre file systems #566

Closed dqwu closed 4 months ago

dqwu commented 4 months ago

This problem was initially identified on Frontier with the SCREAM eamxx ne1024 F case (not reproducible with ne256 F case).

During the initial run, a history file larger than 400 GB is written through SCORPIO's PnetCDF I/O type, with a Lustre stripe count of 64 and a stripe size of 16 MB. In addition, 10 KB of extra space is reserved in the file header, as shown below:

File header:
    Size: 13660 bytes
    Extent: 23900 bytes

In subsequent restart runs, when this file is opened in write mode to add an attribute, the run hangs inside the PnetCDF call ncmpi_enddef() made by SCORPIO.

Notably, the hanging problem appears unrelated to the expected growth of the file header:

When a simple PnetCDF test program opens this file to add new metadata, the hang does not occur if the file has been copied to a run directory with the default Lustre Progressive File Layout (PFL). The hang recurs, however, when the file is copied to a run directory whose stripe size was set explicitly with the "lfs setstripe" command.

A joint investigation with PnetCDF developer @wkliao revealed that the hang arises from unexpected file header growth related to the PnetCDF function ncmpi__enddef().

The ncmpi__enddef() function takes an open NetCDF file out of define mode. Its v_align argument controls the alignment of the starting file offset of the first variable, and that offset is the file header extent.

The typical call of ncmpi_enddef(ncid) is equivalent to ncmpi__enddef(ncid, 0, 0, 0, 0).

In the absence of explicit user hints on "nc_header_align_size," PnetCDF implicitly sets it based on the following hierarchy:

When creating the history file, SCORPIO calls ncmpi__enddef(ncid, 10240, 4, 0, 4). The second argument (h_minfree) reserves 10 KB of free header space, and the third (v_align) is SCORPIO's default of 4 bytes. In this scenario PnetCDF sets nc_header_align_size to 4 bytes, so the resulting header extent is relatively small: it only has to be a multiple of 4.
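The header numbers reported earlier in this issue are consistent with simple round-up arithmetic. The sketch below is a simplified model and not PnetCDF source code: it assumes the extent is the header size plus the reserved free space, rounded up to v_align. The helper names round_up and header_extent are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the nearest multiple of align (align must be positive). */
int64_t round_up(int64_t x, int64_t align)
{
    assert(align > 0);
    return ((x + align - 1) / align) * align;
}

/* Assumed model: extent = (header size + reserved free space),
 * rounded up to the requested alignment. */
int64_t header_extent(int64_t header_size, int64_t h_minfree, int64_t v_align)
{
    return round_up(header_size + h_minfree, v_align);
}
```

With the values from this issue, header_extent(13660, 10240, 4) gives 23900, which matches the reported extent (13660 + 10240 = 23900, already a multiple of 4).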

When opening the history file to add an attribute, SCORPIO calls ncmpi_enddef(ncid), which internally invokes ncmpi__enddef(ncid, 0, 0, 0, 0). Because v_align is 0, PnetCDF sets nc_header_align_size to the file striping size (16 MB), since the total size of all fixed-size variables exceeds 900 MB. PnetCDF then attempts to set the new file header extent to 16 MB, the smallest multiple of the striping size larger than the file header size. To make room for the larger header, all fixed-size and record variables (over 400 GB in this case) must be shifted to higher file offsets. This is slow enough that the run appears to hang.
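To put numbers on the shift, here is a small sketch under the same simplified assumption (the new extent is the header size rounded up to the alignment, and the data only moves if that exceeds the existing extent). The names align_up and data_shift are hypothetical helpers, not PnetCDF functions.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest multiple of align that is >= size (align must be positive). */
int64_t align_up(int64_t size, int64_t align)
{
    assert(align > 0);
    return ((size + align - 1) / align) * align;
}

/* Number of bytes by which every variable would move when the header
 * is re-aligned. Returns 0 if the existing extent is already large
 * enough, i.e. no data needs to move. */
int64_t data_shift(int64_t old_extent, int64_t header_size, int64_t align)
{
    int64_t new_extent = align_up(header_size, align);
    return new_extent > old_extent ? new_extent - old_extent : 0;
}
```

For the file in this issue, data_shift(23900, 13660, 16 * 1024 * 1024) is 16753316 bytes, so more than 400 GB of variable data would have to be rewritten roughly 16 MB further into the file. With an alignment of 4 the same call returns 0, because the existing 23900-byte extent already covers the header.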

Two exceptions when PnetCDF avoids using the striping size (16 MB) as nc_header_align_size during attribute addition to the history file:

[Fixes on PnetCDF side] Starting from version 1.13.0, the automatic use of the file striping unit for header alignment will be removed entirely, because choosing the header extent on the user's behalf proved overly aggressive. Users can still employ the hint nc_header_align_size or the v_align argument passed to ncmpi__enddef() to specify a desired header alignment.

[Workaround on SCORPIO side] To address this issue for PnetCDF versions prior to 1.13.0, SCORPIO can substitute the typical call of ncmpi_enddef(ncid) with ncmpi__enddef(ncid, 0, 4, 0, 4), explicitly setting v_align to 4 bytes. This effectively prevents PnetCDF from selecting the file striping size for the unspecified nc_header_align_size.
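One way to picture the workaround is as a choice of ncmpi__enddef() arguments based on the PnetCDF version. The sketch below is illustrative only: the enddef_args struct and reopen_enddef_args() are hypothetical and not part of SCORPIO or PnetCDF, and gating on the version number is an assumption rather than something this issue prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* The four tuning arguments of ncmpi__enddef(), in call order. */
typedef struct {
    int64_t h_minfree;
    int64_t v_align;
    int64_t v_minfree;
    int64_t r_align;
} enddef_args;

/* Hypothetical helper: pick arguments for leaving define mode on an
 * existing file, given the PnetCDF version in use. */
enddef_args reopen_enddef_args(int major, int minor)
{
    assert(major >= 0 && minor >= 0);

    /* All zeros is what plain ncmpi_enddef(ncid) passes. */
    enddef_args a = {0, 0, 0, 0};

    /* Before 1.13.0, request 4-byte alignment explicitly so that
     * PnetCDF does not fall back to the file striping size. */
    if (major < 1 || (major == 1 && minor < 13)) {
        a.v_align = 4;
        a.r_align = 4;
    }
    return a;
}
```

In real code the result would be passed on as ncmpi__enddef(ncid, a.h_minfree, a.v_align, a.v_minfree, a.r_align). For version 1.12 this reproduces the (0, 4, 0, 4) arguments described above.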

dqwu commented 4 months ago

A related hanging issue was reported by @brhillman, who is seeing a hang when initializing certain output streams with EAMxx at high resolution (ne1024) on Frontier. The initialization appears to hang at the following point:

#0  0x00007fff3503363b in cxip_evtq_progress (evtq=evtq@entry=0x1cb81e18) at prov/cxi/src/cxip_evtq.c:395
#1  0x00007fff35005059 in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:184
#2  0x00007fff3500a7f9 in cxip_util_cq_progress (util_cq=0x1c569a10) at prov/cxi/src/cxip_cq.c:112
#3  0x00007fff34fe6191 in ofi_cq_readfrom (cq_fid=0x1c569a10, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#4  0x00007fffe9f6d587 in MPIR_Wait_impl.part.0 () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#5  0x00007fffead641f6 in MPIC_Wait () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#6  0x00007fffead6a8fe in MPIC_Recv () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#7  0x00007fffeac85746 in MPIR_Gather_intra_binomial () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#8  0x00007fffe936af77 in MPIR_Gather () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#9  0x00007fffeaf6b629 in MPIR_CRAY_Allgather () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#10 0x00007fffe9100cf7 in PMPI_Allgather () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#11 0x00007fffeb7726f1 in ADIOI_CRAY_ReadStridedColl () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#12 0x00007fffeb742cd0 in MPIOI_File_read_all () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#13 0x00007fffeb7447d1 in PMPI_File_read_at_all () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#14 0x00007fffed4507f3 in move_file_block () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#15 0x00007fffed450403 in move_record_vars () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#16 0x00007fffed44fd6f in ncmpio.enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#17 0x00007fffed393f43 in ncmpi_enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#18 0x0000000002d496b8 in pioc_change_def ()
#19 0x0000000002d66b38 in PIOc_enddef ()
#20 0x0000000001d5e512 in eam_pio_enddef$scream_scorpio_interface_ ()
#21 0x0000000001d86346 in eam_pio_enddef_c2f ()
#22 0x0000000001d7df20 in scream::scorpio::eam_pio_enddef(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#23 0x0000000001d906f3 in scream::OutputManager::setup_file(scream::IOFileSpecs&, scream::IOControl const&) ()
#24 0x0000000001d8a0e4 in scream::OutputManager::setup(ekat::Comm const&, ekat::ParameterList const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<scream::FieldManager>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<scream::FieldManager> > > > const&, std::shared_ptr<scream::GridsManager const> const&, scream::util::TimeStamp const&, scream::util::TimeStamp const&, bool) ()
#25 0x0000000001ad0280 in scream::control::AtmosphereDriver::initialize_output_managers() ()
#26 0x0000000000647814 in scream_init_atm ()
#27 0x000000000064216c in atm_init_mct$atm_comp_mct_ ()
#28 0x00000000004625cb in component_init_cc$component_mod_ ()
#29 0x000000000042f1ee in cime_init$cime_comp_mod_ ()
#30 0x00000000004600e3 in main ()