NOAA-GFDL / FMS

GFDL's Flexible Modeling System

Fms.parallel startup #1477

Closed dkokron closed 3 months ago

dkokron commented 4 months ago

This PR is a replacement for https://github.com/NOAA-GFDL/FMS/pull/1405 which I closed accidentally.

NetCDF-4, using the HDF5 file layout, has the ability to do parallel I/O in two different modes. The first mode is referred to as “independent”, while the second is referred to as “collective”. The collective mode has been tested with a few NOAA workloads and shown to provide substantial improvement in job startup time while reducing negative impact on the underlying Lustre file system.
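For readers unfamiliar with the two access modes, here is a minimal sketch of a collective read using the netCDF-Fortran parallel API directly. It is not the code in this PR. The file name `restart.nc` and the variable name `T` are placeholders, the variable is assumed to be one-dimensional, and the library is assumed to be a netCDF-Fortran build with parallel HDF5 support.

```fortran
! Sketch only: every rank reads one whole 1-D variable using collective access.
! "restart.nc" and "T" are placeholder names, not taken from FMS.
program collective_read_sketch
  use mpi
  use netcdf
  implicit none
  integer :: ierr, ncid, varid, nlen
  integer :: dimids(1)
  real, allocatable :: buf(:)

  call MPI_Init(ierr)

  ! Open the file for parallel access on all ranks of the communicator.
  call check(nf90_open("restart.nc", NF90_NOWRITE, ncid, &
                       comm=MPI_COMM_WORLD, info=MPI_INFO_NULL))
  call check(nf90_inq_varid(ncid, "T", varid))

  ! Variables default to independent access; request collective access instead.
  call check(nf90_var_par_access(ncid, varid, NF90_COLLECTIVE))

  ! Look up the length of the (assumed single) dimension and size the buffer.
  call check(nf90_inquire_variable(ncid, varid, dimids=dimids))
  call check(nf90_inquire_dimension(ncid, dimids(1), len=nlen))
  allocate(buf(nlen))

  ! In collective mode every rank in the communicator must make this call.
  call check(nf90_get_var(ncid, varid, buf))

  call check(nf90_close(ncid))
  call MPI_Finalize(ierr)

contains

  ! Abort the whole job if a netCDF call reports an error.
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check

end program collective_read_sketch
```

Replacing `NF90_COLLECTIVE` with `NF90_INDEPENDENT` gives the independent mode, in which each rank issues its reads without coordinating with the others.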

This PR does not address parallel I/O via pNetCDF.

This PR adds an option to enable collective reads, which the user controls via settings in input.nml. The default behavior is unchanged; collective reads are used only when the user activates them there.

Fixes #1322 https://github.com/NOAA-GFDL/FMS/issues/1322

How Has This Been Tested?

I have run an RRFS (regional) case on WCOSS2 with and without collective reads activated. The resulting binary restart files are zero-diff. I have not yet run a regional HAFS case or the UFS model on a full cube.

The compile time environment used to compile FMS was:

Currently Loaded Modules:
1) craype-x86-rome (H)
2) libfabric/1.11.0.0. (H)
3) craype-network-ofi (H)
4) envvar/1.0
5) PrgEnv-intel/8.3.3
6) cmake/3.20.2
7) intel/19.1.3.304
8) craype/2.7.17
9) cray-mpich/8.1.19
10) hpc/1.2.0
11) hpc-intel/19.1.3.304
12) hpc-cray-mpich/8.1.19
13) hdf5/1.14.1
14) netcdf/4.9.2

dkokron commented 4 months ago

Further testing of this proposal depends on availability of Acorn which is scheduled to be in dedicated time for about 2 weeks starting today.

dkokron commented 4 months ago

Will do.

On Mon, Mar 11, 2024, 12:50 PM Rusty Benson @.***> wrote:

@.**** commented on this pull request.

In fms2_io/netcdf_io.F90 https://github.com/NOAA-GFDL/FMS/pull/1477#discussion_r1520164724:

  + integer :: TileComm=MPI_COMM_NULL !< MPI communicator used for collective reads.
  + !! To be replaced with a real communicator at user request

You could define a variable MPP_COMM_NULL in mpp.F90, as was done for MPP_INFO_NULL https://github.com/NOAA-GFDL/FMS/blob/main/mpp/mpp.F90#L1328-L1335, and make it public https://github.com/NOAA-GFDL/FMS/blob/main/mpp/mpp.F90#L199. This way we can keep the MPI layer confined to mpp.
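A minimal sketch of what this suggestion could look like, assuming an MPI build. The module and type names (`mpp_sketch_mod`, `netcdf_io_sketch_mod`, `file_sketch_t`) are hypothetical stand-ins, not the actual FMS modules, and the real mpp.F90 may guard the definition differently.

```fortran
! Hypothetical stand-in for the relevant part of mpp.F90.
module mpp_sketch_mod
  use mpi, only: MPI_COMM_NULL
  implicit none
  private
  !> Null communicator re-exported so callers need not use the mpi module directly.
  integer, parameter, public :: MPP_COMM_NULL = MPI_COMM_NULL
end module mpp_sketch_mod

! Hypothetical stand-in for the relevant part of fms2_io/netcdf_io.F90.
module netcdf_io_sketch_mod
  use mpp_sketch_mod, only: MPP_COMM_NULL
  implicit none
  private
  type, public :: file_sketch_t
    integer :: TileComm = MPP_COMM_NULL !< MPI communicator used for collective reads.
  end type file_sketch_t
end module netcdf_io_sketch_mod
```

With this arrangement only the mpp-side module names the MPI library, and the I/O-side module depends on the re-exported constant alone.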

In fms2_io/netcdf_io.F90 https://github.com/NOAA-GFDL/FMS/pull/1477#discussion_r1520165019:

@@ -32,6 +32,7 @@ module netcdf_io_mod
 use mpp_mod
 use fms_io_utils_mod
 use platform_mod
+use mpi, only: MPI_COMM_NULL

Remove pursuant to comment regarding MPI_COMM_NULL below.

rem1776 commented 4 months ago

@dkokron Just wanted to give you a heads up: it looks like this failed the linter CI check for our style guidelines. One line just needs to be shortened to fewer than 120 characters; it looks like it's in fms2_io/netcdf_io.F90.

thomas-robinson commented 4 months ago

@dkokron are you still looking at a March 25 time frame for testing? I'm trying to plan out a testing tag schedule.

dkokron commented 4 months ago

I'm not sure what you mean by testing. I've run the code (as of today) using all the UFS cases that I'm interested in testing.

thomas-robinson commented 4 months ago

> Further testing of this proposal depends on availability of Acorn which is scheduled to be in dedicated time for about 2 weeks starting today.

@dkokron this is what I meant by testing. You indicated that Acorn would be available around March 25. If you are satisfied with this PR and the testing done on your side, we can complete our code reviews and schedule it for merging and regression testing on our side.

dkokron commented 4 months ago

@thomas-robinson Acorn returned earlier than expected and I was able to get my testing completed yesterday.

dkokron commented 3 months ago

Is there anything more I need to do to move this forward?

MatthewPyle-NOAA commented 3 months ago

@bensonr Has this change made it into any alpha type release? Trying to understand the path it will take to being deployed in an FMS release. Thanks!

bensonr commented 3 months ago

> @bensonr Has this change made it into any alpha type release? Trying to understand the path it will take to being deployed in an FMS release. Thanks!

It was initially released as part of the 2024.01 beta4 release. Let us know if you encounter any issues in using it.

MatthewPyle-NOAA commented 2 months ago

@bensonr Things are mostly looking good using this beta4 release (did have a regression test of the model fail with it, but don't believe FMS is responsible for that). What is the timeline for an official 2024.01 release? Thanks!

thomas-robinson commented 2 months ago

@MatthewPyle-NOAA the release will (hopefully) be today or tomorrow pending our internal discussion at noon today. There will be a follow on patch in about a week.