NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

investigation of unexpected behavior of ctest rrfs_3denvar_rdasens #766

Open TingLei-NOAA opened 1 month ago

TingLei-NOAA commented 1 month ago

As Peter Johnsen suggested via the Orion help desk, and with @RussTreadon-NOAA's help, the behavior of the regional GSI after the Orion upgrade is being investigated in relation to the issues found on Hercules: the NetCDF error (when I_MPI_EXTRA_FILESYSTEM is set) and the irreproducibility issues (https://github.com/NOAA-EMC/GSI/issues/697). It was found that rrfs_3denvar_rdasens_loproc_updat becomes idle (not finished within 1 hour 30 minutes) using 4 nodes and ppn=5. I had to follow the recent setup on Hera given by @hu5970 (3 nodes, ppn=40), with which the job finishes successfully.
It is not clear to me what caused this, or whether it is a spontaneous issue (there have been no other complaints about it so far), and this issue is opened to facilitate a collaborative investigation. In addition to the GSI developers mentioned above, I'd also like to bring this to the attention of @ShunLiu-NOAA and @DavidHuber-NOAA.
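For reference, the two task layouts discussed above can be sketched as SLURM batch directives. This is a hedged illustration only: the directive values come from the thread, but whether and how I_MPI_EXTRA_FILESYSTEM should be set is still under investigation and is shown here only as the variable in question, not as a confirmed workaround.

```shell
# Illustrative sketch of the two resource layouts described above.

# Layout under which rrfs_3denvar_rdasens_loproc_updat became idle on Orion
# (4 nodes x 5 tasks per node = 20 MPI tasks):
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5

# Layout following the Hera setup from @hu5970, with which the job finished
# (3 nodes x 40 tasks per node = 120 MPI tasks):
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=40

# Intel MPI variable implicated in the Hercules NetCDF error above; its exact
# role (set vs. unset) is not established in this thread:
# export I_MPI_EXTRA_FILESYSTEM=...
```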

RussTreadon-NOAA commented 1 month ago

Thank you @TingLei-NOAA for opening this issue. This is a known problem. Please see discussions in

RDHPCS ticket #2024062754000098 has also been opened.

GSI PR #764 was merged into develop at EIB's request.

ShunLiu-NOAA commented 1 month ago

@TingLei-NOAA and @RussTreadon-NOAA Thank you for the heads-up. Since there is an RDHPCS ticket, we can wait for further action from RDHPCS.

TingLei-NOAA commented 1 month ago

@RussTreadon-NOAA Thanks for the information. I will study the updates in those issues carefully first. @ShunLiu-NOAA I am beginning to think this issue may not be specific to Orion, since I see that a similar setup (more than 100 MPI processes, while the node counts may be smaller, so it does not seem to be a memory issue) is used on other machines like Hera and WCOSS2. It was also found that if the "fed" obs are not used and the fed model fields are not included in the control/state variables, this rrfs test works "normally" (using an MPI task setup similar to the HAFS and previous fv3lam tests). I will do some further digging and see what I can find.

TingLei-NOAA commented 1 month ago

The same behavior is confirmed on Hera (with ppn=5, nodes=4): rrfs_3denvar_rdasens_loproc_updat became idle. The issue seems to occur in the parallel reading of the physvar files (dbz and fed); one MPI process fails to finish processing all of the levels assigned to it.
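"Levels assigned to" each task suggests a cyclic distribution of vertical levels over MPI ranks during the parallel read. A minimal sketch of such a mapping follows; it is purely illustrative, and the names `nlevs`/`ntasks` and the round-robin rule are assumptions, not taken from gsi_rfv3io_mod.f90.

```shell
#!/bin/bash
# Illustrative sketch only: assign vertical levels to MPI ranks round-robin,
# one plausible reading of "levels assigned to" each task in the report above.
# nlevs is a hypothetical level count; ntasks=20 matches the WCOSS2 run.
nlevs=65
ntasks=20
for ((lev = 1; lev <= nlevs; lev++)); do
  rank=$(( (lev - 1) % ntasks ))
  echo "level ${lev} -> rank ${rank}"
done
```

If the underlying read is collective, a single rank that stalls on one of its levels would leave the whole job idle, which would be consistent with the hang described here; that inference is speculative, not confirmed in the thread.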

TingLei-NOAA commented 1 month ago

An update: it is confirmed that this ctest rrfs_3denvar_rdasens passes using 20 MPI tasks on WCOSS2, while it fails on both Hera and Orion with the newer compiler (after the Rocky 9 upgrade). Using 20 tasks, GSI becomes idle on the 9th MPI rank when it begins to deal with the fed variables at level 1 (https://github.com/TingLei-daprediction/GSI/blob/dd341bb6b3e5aca403f9f8ea0a03692a397f29e9/src/gsi/gsi_rfv3io_mod.f90#L2894), after successfully reading in a few levels of the dbz variables.

For now, we can use task counts similar to those on Hera to let this ctest pass, but I think further investigation will be helpful. I will have more discussions (some offline) with colleagues, and I may submit a ticket for this problem.

TingLei-NOAA commented 1 month ago

A ticket with Orion has been opened. A self-contained test case on Hera to reproduce this issue was created and sent to R. Reddy at the help desk (thanks a lot!).

RussTreadon-NOAA commented 2 weeks ago

@TingLei-NOAA , what is the status of this issue?

TingLei-NOAA commented 2 weeks ago

@RussTreadon-NOAA I will follow up on this and come back when I have more updates to share.