NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

investigation of unexpected behavior of ctest rrfs_3denvar_rdasens #766

Open TingLei-NOAA opened 4 months ago

TingLei-NOAA commented 4 months ago

As Peter Johnsen suggested via the orion help desk, and with help from @RussTreadon-NOAA, the behavior of regional GSI after the orion upgrade is being investigated in relation to the issues found on hercules: the netcdf error (when I_MPI_EXTRA_FILESYSTEM) and/or the non-reproducibility issues (https://github.com/NOAA-EMC/GSI/issues/697). It was found that rrfs_3denvar_rdasens_loproc_updat becomes idle (not finished in 1 hour 30 min) using 4 nodes and ppn=5. I had to follow the recent setup on hera given by @hu5970 (3 nodes, ppn=40), and with that the job finished successfully.
It is not clear to me what caused this or whether it is a sporadic issue (since there have been no other complaints about it up to now). This issue is opened to facilitate collaborative investigation. In addition to the GSI developers mentioned above, I'd also like to bring this to the attention of @ShunLiu-NOAA and @DavidHuber-NOAA.
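
For readers comparing the two job layouts mentioned above, the total MPI task count is simply nodes times ppn. The short sketch below only does that arithmetic; the `JobLayout` class and its field names are hypothetical helpers for illustration and are not part of GSI or its regression test scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobLayout:
    """Hypothetical description of a batch job layout (not a GSI type)."""
    nodes: int
    ppn: int  # MPI processes per node

    @property
    def total_tasks(self) -> int:
        # Total MPI tasks launched across all nodes.
        return self.nodes * self.ppn


# Layout reported to become idle, and the layout reported to finish.
hanging = JobLayout(nodes=4, ppn=5)
working = JobLayout(nodes=3, ppn=40)

print(f"hanging layout: {hanging.total_tasks} MPI tasks")
print(f"working layout: {working.total_tasks} MPI tasks")
```

The layout that hangs runs far fewer tasks than the one that finishes, so each task has to handle a larger share of the work.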

RussTreadon-NOAA commented 4 months ago

Thank you @TingLei-NOAA for opening this issue. This is a known problem. Please see discussions in

RDHPCS ticket #2024062754000098 has also been opened.

GSI PR #764 was merged into develop at EIB's request.

ShunLiu-NOAA commented 4 months ago

@TingLei-NOAA and @RussTreadon-NOAA Thank you for the heads-up. Since there is an RDHPCS ticket, we can wait for further action from RDHPCS.

TingLei-NOAA commented 4 months ago

@RussTreadon-NOAA Thanks for the info. I will first study the updates on those issues carefully. @ShunLiu-NOAA I am beginning to think this issue may not be specific to orion, since I see that a similar setup (more than 100 mpi processes are needed while the number of nodes may be smaller, so it does not seem to be a memory issue) is used on other machines like hera/wcoss2. It was also found that if the "fed" obs are not used and the fed model fields are not included in the control/state variables, this rrfs test works "normally" (using an mpi task setup similar to that of the hafs and previous fv3lam tests). I will do some further digging and see what I can find.

TingLei-NOAA commented 4 months ago

The same behavior is confirmed on hera: with ppn=5 and nodes=4, rrfs_3denvar_rdasens_loproc_updat became idle. The issue seems to occur in the parallel reading of the physvar files (dbz and fed). One mpi process failed to finish processing all the levels assigned to it.
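
To make the failure mode concrete, here is a minimal sketch of how (variable, level) work items might be spread over MPI ranks, and how a rank that stopped partway through its share could be identified. The round-robin scheme, the function names, and the tiny sizes (4 levels, 3 tasks) are all assumptions for illustration; the actual distribution in gsi_rfv3io_mod.f90 may differ.

```python
from collections import defaultdict


def assign_levels(variables, nlev, ntasks):
    """Round-robin (variable, level) items over ranks (hypothetical scheme)."""
    work = defaultdict(list)
    items = [(var, lev) for var in variables for lev in range(1, nlev + 1)]
    for i, item in enumerate(items):
        work[i % ntasks].append(item)
    return dict(work)


def find_stalled(assigned, completed):
    """Map each unfinished rank to the first item it did not complete."""
    stalled = {}
    for rank, items in assigned.items():
        done = completed.get(rank, [])
        if len(done) < len(items):
            stalled[rank] = items[len(done)]
    return stalled


# Illustrative sizes only; not the real test configuration.
assigned = assign_levels(("dbz", "fed"), nlev=4, ntasks=3)

# Pretend ranks 0 and 2 finished, while rank 1 stopped after its dbz level.
completed = {0: assigned[0], 1: assigned[1][:1], 2: assigned[2]}

for rank, item in find_stalled(assigned, completed).items():
    print(f"rank {rank} stalled before finishing {item}")
```

In a collective read, a single rank that never finishes its share is enough to leave every other rank waiting, which would look like the whole job going idle.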

TingLei-NOAA commented 4 months ago

An update: it is confirmed that this ctest, rrfs_3denvar_rdasens, passes using 20 mpi tasks on wcoss2, while it fails on both hera and orion with the newer compiler (upgraded Rocky 9). Using 20 tasks, GSI becomes idle on the 9th mpi rank when it begins to process the fed variable at level 1 (https://github.com/TingLei-daprediction/GSI/blob/dd341bb6b3e5aca403f9f8ea0a03692a397f29e9/src/gsi/gsi_rfv3io_mod.f90#L2894), after successfully reading in a few levels of the dbz variable.

For the time being, we can use task numbers similar to those on hera to let this ctest pass, but I think further investigation will be helpful. I will have more discussions (some offline) with colleagues, and I might submit a ticket for this problem.

TingLei-NOAA commented 4 months ago

A ticket with orion has been opened. A self-contained test case on hera that reproduces this issue was created and sent to R. Reddy at the help desk (thanks a lot!).

RussTreadon-NOAA commented 3 months ago

@TingLei-NOAA , what is the status of this issue?

TingLei-NOAA commented 3 months ago

@RussTreadon-NOAA I will follow up on this and come back when I have more updates to share.

RussTreadon-NOAA commented 2 months ago

@TingLei-daprediction , what is the status of this issue? PR #788 is a workaround, not a solution.

TingLei-NOAA commented 2 months ago

@RussTreadon-NOAA The experts at the RDHPCS help desk haven't made progress on this. We agreed that their work on this could be put on hold with the ticket kept open, and I will keep them posted if I have any new findings. I will look for opportunities to investigate this issue more deeply, if that works for other GSI developers.

RussTreadon-NOAA commented 2 months ago

Thank you @TingLei-NOAA . We periodically cycle through open GSI issues and PRs asking developers for updates. Developer feedback helps with planning and coordinating. Sometimes we even find issues which can be closed or PRs abandoned.

TingLei-NOAA commented 2 months ago

@RussTreadon-NOAA Really appreciate your help on all those issues/problems we encountered in this "transition period"!

RussTreadon-NOAA commented 1 week ago

Problems with the rrfs_3denvar_rdasens test now occur on Gaea, Jet, and Hera. The patch, thus far, is to alter the job configuration. The underlying cause for the hangs remains to be identified, confirmed, and resolved.

Is this an accurate assessment, @TingLei-NOAA ? If not, please update this issue with where we currently stand on this issue.

ShunLiu-NOAA commented 1 week ago

@RussTreadon-NOAA Ting is on leave for two weeks. He will work on it when he returns to work.