NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

NCO Bugzilla tickets to be addressed in GFS v17 DA #356

Open RussTreadon-NOAA opened 2 years ago

RussTreadon-NOAA commented 2 years ago

NCO opened numerous GFS v16-related bugzilla tickets which must be addressed in GFS v17 or beyond. GSI issue #137 documents the GFS v16 DA bugzillas. New GFS DA bugzillas have been opened since #137.

NCO requests that these bugzillas and the remaining GFS v16 DA bugzillas be addressed in GFS v17 DA.

Note: The above list will likely grow as GFS v17 progresses, GFS v16 issues are discovered, and NCO provides feedback.

RussTreadon-NOAA commented 2 years ago

FYI: global-workflow issue #712 has been opened to track bugzilla 1301 from the g-w side.

StevenEarle-NCO commented 2 years ago

I ran a proof-of-concept test, similar to what Russ did. An initial test that simply changed ln to cp nearly doubled the runtime: about 53 minutes, where normal is 29 minutes. It took about 24 minutes to get to the gsi executable. I ran another test where I wrote all the copy commands to a file, then ran them as an mpmd process (all the copy commands run at the same time on different cores). This echo+cp step took only 30 seconds and the runtime dropped to 28 minutes. The analysis already allocates over 7000 cores, so I recommend making use of them whenever possible to run the cp commands in parallel.
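For readers unfamiliar with the pattern, here is a minimal sketch of the idea behind the parallel copy test. It is not the mpmd launcher used on WCOSS2. It is a Python stand-in that uses a thread pool on a single node, and the function name `parallel_copy` and the `max_workers` default are hypothetical choices made for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parallel_copy(pairs, max_workers=8):
    """Copy each (src, dst) pair concurrently and return the destinations.

    Illustrative stand-in for an mpmd command file: every copy is an
    independent task, so the tasks can run at the same time.
    """
    def _copy_one(pair):
        src, dst = pair
        # Create the destination directory if it does not exist yet.
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies file data and metadata
        return str(dst)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all copies; a failed copy re-raises its exception here.
        return list(pool.map(_copy_one, pairs))
```

The copies are independent of one another, which is why spreading them across workers (or, in the real job, across allocated cores) shortens the wall-clock time of the staging step.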

RussTreadon-NOAA commented 2 years ago

Smart use of mpmd, @StevenEarle-NCO! Scripts can be examined and refactored, where possible, to wrap multiple in/out copies within mpmd. Need to ensure this works on WCOSS2 and RDHPCS machines.

RussTreadon-NOAA commented 2 years ago

FYI: General discussion of replacement of ln with cp/mv is occurring in g-w issue #712

RussTreadon-NOAA commented 2 years ago

@dtkleist , @CoryMartin-NOAA , @CatherineThomas-NOAA - for your awareness.

GFS v17 can NOT use links in working directories. In case you do not have access to bugzilla, here's the content of bugzilla 1301

gfs - write to files in working directory instead of links pointing to COMOUT

[Wei Wei](mailto:wei.wei@noaa.gov) 2022-04-06 13:27:21 UTC
In the current version of GFS, some jobs write to COMOUT directly through links in working directories. 

This is risky because downstream jobs can potentially get partial files and fail, as happened in wave_post job. 

Please write to working directories, then cpfs (or cp/mv, depending on the file sizes) to COMOUT once the files are complete.

[Wei Wei](mailto:wei.wei@noaa.gov) 2022-04-07 13:16:17 UTC
Updates from Steven:

"
I asked Wei to submit this ticket because we need to get all of production back to a self-contained DATA per process/model. We can't have direct writes to COMOUT.
We allowed this to happen on WCOSS because we didn't have the IO/storage bandwidth to support self-contained DATA. It's time to go back to where we were several years ago to improve:
-- Portability
-- Contained IO, making management of the system/storage possible
-- Pristine place to save after failures for debug/troubleshooting later

We've designed WCOSS2 to have an all flash/ssd filesystem (f1/f2), which has superior performance... 10x the aggregate bandwidth when compared to the fastest filesystem on WCOSS1. GFS/GDAS cannot currently take advantage of that because COMOUT is on h1, which is designed for long term storage. 
As we design and procure future systems, we can ensure adequate bandwidth to support local, self contained working spaces. We can't do that when models use external links.

Please give it a try on WCOSS2 and let us know how much delay there is and/or how many more resources you need to support this requirement.

"

This bugzilla ticket is for the next major upgrade, GFSv17.

We need to address this for both the GSI- and JEDI- based pieces of GFS v17 DA.
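The request in bugzilla 1301 (write in the working directory, then copy to COMOUT once the file is complete) can be sketched as follows. This is only an illustration under stated assumptions, not the cpfs utility or any existing GSI or global-workflow script: the function name `publish` and the `.tmp` suffix are hypothetical, and the atomic rename relies on the temporary file and the final file being on the same filesystem.

```python
import os
import shutil
from pathlib import Path


def publish(work_file, comout_dir):
    """Copy a finished file from the working directory into COMOUT.

    The data is first copied under a temporary name inside COMOUT and then
    renamed, so a downstream job looking for the final name never sees a
    partially written file.
    """
    work_file = Path(work_file)
    comout_dir = Path(comout_dir)
    comout_dir.mkdir(parents=True, exist_ok=True)

    final = comout_dir / work_file.name
    partial = comout_dir / (work_file.name + ".tmp")  # hypothetical suffix

    shutil.copy2(work_file, partial)  # the slow part happens under the temporary name
    os.replace(partial, final)        # atomic rename within one filesystem
    return final
```

A job would call `publish` only after it has closed the file in its working directory, which keeps all in-progress IO inside the self-contained DATA area.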

RussTreadon-NOAA commented 9 months ago

@CatherineThomas-NOAA , the GSI Handling Review team is going through GSI issues to see which, if any, we can close. Since this issue mentions GFS v17, I assume that we need to keep it open. Do you agree?

CatherineThomas-NOAA commented 9 months ago

@RussTreadon-NOAA I agree, we need to keep this issue open. Thanks for checking.

Tagging @JessicaMeixner-NOAA for awareness.

RussTreadon-NOAA commented 9 months ago

Thank you @CatherineThomas-NOAA for the confirmation. We will keep this issue open.

RussTreadon-NOAA commented 4 months ago

@CatherineThomas-NOAA : do we have anyone (EIB or DA) working on this issue?

CatherineThomas-NOAA commented 4 months ago

@RussTreadon-NOAA: This issue was mentioned by @aerorahul last week, though I'm not sure if any work has started yet. @aerorahul, does the workflow team need anything from DA on this?

aerorahul commented 4 months ago

The cp/ln issue appears in many scripts of the workflow, and we are not equipped to resolve all of them in one go. Some of the biggest issues are in the forecast and analysis jobs, where the volume of data and the runtime are crucial.
We are tackling them as we go. Any help is appreciated.

RussTreadon-NOAA commented 4 months ago

Thank you @aerorahul for the update. I'll add you as an assignee but feel free to reassign to other EIB staff.

RussTreadon-NOAA commented 1 week ago

@CatherineThomas-NOAA and @aerorahul , what is the status of this issue?