NOAA-EMC / fv3atm


Model not running with multiple write groups for large RRFS regional domain #787

Closed: MatthewPyle-NOAA closed this issue 8 months ago

MatthewPyle-NOAA commented 9 months ago

Description

When attempting to use multiple write groups to keep pace with sub-hourly output, the model hung before true integration had begun and eventually timed out.

 in fv3 cap init, output_startfh=  0.0000000E+00  iau_offset=           0
 output_fh=  9.9999998E-03  0.2500000      0.5000000      0.7500000
   1.000000       1.500000       2.000000       2.500000       3.000000
   4.000000       5.000000       6.000000     lflname_fulltime= T
 fcst_advertise, cpl_grid_id=           1
=>> PBS: job killed: walltime 1748 exceeded limit 1713
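
For reference, the write component settings involved live in model_configure. A minimal sketch of the kind of setup being tested, with illustrative values pieced together from the log above and the task counts discussed later (not the exact RRFS configuration):

  # hypothetical model_configure excerpt (write component only)
  quilting:                .true.
  write_groups:            2
  write_tasks_per_group:   128
  # output_grid choice is assumed; RRFS may use a different projection
  output_grid:             'rotated_latlon'
  # sub-hourly then hourly output times, matching the output_fh array printed above
  output_fh:               0.01 0.25 0.5 0.75 1.0 1.5 2.0 2.5 3.0 4.0 5.0 6.0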

To Reproduce:

What compilers/machines are you seeing this with?

Seen on WCOSS2 with Intel-built code.

Multiple write groups run without a problem for smaller domains, but seem to have trouble with the 3950 x 2700 x 65 RRFS North America regional domain.
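
As a rough sense of scale for that domain (assuming 4-byte reals):

  3950 x 2700 x 65 = 693,225,000 points, x 4 bytes ≈ 2.8 GB per full 3D field

so each write group must buffer a substantial amount of data per output field, which is consistent with the later finding that more write tasks (and quilt nodes) per group were needed.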

MatthewPyle-NOAA commented 9 months ago

A quick update - got things to run with two write groups of 128 tasks each when writing to a tiny output grid, and also for a reasonably large output grid (roughly 80% of the size of the RRFS output grid).

Shifting to a 2 x 192 write-task setup allowed writing of the full RRFS output grid. So slowly getting unstuck.
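
In model_configure terms, the change is just the per-group task count, e.g. (sketch only; other write component settings as before):

  write_groups:            2
  # 192 tasks per group, up from 128; 2 x 128 could not handle the full RRFS output grid
  write_tasks_per_group:   192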

MatthewPyle-NOAA commented 8 months ago

Have a working quilt specification - just required more quilt nodes than expected.