NOAA-EMC / RTOFS_GLO


Allow for model scalability beyond 1800 NTASK (bugzilla #1297) #22

Open DanIredell-NOAA opened 2 years ago

DanIredell-NOAA commented 2 years ago

http://www2.spa.ncep.noaa.gov/bugzilla/show_bug.cgi?id=1297

Currently the forecast job can run on at most 1800 cores, because CICE's NTASK value is hard-set to 1800.

This hard limit on scalability makes it hard to improve the science (decrease the time step or run the model faster) and/or fully utilize resources.

For example, the current reservation line for rtofs_global_forecast_step2 is:

PBS -l place=vscatter:excl,select=15:ncpus=120:mpiprocs=120

This uses only 120 of the 128 cores available per node. The code is not memory bound, so 8 cores per node (120 cores in total across the 15 nodes) are idling in this case.

The following would be more efficient, but would require a different NTASK (15 x 128 = 1920 instead of 1800):

PBS -l place=vscatter:excl,select=15:ncpus=128:mpiprocs=128
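
The arithmetic behind the two reservation lines can be checked with a short sketch. It assumes 128 cores per node, as stated above; the function name and the dictionary keys are illustrative only and are not part of the RTOFS scripts.

```python
def reservation_summary(nodes, mpiprocs_per_node, cores_per_node=128):
    """Return total MPI tasks and idle cores for a PBS select statement."""
    total_tasks = nodes * mpiprocs_per_node
    idle_per_node = cores_per_node - mpiprocs_per_node
    return {
        "total_tasks": total_tasks,
        "idle_per_node": idle_per_node,
        "idle_total": nodes * idle_per_node,
    }


# select=15:ncpus=120:mpiprocs=120 (current reservation)
current = reservation_summary(15, 120)
# select=15:ncpus=128:mpiprocs=128 (full-node reservation)
proposed = reservation_summary(15, 128)

print(current)
print(proposed)
```

Under these assumptions the current reservation yields 1800 tasks with 8 idle cores per node, while the full-node reservation would need 1920 tasks.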

DanIredell-NOAA commented 11 months ago

First, this is what we use in operations now (we set exclhost, not excl):

PBS -l place=vscatter:exclhost,select=15:ncpus=120:mpiprocs=120

Options for the place statement:

Modifier      Meaning
free          Place job on any vnode(s)
pack          All chunks will be taken from one host
scatter       Only one chunk is taken from any host
vscatter      Only one chunk is taken from any vnode.  Each chunk must fit on a vnode.
excl          Only this job uses the vnodes chosen
exclhost      The entire host is allocated to this job
shared        This job can share the vnodes chosen

DanIredell-NOAA commented 11 months ago

Second, we can create another tile layout for HYCOM that uses more than 1800 tasks. That would require creating another patch.input and changing about half a dozen other parm files (blkdat.input, ice_in, and so on). The scripts would also need modifying so that they know which set of these files to use, based on NTASK.
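
As a rough illustration of selecting a file set by NTASK, the sketch below maps a task count to the layout-dependent files named above. The naming scheme (a `.NTASK` suffix), the list of supported task counts, and the function name are all hypothetical; the actual RTOFS scripts may organize these files quite differently.

```python
# Files named in this thread as depending on the tile layout.
LAYOUT_FILES = ("patch.input", "blkdat.input", "ice_in")

# Assumption: 1800 is the current layout; 1920 is only an example of a new one.
SUPPORTED_NTASK = (1800, 1920)


def parm_files_for(ntask):
    """Return the hypothetical per-layout file names for a given task count."""
    if ntask not in SUPPORTED_NTASK:
        raise ValueError(f"no tile layout prepared for NTASK={ntask}")
    # Hypothetical convention: one copy of each file per layout, tagged by NTASK.
    return [f"{name}.{ntask}" for name in LAYOUT_FILES]


print(parm_files_for(1800))
```

Failing loudly on an unsupported task count is a deliberate choice here, since running with files from the wrong layout would be harder to diagnose than an early error.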

It would also need another HYCOM executable, because the code is compiled with NTASKS set. The value is NPX x NPY, which in the current case is 450 x 4 = 1800. See comp_ice.csh.
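
To see which NPX x NPY pairs are arithmetically possible for a given task count, a minimal sketch follows. It only enumerates factor pairs; whether a pair is actually usable depends on the grid dimensions and the tiling, which this sketch does not check. The function name is illustrative.

```python
def tile_layouts(ntask):
    """List every (npx, npy) pair with npx * npy == ntask and npx >= npy."""
    pairs = []
    npy = 1
    while npy * npy <= ntask:
        if ntask % npy == 0:
            pairs.append((ntask // npy, npy))
        npy += 1
    return pairs


# The current layout, 450 x 4, is one of the factor pairs of 1800.
assert (450, 4) in tile_layouts(1800)

# Candidate pairs for a full-node run on 15 nodes (15 x 128 = 1920 tasks).
for npx, npy in tile_layouts(1920):
    print(npx, npy)
```

Keeping NPY at 4, a 1920-task layout would correspond to NPX = 480, though that is only one of several factorizations.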

DanIredell-NOAA commented 9 months ago

At the V2.4.0 kickoff meeting it was decided to put this on hold until the MOM-CICE version planned for RTOFS v3.0 in 2026.