NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Update RDHPCS Hera resource for `eupd` task #2636

Closed HenryRWinterbottom closed 4 weeks ago

HenryRWinterbottom commented 1 month ago

This PR addresses issue #2454. The following is accomplished:

Type of change

Change characteristics

How has this been tested?

This has been tested by @wx20jjung during experiment applications. A Rocoto workflow XML was provided containing the following job information for the gdaseupd task:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>HW_test_enkfgdaseupd_@H</cyclestr></jobname>
        <account>da-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>

The changes to parm/config/gfs/config.resources result in the following:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
        <account>fv3-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>
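For reference, the nodes line above follows from the usual config.resources arithmetic; a minimal sketch with the Hera values discussed in this PR (illustrative only; the authoritative settings live in parm/config/gfs/config.resources):

npe_node_max=40                                      # cores per Hera node
export npe_eupd=80                                   # total MPI tasks for eupd
export nth_eupd=8                                    # threads per task
export npe_node_eupd=$(( npe_node_max / nth_eupd ))  # 40 / 8 = 5 tasks (ppn) per node
# Rocoto node request: npe_eupd / npe_node_eupd = 80 / 5 = 16  -->  <nodes>16:ppn=5:tpp=8</nodes>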

Checklist

KateFriedman-NOAA commented 1 month ago

@wx20jjung found a solution to be to change the runtime layout to 5 PEs per node with 8 threads (instead of 8 PEs/5 threads) and 80 PEs total (instead of 270). This resulted in much shorter wait times and only about 5 minutes longer run time.

@wx20jjung @HenryWinterbottom-NOAA Interesting, I'm not used to seeing a resource fix that means fewer tasks and nodes...if this is a memory issue, doesn't fewer nodes mean not enough memory? The change in this PR only reduces the task count for this job on Hera; it will still use 8 threads and 5 ppn (as before). Since the issue this PR aims to fix reported intermittent failures that were resolvable upon rerun, was this tested over many cycles to ensure it actually fixes the problem?

Another question...why change the default nth_eupd to be 5 threads instead of 8? That change means that every machine not already specified with a machine if-block in that section will now use 5 threads instead of 8 (e.g. Orion, Hercules, and Jet). Was that change tested on those machines at C384?
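(For context, the machine-specific overrides referenced here follow an if-block pattern roughly like the sketch below; the values are illustrative and are not the actual PR diff.)

export nth_eupd=5                                    # default used by machines without an override
if [[ "${machine}" = "HERA" ]]; then
    export nth_eupd=8                                # machine-specific override
fi
export npe_node_eupd=$(( npe_node_max / nth_eupd ))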

wx20jjung commented 1 month ago

@KateFriedman-NOAA The underlying problem is memory per node, not total memory. The eupd step does not have OpenMP statements, so threads greater than 1 just idle cores on the node. Using 8 tasks per node can sometimes cause a memory problem within a node, and adding more nodes does not solve that failure because it is not a total-memory problem. I am not allowed to log in to the cluster nodes to monitor memory usage, so I do not know what the optimum configuration should be. The global workflow is also not set up to specify tasks-per-node for this step, which would help optimize the node (and memory) usage. I suspect 6 or 7 tasks (and 1 thread) would be the optimum use for the 40-core nodes on Hera and Jet.
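For illustration, the kind of per-node packing being described maps onto SLURM directives like the following minimal sketch (a hypothetical standalone batch script, not how the workflow actually submits this job; the executable name is a placeholder):

#!/bin/bash
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=5    # 5 MPI ranks per 40-core node
#SBATCH --cpus-per-task=8      # cores left to each rank (mostly idle if the code has no OpenMP)
#SBATCH --mem=0                # request all of each node's memory
srun ./eupd_executable         # placeholder executable name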

I have been running the 5 task / 8 thread combination at C384 on hera and jet (kjet, 40 core nodes) for several months now with no failures. I can't comment on any of the other machines or model resolutions.

KateFriedman-NOAA commented 1 month ago

@wx20jjung Thanks for that explanation, that helps my understanding of the issue!

I have been running the 5 task / 8 thread combination at C384 on hera

So, the global-workflow already has/had 8 threads and 5 tasks (ppn = 40 / 8 threads) for C384, so this PR is only lowering the total task count and thus the resulting node count. It seems like we already have the thread/ppn solution that was working for you. Does lowering the task and node count help further then (which is what this PR does)? I suspect a different resource configuration is needed.

The global workflow is also not setup to call tasks-per-node for this step, which would help optimize the node (and memory) usage.

The global-workflow config.resources has npe_node_JOB variables that can be adjusted if needed. We generally just set them like this for each job:

export npe_node_eupd=$(( npe_node_max / nth_eupd ))

...but, if needed, we can set this differently for a job/resolution.
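A hypothetical example of such an override (the resolution check and variable name are illustrative only; this is not code from this PR):

# Hypothetical per-resolution override; not part of this PR.
if [[ "${CASE}" = "C384" ]]; then
    export npe_node_eupd=5
fi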

Currently we set resources based on the following variables and calculations (showing this PR eupd resources as example):

npe_node_max=40 (total number of PEs per node on Hera)
npe_eupd=80 (total number of tasks for job)
nth_eupd=8 (threads for job)
npe_node_eupd=40/8=5 (PEs per node for job)

--> nodes=npe_eupd/npe_node_eupd=80/5=16

I am not allowed to login to the cluster nodes to monitor memory usage so I do not know what the optimum configuration should be.

We can add a memory command to a job to get the memory information printed in the log if needed. It's messy output with some error messages that can be ignored...which is why we don't have it on by default on Hera. Let me know if that would help to determine the memory needed.
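(As one low-effort alternative, SLURM accounting can report peak memory per job step after the fact; a sketch, assuming job accounting is enabled on the system:)

sacct -j <jobid> --format=JobID,JobName,Elapsed,NNodes,MaxRSS,MaxRSSNode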

I suspect 6 or 7 tasks (and 1 thread) would be the optimum use for the 40 core nodes on hera and jet.

Perhaps stepping back from what I went through above...what would you suggest for the resulting xml resource statement? The current result from this PR would be:

<nodes>16:ppn=5:tpp=8</nodes>

Sounds like this may be what you're suggesting:

<nodes>13:ppn=6:tpp=1</nodes>

(Note: the node value is rounded down using 6 ppn, which doesn't divide evenly into 80 tasks; it may end up as 14 nodes. One would have to run the setup_xml step to see.)

Let us know what would potentially be a better resource configuration for C384 eupd.

Note: the resource configuration method in global-workflow is being redesigned now, so feel free to provide a resource suggestion that doesn't fit the current calculation constraints and we can see if we can accommodate it.

wx20jjung commented 1 month ago

@KateFriedman-NOAA First, a clarification. The version(s) of global-workflow I am using have the "old" configuration of npe_eupd=270, nth_eupd=5. I changed these in my versions to npe_eupd=80, ppn=5, tpp=8, i.e.

16:ppn=5:tpp=8

to keep the jobs from failing on Hera and Jet. This keeps the *.xml consistent with the config.* file.

From this point on I have to be careful, as grant funding is not allowed to "transition items to operations" and I am already in trouble for transitioning code to EMC, so these are only suggestions.

My first suggestion is to identify the total memory needed for a specific number of ensemble members, resolution, and observation data volume. You only need this info for a few cycles, and it should identify how many nodes you will need. If possible, also check the memory requirement for each MPI task. The nature of this failure suggests the memory requirement per task is not balanced; there are probably one or more "outliers". The compiler and hardware vendors, and RDHPCS, should be able to help with this. You will need to assume all the tasks use the maximum (outlier) memory.

There is no "one size fits all" configuration for the complex workflow you have. Your configurations seem to be set up for task/thread ratios per node. SLURM gives you a lot of options on how to pack a node, and I do not know what the defaults are on the various machines. S4 was set up to fill a node before moving on to the next node; some systems put a task on each node (round robin) until they run out of tasks. I suggest a scenario where you put as many tasks on a node as possible to keep the MPI communication traffic across the network to a minimum. Any distribution scenario is messy and will have to be tailored for each job and the node configuration.
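For illustration, the packing behaviors described above correspond to SLURM's task distribution options; a minimal sketch with placeholder task counts and executable name:

srun -n 80 --distribution=block  ./executable   # fill each node with tasks before moving to the next
srun -n 80 --distribution=cyclic ./executable   # round-robin: one task per node at a time
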
aerorahul commented 1 month ago

@wx20jjung Do the changes in this PR resolve the issue reported in #2454? Thank you for your time.

emcbot commented 4 weeks ago

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2636