NCAR / MPAS-Workflow

Scripts for controlling DA workflows with MPAS-Model and mpas-bundle
Apache License 2.0

Fix some memory and core/node usage for derecho #279

Closed by liujake 9 months ago

liujake commented 9 months ago

Description

  1. I ran cycling tests for the standard 120km 3DEnVar scenario (scenarios/3denvar_OIE120km_WarmStart_VarBC.yaml). They failed at the forecast step after 3 cycles because the forecast job exceeded its wall-time limit and was killed.

  2. Looking more closely at the job settings, I found that they still defaulted to 36 cores/node (though I do not believe this caused the failure). I changed 36 cores/node to 128 cores/node, so the 120km forecast step now uses 1 node with 128 cores/node. After this change, I restarted the cycling and it ran through 2 days of cycles, from 2018041418 to 2018041700.

  3. I also changed the memory settings in some files from the old 45GB (cheyenne) to 235GB (derecho). I believe we can remove those memory settings altogether, since all derecho nodes have the same 235GB, unlike cheyenne, which had both 45GB and 109GB nodes, but I will leave that for a future PR. I did not change the memory settings in the ensemble-related files; that can also be done in a future PR. This memory change should fix a test failure of test/testinput/3denvar_O30kmIE60km_WarmStart.yaml.

    modified:   initialize/applications/Forecast.py
    modified:   initialize/applications/HofX.py
    modified:   initialize/applications/Variational.py
    modified:   initialize/post/VerifyModel.py
    modified:   initialize/post/VerifyObs.py
    modified:   scenarios/defaults/forecast.yaml
    modified:   scenarios/defaults/variational.yaml
    modified:   test/testinput/3denvar_O30kmIE60km_WarmStart.yaml
    modified:   test/testinput/3dvar_O30kmIE60km_ColdStart.yaml
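
For reference, the node arithmetic behind item 2 can be sketched as below. This is only an illustration and not code from this PR: `nodes_needed` and `pbs_select` are hypothetical helpers, the 128-task count is taken from the "1 node and 128 cores/node" forecast setting above, and I am assuming fully packed nodes and a PBS-style `select` resource string.

```python
import math

# Per-node figures quoted in this PR description.
CHEYENNE_CORES_PER_NODE = 36
DERECHO_CORES_PER_NODE = 128
DERECHO_MEMORY_GB = 235


def nodes_needed(total_tasks: int, cores_per_node: int) -> int:
    """Smallest number of nodes that can host total_tasks MPI tasks.

    Hypothetical helper for illustration; not part of MPAS-Workflow.
    """
    if total_tasks <= 0 or cores_per_node <= 0:
        raise ValueError("total_tasks and cores_per_node must be positive")
    return math.ceil(total_tasks / cores_per_node)


def pbs_select(total_tasks: int, cores_per_node: int, memory_gb: int) -> str:
    """Build a PBS-style select string, assuming every node is fully packed."""
    nodes = nodes_needed(total_tasks, cores_per_node)
    return (
        f"select={nodes}:ncpus={cores_per_node}"
        f":mpiprocs={cores_per_node}:mem={memory_gb}GB"
    )


if __name__ == "__main__":
    # 128 forecast tasks fit on one derecho node ...
    print(pbs_select(128, DERECHO_CORES_PER_NODE, DERECHO_MEMORY_GB))
    # ... but would be spread over several nodes at 36 cores/node.
    print(nodes_needed(128, CHEYENNE_CORES_PER_NODE))
```

With the old 36 cores/node default, the same 128 tasks span 4 nodes instead of 1, which is why the default matters for proper node usage even if it was not the cause of the wall-time failure.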

liujake commented 9 months ago

The workflow should run out of the box for those standard settings, without local changes and with proper use of derecho nodes. This PR is one of several that will gradually fix most of the scenarios we work with frequently.

During the tests, I also found another issue. See https://github.com/NCAR/MPAS-Workflow/issues/280.