E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Pre-defined PE layouts are broken for NA RRM AMIP runs with maint-2.0 #5025

Open tangq opened 2 years ago

tangq commented 2 years ago

We need to start the v2 NARRM AMIP runs with high-frequency output very soon for the RRM overview paper. When I tested this configuration on chrysalis, both the pre-defined M and L layouts failed with this error:

Model mpassi missing file graph64 = '/lcrc/group/e3sm/data/inputdata/ice/mpas-cice/WC14to60E2r3/mpas-seaice.graph.info.200714.part.64'

The v2 NARRM AMIP runs with standard output were tested successfully with both M and L layouts on chrysalis.

Failed tests: /lcrc/group/e3sm/ac.qtang/E3SMv2/old/v2.NARRM.amip_0101_bonus/tests
Successful tests: /lcrc/group/e3sm/ac.qtang/E3SMv2/v2.NARRM.amip_0101/tests

Diffing env_mach_pes.xml between the failed case (<) and the successful case (>) shows the following differences:

diff /lcrc/group/e3sm/ac.qtang/E3SMv2/v2.NARRM.amip_0101_bonus/tests/M_2x5_ndays/case_scripts/env_mach_pes.xml /lcrc/group/e3sm/ac.qtang/E3SMv2/v2.NARRM.amip_0101/tests/M_2x5_ndays/case_scripts/env_mach_pes.xml 
26c26
<   <comment>none</comment>
---
>   <comment> fmod030c64x1 s=6.2 </comment>
28c28
<     <entry id="COST_PES" value="64">
---
>     <entry id="COST_PES" value="1920">
32c32
<     <entry id="TOTALPES" value="64">
---
>     <entry id="TOTALPES" value="1920">
62,69c62,69
<         <value compclass="ATM">-1</value>
<         <value compclass="CPL">-1</value>
<         <value compclass="OCN">-1</value>
<         <value compclass="WAV">-1</value>
<         <value compclass="GLC">-1</value>
<         <value compclass="ICE">-1</value>
<         <value compclass="ROF">-1</value>
<         <value compclass="LND">-1</value>
---
>         <value compclass="ATM">1920</value>
>         <value compclass="CPL">1920</value>
>         <value compclass="OCN">1920</value>
>         <value compclass="WAV">1</value>
>         <value compclass="GLC">1</value>
>         <value compclass="ICE">1920</value>
>         <value compclass="ROF">1920</value>
>         <value compclass="LND">1920</value>
78,84c78,84
<         <value compclass="ATM">64</value>
<         <value compclass="OCN">64</value>
<         <value compclass="WAV">64</value>
<         <value compclass="GLC">64</value>
<         <value compclass="ICE">64</value>
<         <value compclass="ROF">64</value>
<         <value compclass="LND">64</value>
---
>         <value compclass="ATM">1920</value>
>         <value compclass="OCN">1920</value>
>         <value compclass="WAV">1</value>
>         <value compclass="GLC">1</value>
>         <value compclass="ICE">1920</value>
>         <value compclass="ROF">1920</value>
>         <value compclass="LND">1920</value>
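
The same comparison can be scripted when more than one case needs checking. The following is a minimal Python sketch, not part of CIME. It assumes env_mach_pes.xml stores per-component values as `<value compclass="...">` elements under an `<entry id="NTASKS">`, as the diff lines above suggest. The two inline XML snippets are trimmed-down, hypothetical stand-ins for the real files, and the function names `read_per_component` and `layout_diff` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-ins for the two env_mach_pes.xml files.
FAILED_XML = """<file><group id="mach_pes"><entry id="NTASKS"><values>
<value compclass="ATM">-1</value>
<value compclass="ICE">-1</value>
<value compclass="WAV">-1</value>
</values></entry></group></file>"""

OK_XML = """<file><group id="mach_pes"><entry id="NTASKS"><values>
<value compclass="ATM">1920</value>
<value compclass="ICE">1920</value>
<value compclass="WAV">1</value>
</values></entry></group></file>"""


def read_per_component(xml_text, entry_id="NTASKS"):
    """Return {compclass: int} for one entry; empty dict if the entry is absent."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return {
                v.get("compclass"): int(v.text)
                for v in entry.iter("value")
                if v.get("compclass")
            }
    return {}


def layout_diff(failed, ok):
    """Return {compclass: (failed_value, ok_value)} for components that differ."""
    comps = sorted(set(failed) | set(ok))
    return {c: (failed.get(c), ok.get(c)) for c in comps if failed.get(c) != ok.get(c)}


for comp, (bad, good) in layout_diff(
    read_per_component(FAILED_XML), read_per_component(OK_XML)
).items():
    print(f"{comp}: failed case has {bad}, successful case has {good}")
```

In the failed case every component is -1. Assuming the CIME convention that a negative NTASKS value is a count of whole nodes, this corresponds to a single 64-task node and is consistent with the TOTALPES of 64 in the diff.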

ndkeen commented 2 years ago

I see this is on chrysalis. Is there a test name that reproduces this RRM configuration? I don't see one in cime/config/e3sm/tests.py

tangq commented 2 years ago

@ndkeen, good question. I am not sure whether the NA RRM configurations used in the E3SMv2 production runs are tested routinely.

rljacob commented 2 years ago

northamericax4v1pg2_WC14to60E2r3.WCYCL1850.*.allactive-wcprodrrm is in the prod test suite.

tangq commented 2 years ago

Is the AMIP NA RRM tested by northamericax4v1pg2_WC14to60E2r3.WCYCL1850.*.allactive-wcprodrrm? If not, we will need to add the AMIP test.

tangq commented 2 years ago

I reproduced the pre-defined "L" layout, which was used for the production run, with the following xmlchange commands.

./xmlchange COST_PES=3840
./xmlchange NTASKS_ATM=3840
./xmlchange NTASKS_CPL=3840
./xmlchange NTASKS_OCN=3840
./xmlchange NTASKS_WAV=1
./xmlchange NTASKS_GLC=1
./xmlchange NTASKS_ICE=3840
./xmlchange NTASKS_ROF=3840
./xmlchange NTASKS_LND=3840
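
To avoid retyping the commands above for each layout, they can be generated from a table. This is a small Python sketch under assumptions: the `L_LAYOUT` values are copied from the commands above, while the helper name `xmlchange_commands` is hypothetical. It only prints the command strings and does not call xmlchange itself.

```python
# Per-component task counts for the "L" layout, copied from the commands above.
L_LAYOUT = {
    "ATM": 3840, "CPL": 3840, "OCN": 3840, "WAV": 1,
    "GLC": 1, "ICE": 3840, "ROF": 3840, "LND": 3840,
}


def xmlchange_commands(ntasks, cost_pes=None):
    """Build the list of ./xmlchange command strings for a PE layout."""
    cmds = []
    if cost_pes is not None:
        cmds.append(f"./xmlchange COST_PES={cost_pes}")
    # dicts keep insertion order, so the commands follow the table order
    cmds.extend(f"./xmlchange NTASKS_{comp}={n}" for comp, n in ntasks.items())
    return cmds


for cmd in xmlchange_commands(L_LAYOUT, cost_pes=3840):
    print(cmd)
```

The printed lines can be pasted into the case directory or written to a script. A table for the "M" layout would use 1920 in place of 3840, following the successful case in the diff.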

ndkeen commented 2 years ago

I ran SMS.northamericax4v1pg2_WC14to60E2r3.WCYCL1850.chrysalis_intel.allactive-wcprodrrm which completed 5 days and would be used for 'M' size layouts. It uses 80 nodes total. I do not know if this test is similar enough to what is failing for you.

tangq commented 2 years ago

The test for this configuration should be something like SMS.northamericax4v1pg2_F20TR.*.

rljacob commented 2 years ago

We also have conusx4v1_r05_oECv3.F2010 in the integration test suite.

What is the difference between conusx4v1 and northamericax4v1?

tangq commented 2 years ago

conusx4v1 is for E3SMv1, whereas northamericax4v1 is for E3SMv2.

We will need a test for northamericax4v1pg2_WC14to60E2r3.F20TR (if one doesn't already exist), which is the configuration used in the v2 AMIP production runs.

rljacob commented 2 years ago

Found the problem. The CIME update split the PE layout files by component, but the EAM component layouts weren't updated. PR https://github.com/E3SM-Project/E3SM/pull/4928 needs to be added to maint-2.0.

tangq commented 2 years ago

That makes sense, and it highlights the importance of testing production configurations. If northamericax4v1pg2_WC14to60E2r3.F20TR had been in the test suite, we would have caught this when merging PR #4928.
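
One inexpensive guard for such a test could look like the Python sketch below. It treats the signature seen in the failed case's env_mach_pes.xml (a `<comment>` of `none` and every NTASKS set to -1) as a sign that no pre-defined layout matched. Whether that signature is reliable in general is an assumption, and the function name `looks_like_default_layout` is hypothetical, not part of CIME.

```python
def looks_like_default_layout(comment, ntasks):
    """Return True when a case shows the fallback signature from the failed run.

    comment: text of the <comment> element in env_mach_pes.xml
    ntasks:  {compclass: NTASKS value}
    """
    no_named_layout = comment.strip().lower() == "none"
    all_one_node = bool(ntasks) and all(n == -1 for n in ntasks.values())
    return no_named_layout and all_one_node


# Values taken from the diff earlier in this thread.
failed_case = {c: -1 for c in ("ATM", "CPL", "OCN", "WAV", "GLC", "ICE", "ROF", "LND")}
good_case = {"ATM": 1920, "CPL": 1920, "OCN": 1920, "WAV": 1,
             "GLC": 1, "ICE": 1920, "ROF": 1920, "LND": 1920}

print(looks_like_default_layout("none", failed_case))
print(looks_like_default_layout(" fmod030c64x1 s=6.2 ", good_case))
```

A test that creates the F20TR case and fails when this check returns True would flag a missing layout at case setup, before any model run is submitted.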