ESCOMP / POP2-CESM

Parallel Ocean Program (POP2) in CESM
http://www.cesm.ucar.edu/models/cesm2/ocean/

cesm2.2.0 fails test ERS.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys #43

Open jedwards4b opened 3 years ago

jedwards4b commented 3 years ago

Description of the issue:

The test fails on cheyenne with a threading error at line 1069 of passive_tracers.F90

Version:

Machine/Environment Description:

Currently Loaded Modules
1) ncarenv/1.3 2) cmake/3.14.4 3) pgi/19.3 4) openmpi/3.1.4 5) netcdf-mpi/4.7.3 6) pnetcdf/1.12.1 7) ncarcompilers/0.5.0

Any xml/namelist changes or SourceMods:

As defined by the test.

jedwards4b commented 3 years ago

The test comes out of the box with pelayout:

Comp  NTASKS  NTHRDS  ROOTPE
CPL :     36/     2;      0
ATM :     36/     2;      0
LND :     36/     2;      0
ICE :     36/     2;      0
OCN :    216/     2;     36
ROF :     36/     2;      0
GLC :     36/     2;      0
WAV :     36/     2;      0
IAC :      1/     1;      0
ESP :      1/     1;      0

If I set NTHRDS=1, it passes.
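For readers less familiar with this table, here is a minimal Python sketch that reads the layout above and reports how many MPI tasks it spans and which components are threaded. The helper name `parse_pelayout` is hypothetical (it is not part of CIME), and the task count assumes every component uses a stride of 1, which the table does not show.

```python
import re

# The layout rows copied from the table above (header line omitted).
PELAYOUT = """\
CPL :     36/     2;      0
ATM :     36/     2;      0
LND :     36/     2;      0
ICE :     36/     2;      0
OCN :    216/     2;     36
ROF :     36/     2;      0
GLC :     36/     2;      0
WAV :     36/     2;      0
IAC :      1/     1;      0
ESP :      1/     1;      0
"""

# One row looks like "COMP : NTASKS/ NTHRDS; ROOTPE".
ROW = re.compile(r"^(\w+)\s*:\s*(\d+)/\s*(\d+);\s*(\d+)$")


def parse_pelayout(text):
    """Hypothetical helper: map component -> (ntasks, nthrds, rootpe)."""
    layout = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            comp, ntasks, nthrds, rootpe = match.groups()
            layout[comp] = (int(ntasks), int(nthrds), int(rootpe))
    return layout


layout = parse_pelayout(PELAYOUT)

# Highest MPI rank used by any component, plus one (assumes stride 1).
total_tasks = max(rootpe + ntasks for ntasks, _, rootpe in layout.values())

# Components that run with more than one thread per task.
threaded = sorted(comp for comp, (_, nthrds, _) in layout.items() if nthrds > 1)

print("total MPI tasks:", total_tasks)
print("threaded components:", ", ".join(threaded))
```

Under that assumption, OCN occupies its own block of 216 tasks starting at rank 36, while the other components share the ranks starting at 0.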

klindsay28 commented 3 years ago

The following related tests pass:

ERS.T62_g16.G.cheyenne_pgi.pop-default (NTHRDS_OCN=1)
ERS_D.T62_g16.G.cheyenne_pgi.pop-default (NTHRDS_OCN=1)
ERS.T62_g16.G1850ECO.cheyenne_gnu.pop-cice_ecosys
ERS.T62_g16.G1850ECO.cheyenne_intel.pop-cice_ecosys
ERS_D.T62_g16.G1850ECO.cheyenne_gnu.pop-cice_ecosys
ERS_D.T62_g16.G1850ECO.cheyenne_intel.pop-cice_ecosys

ERS_D.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys fails, aborting earlier in the run than ERS.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys. It reports multiple "Subscript out of range for array uarea" error messages from line 638 of grid.F90, which is

      UAREA(:,:,iblock) = DXU(:,:,iblock)*DYU(:,:,iblock)

The error messages include bounds, with values such as

    subscript=1, lower bound=396713728, upper bound=47395747491999, dimension=1
    subscript=1, lower bound=1609522944, upper bound=47441122783391, dimension=1
    subscript=1, lower bound=440, upper bound=441, dimension=1

UAREA is defined at the module level with dimension(nx_block,ny_block,max_blocks_clinic).

I suspect a pgi compiler problem that is specific to threads.

@jedwards4b, are other components passing tests with pgi on cheyenne with multiple threads?

I'm wondering if we should just be setting NTHRDS_OCN=1 for compiler=pgi.

klindsay28 commented 3 years ago

The test SMS_P216x2_D.T62_g16.G.cheyenne_pgi.pop-default also fails, aborting on the same line of code as ERS_D.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys.

The related test SMS_P216x2.T62_g16.G.cheyenne_pgi.pop-default passes.
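The test names being compared here differ only in a few encoded fields, so a small decoder can make the comparison easier to follow. This is a minimal sketch, assuming the CIME naming convention TESTTYPE[_MODIFIER...].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS], with `_D` meaning a debug build and `_PNxM` meaning N tasks by M threads. The function `parse_test_name` is hypothetical and is not a CIME API.

```python
import re


def parse_test_name(name):
    """Hypothetical helper: split a CIME-style test name into its fields."""
    parts = name.split(".")
    head, grid, compset = parts[0], parts[1], parts[2]
    machine_compiler = parts[3] if len(parts) > 3 else None
    testmods = parts[4] if len(parts) > 4 else None

    # The first field is the test type followed by optional modifiers.
    test_type, *modifiers = head.split("_")

    ntasks = nthrds = None
    for mod in modifiers:
        # A PE-count modifier looks like P216x2 (tasks x threads) or P216.
        match = re.fullmatch(r"P(\d+)(?:x(\d+))?", mod)
        if match:
            ntasks = int(match.group(1))
            nthrds = int(match.group(2)) if match.group(2) else None

    machine = compiler = None
    if machine_compiler:
        # The compiler is the part after the last underscore.
        machine, _, compiler = machine_compiler.rpartition("_")

    return {
        "test_type": test_type,
        "debug": "D" in modifiers,
        "ntasks": ntasks,
        "nthrds": nthrds,
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
        "testmods": testmods,
    }


failing = parse_test_name("SMS_P216x2_D.T62_g16.G.cheyenne_pgi.pop-default")
passing = parse_test_name("SMS_P216x2.T62_g16.G.cheyenne_pgi.pop-default")

# List the fields in which the failing and passing tests differ.
diff = {key for key in failing if failing[key] != passing[key]}
print(sorted(diff))
```

For the two SMS_P216x2 tests above, the only decoded field that differs is the debug flag.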

jedwards4b commented 3 years ago

Tests SMS_P216x2_D.T62_g16.G.cheyenne_intel.pop-default and SMS_P216x2_D.T62_g16.G.cheyenne_gnu.pop-default both pass as well.

jedwards4b commented 3 years ago

Rather than changing the PE layout maybe we should just run ERS.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys (removing the debug option).

klindsay28 commented 3 years ago

Isn't that the original test that failed?

I added debug in an attempt to tease out more info on the failure.

Here's a non-threaded debug test that is failing with pgi, though I don't see why:

    CASE=SMS_P432x1_D.T62_g16.G.cheyenne_pgi.pop-default.20201112_101617_pz68a0
    CASEROOT=/glade/scratch/klindsay/$CASE

This has the same POP block sizes as the G1850ECO tests that are failing.

jedwards4b commented 3 years ago

Oops, my mistake - never mind.

jedwards4b commented 3 years ago

Regarding case SMS_P432x1_D.T62_g16.G.cheyenne_pgi.pop-default, the error is:

    Subscript out of range for array irqrs (cesm2_x_alpha/components/ww3/src/source/w3initmd.f90: 1901)

jedwards4b commented 3 years ago

There are too many tasks for the ww3 model. I reran this test after setting NTASKS_WAV=72, and it completes without error.