Open jedwards4b opened 3 years ago
The test comes out of the box with pelayout:
Comp  NTASKS  NTHRDS  ROOTPE
CPL :     36/      2;      0
ATM :     36/      2;      0
LND :     36/      2;      0
ICE :     36/      2;      0
OCN :    216/      2;     36
ROF :     36/      2;      0
GLC :     36/      2;      0
WAV :     36/      2;      0
IAC :      1/      1;      0
ESP :      1/      1;      0
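For reference, the product NTASKS times NTHRDS per component can be read off this layout mechanically. The following is a minimal Python sketch, assuming the layout text is pasted in exactly as printed above; `parse_pelayout` is a hypothetical helper written for this note, not part of CIME.

```python
import re

# Layout rows copied from the listing above.
PELAYOUT = """\
CPL : 36/ 2; 0
ATM : 36/ 2; 0
LND : 36/ 2; 0
ICE : 36/ 2; 0
OCN : 216/ 2; 36
ROF : 36/ 2; 0
GLC : 36/ 2; 0
WAV : 36/ 2; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0
"""

# One row looks like "COMP : NTASKS/ NTHRDS; ROOTPE"; whitespace is flexible.
ROW = re.compile(r"^\s*(\w+)\s*:\s*(\d+)\s*/\s*(\d+)\s*;\s*(\d+)\s*$")

def parse_pelayout(text):
    """Return {component: (ntasks, nthrds, rootpe)} for each layout row."""
    layout = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            comp = match.group(1)
            layout[comp] = tuple(int(v) for v in match.group(2, 3, 4))
    return layout

layout = parse_pelayout(PELAYOUT)
for comp, (ntasks, nthrds, rootpe) in layout.items():
    print(f"{comp}: {ntasks} tasks x {nthrds} threads = {ntasks * nthrds}, "
          f"starting at PE {rootpe}")
```

In this layout OCN is the only component that does not start at PE 0, and its 216 tasks with 2 threads each give a product of 432.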
If I set NTHRDS=1, it passes.
The following related tests pass:
ERS.T62_g16.G.cheyenne_pgi.pop-default (NTHRDS_OCN=1)
ERS_D.T62_g16.G.cheyenne_pgi.pop-default (NTHRDS_OCN=1)
ERS.T62_g16.G1850ECO.cheyenne_gnu.pop-cice_ecosys
ERS.T62_g16.G1850ECO.cheyenne_intel.pop-cice_ecosys
ERS_D.T62_g16.G1850ECO.cheyenne_gnu.pop-cice_ecosys
ERS_D.T62_g16.G1850ECO.cheyenne_intel.pop-cice_ecosys
ERS_D.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys fails, aborting earlier in the run than ERS.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys, with multiple "Subscript out of range for array uarea" error messages from line 638 of grid.F90, which is
UAREA(:,:,iblock) = DXU(:,:,iblock)*DYU(:,:,iblock)
The error messages include bounds, with values such as
subscript=1, lower bound=396713728, upper bound=47395747491999, dimension=1
subscript=1, lower bound=1609522944, upper bound=47441122783391, dimension=1
subscript=1, lower bound=440, upper bound=441, dimension=1
UAREA is defined at the module level with dimension(nx_block,ny_block,max_blocks_clinic).
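The reported lower bounds are themselves a clue: a Fortran array declared with dimension(nx_block,ny_block,max_blocks_clinic) has a lower bound of 1 in every dimension, so a message reporting any other lower bound describes bounds that the declaration cannot produce. The following is a minimal Python sketch that flags such messages, assuming log lines of exactly the form quoted above; `suspicious_bounds` is a hypothetical helper, not an existing tool.

```python
import re

# Messages copied verbatim from the failing run quoted above.
MESSAGES = [
    "subscript=1, lower bound=396713728, upper bound=47395747491999, dimension=1",
    "subscript=1, lower bound=1609522944, upper bound=47441122783391, dimension=1",
    "subscript=1, lower bound=440, upper bound=441, dimension=1",
]

PATTERN = re.compile(
    r"subscript=(-?\d+), lower bound=(-?\d+), upper bound=(-?\d+), dimension=(\d+)"
)

def suspicious_bounds(messages, declared_lower=1):
    """Return parsed messages whose lower bound differs from the declared one."""
    flagged = []
    for msg in messages:
        match = PATTERN.search(msg)
        if match is None:
            continue  # not a bounds message in the assumed format
        subscript, lower, upper, dim = (int(v) for v in match.groups())
        if lower != declared_lower:
            flagged.append({"subscript": subscript, "lower": lower,
                            "upper": upper, "dimension": dim})
    return flagged

for entry in suspicious_bounds(MESSAGES):
    print(entry)
```

All three quoted messages are flagged. That is consistent with the bounds information itself being corrupted rather than with a genuinely bad index, though it does not by itself identify the cause.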
I'm suspecting a pgi compiler problem that is specific to threads.
@jedwards4b, are other components passing tests with pgi on cheyenne with multiple threads?
I'm wondering if we should just be setting NTHRDS_OCN=1 for compiler=pgi.
The test SMS_P216x2_D.T62_g16.G.cheyenne_pgi.pop-default also fails, aborting on the same line of code as ERS_D.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys.
The related test SMS_P216x2.T62_g16.G.cheyenne_pgi.pop-default passes.
Tests SMS_P216x2_D.T62_g16.G.cheyenne_intel.pop-default and SMS_P216x2_D.T62_g16.G.cheyenne_gnu.pop-default both pass as well.
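The test names in this thread encode the variables being compared (compiler, debug build, task and thread counts). The following is a minimal Python sketch that splits a name into those parts, assuming the usual CIME layout TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS], where a `_D` modifier requests a debug build and `_PNxM` requests N tasks with M threads. `parse_test_name` is a hypothetical helper written for this note, not CIME's own parser.

```python
import re

def parse_test_name(name):
    """Split a CIME-style test name into its dot-separated fields."""
    fields = name.split(".")
    head, grid, compset, mach_comp = fields[:4]
    testmods = fields[4] if len(fields) > 4 else None
    testtype, *modifiers = head.split("_")
    machine, compiler = mach_comp.rsplit("_", 1)

    info = {
        "testtype": testtype, "grid": grid, "compset": compset,
        "machine": machine, "compiler": compiler, "testmods": testmods,
        "debug": "D" in modifiers, "ntasks": None, "nthrds": None,
    }
    # A modifier such as P216x2 means 216 tasks with 2 threads each.
    for mod in modifiers:
        pe = re.fullmatch(r"P(\d+)x(\d+)", mod)
        if pe:
            info["ntasks"], info["nthrds"] = int(pe.group(1)), int(pe.group(2))
    return info

for name in [
    "SMS_P216x2_D.T62_g16.G.cheyenne_pgi.pop-default",
    "SMS_P216x2.T62_g16.G.cheyenne_pgi.pop-default",
    "ERS_D.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys",
]:
    info = parse_test_name(name)
    print(info["compiler"], info["debug"], info["ntasks"], info["nthrds"], name)
```

Laid out this way, the two SMS_P216x2 pgi tests differ only in the debug flag, which is the comparison the results above turn on.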
Rather than changing the PE layout, maybe we should just run ERS.T62_g16.G1850ECO.cheyenne_pgi.pop-cice_ecosys (removing the debug option).
Isn't that the original test that failed?
I added debug in an attempt to tease out more info on the failure.
Here's a non-threaded debug test that is failing with pgi, though I don't see why:
CASE=SMS_P432x1_D.T62_g16.G.cheyenne_pgi.pop-default.20201112_101617_pz68a0
CASEROOT=/glade/scratch/klindsay/$CASE
This has the same POP block sizes as the G1850ECO tests that are failing.
Oops, my mistake - never mind.
Regarding case SMS_P432x1_D.T62_g16.G.cheyenne_pgi.pop-default, the error is:
Subscript out of range for array irqrs (cesm2_x_alpha/components/ww3/src/source/w3initmd.f90: 1901)
There are too many tasks for the ww3 model. I reran this test after setting NTASKS_WAV=72, and it completed without error.
Description of the issue:
The test fails on cheyenne with a threading error at line 1069 of passive_tracers.F90
Version:
Machine/Environment Description:
Currently Loaded Modules
1) ncarenv/1.3 2) cmake/3.14.4 3) pgi/19.3 4) openmpi/3.1.4 5) netcdf-mpi/4.7.3 6) pnetcdf/1.12.1 7) ncarcompilers/0.5.0
Any xml/namelist changes or SourceMods:
As defined by the test.