ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Fail in f19 derecho_intel test with fire_emis test mod #2300

Open ekluzek opened 6 months ago

ekluzek commented 6 months ago

Brief summary of bug

This test fails during the RUN phase on Derecho in release-cesm2.2:

ERP_D_Ld5.f19_g17.I2000Clm50BgcCruGs.derecho_intel.clm-fire_emis
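For readers less familiar with these test names, the dot-separated fields can be pulled apart as in the sketch below. This is only an illustration: `parse_test_name`, `ParsedTest`, and the field labels are hypothetical names chosen here, not part of CIME, and the code assumes the five-field layout seen in the test names in this issue.

```python
from typing import List, NamedTuple


class ParsedTest(NamedTuple):
    """Hypothetical container for the pieces of a test name."""
    test_type: str        # e.g. "ERP"
    modifiers: List[str]  # e.g. ["D", "Ld5"]
    grid: str             # e.g. "f19_g17"
    compset: str          # e.g. "I2000Clm50BgcCruGs"
    machine: str          # e.g. "derecho"
    compiler: str         # e.g. "intel"
    testmod: str          # e.g. "clm-fire_emis"


def parse_test_name(name: str) -> ParsedTest:
    """Split a five-field test name (assumed layout) into its parts."""
    parts = name.strip().split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(parts)}: {name!r}")
    type_and_mods, grid, compset, mach_comp, testmod = parts
    # The first field holds the test type followed by "_"-separated modifiers.
    test_type, *modifiers = type_and_mods.split("_")
    # The machine and compiler are joined by the last underscore.
    machine, compiler = mach_comp.rsplit("_", 1)
    return ParsedTest(test_type, modifiers, grid, compset, machine, compiler, testmod)


failing = parse_test_name(
    "ERP_D_Ld5.f19_g17.I2000Clm50BgcCruGs.derecho_intel.clm-fire_emis"
)
print(failing.testmod, failing.machine, failing.compiler)
```

Splitting names this way makes it easy to compare the failing test against others by grid, compset, or test mod.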

The same failure also appeared in cesm2.2.2-rc.01.

General bug information

CTSM version you are using: release-cesm2.2.04

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected:

./xmlchange --force CLM_BLDNML_OPTS="-fire_emis" --append

Details of bug

This is the only fire-emission test for release-cesm2.2.04. Many other f19 tests run fine:

ERI_N2_Ld9.f19_g17.I2000Clm50BgcCrop.derecho_intel.clm-default
ERP_D_Ld5.f19_g17.I2000Clm50BgcCruGs.derecho_intel.clm-default
ERP_D_Ld5.f19_g17.IHistClm50SpCru.derecho_intel.clm-drydepnomegan
ERP_D_Ld5.f19_g17_gl4.I1850Clm50BgcCrop.derecho_intel.clm-glcMEC_changeFlags
ERP_D_Ld9.f19_g17.I2000Clm50Cn.derecho_intel.clm-drydepnomegan
ERP_Ld5.f19_g17.I1850Clm50Bgc.derecho_intel.clm-default
ERP_Ld5.f19_g17.I2000Clm50BgcCruGs.derecho_intel.clm-default
ERP_P128x2_D.f19_g17.I2000Clm50SpRtmFl.derecho_intel.clm-default
ERP_P128x2_D_Ld5.f19_g17.I2000Clm50Sp.derecho_intel.clm-default
ERP_P128x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.derecho_intel.clm-default
ERP_P128x2_D_Ld5.f19_g17_gl4.I1850Clm50BgcCropG.derecho_intel.clm-glcMEC_increase
ERS_D_Ld3.f19_g17_gl4.I1850Clm50BgcCrop.derecho_intel.clm-clm50dynroots
LII_D_Ld3.f19_g17_gl4.I2000Clm50BgcCrop.derecho_intel.clm-glcMEC_spunup_1way
SMS.f19_g17.I2000Clm50Cn.derecho_intel.clm-default
SMS_D.f19_f19_mg17.I2010Clm50Sp.derecho_intel.clm-clm50cam6LndTuningMode
SMS_D_Ln9_P512x3.f19_g17.IHistClm50SpGs.derecho_intel.clm-waccmx_offline
SMS_Ld1.f19_g17.I2000Clm50Vic.derecho_intel.clm-default
SMS_Ld5.f19_g17.I2000Clm45Fates.derecho_intel.clm-FatesColdDef
SMS_Ld5.f19_g17.I2000Clm50Fates.derecho_intel.clm-FatesColdDef
SMS_Ld5.f19_g17.IHistClm50Bgc.derecho_intel.clm-decStart
SMS_Lm1.f19_g17.I1850Clm50BgcCropCmip6waccm.derecho_intel.clm-basic
SMS_Lm1.f19_g17_gl4.I1850Clm50Bgc.derecho_intel.clm-clm50dynroots
SMS_Lm13.f19_g17.I2000Clm50BgcCrop.derecho_intel.clm-cropMonthOutput
SMS_Ln9_P128x3.f19_g17.IHistClm50SpGs.derecho_intel.clm-waccmx_offline2005Start
SSP_D_Ld10.f19_g17.I1850Clm50Bgc.derecho_intel.clm-rtmColdSSP
SSP_Ld10.f19_g17.I1850Clm50Bgc.derecho_intel.clm-rtmColdSSP

The problem appears to be in writing out data: the traceback passes through PIO's write_darray routine in libpiof.so.

Important output or errors that show the problem

cesm.log

dec1750.hsn.de.hpc.ucar.edu 1420: /var/run/palsd/6168e1d3-7c4a-4f68-a794-b7584cb5e3d1/files/cesm.exe() [0xcc5e2c]
dec1753.hsn.de.hpc.ucar.edu 1536: /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.27/oneapi/2023.2.1/zyhu/lib/libpiof.so(piodarray_mp_write_darray_1d_double_+0xe6) [0x1466c892b846]
dec1730.hsn.de.hpc.ucar.edu 128: MPICH ERROR [Rank 128] [job id 6168e1d3-7c4a-4f68-a794-b7584cb5e3d1] [Tue Dec 12 01:01:30 2023] [dec1730] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128
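When scanning a long cesm.log for lines like these, a small filter can help. The sketch below is a minimal example, assuming each line has the form "hostname number: message" as in the excerpt above. The numeric field is treated as a per-task label (in the MPICH line it matches the reported rank 128); `parse_line` and `labels_matching` are hypothetical helper names.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line layout: "<hostname> <number>: <message>"
LINE_RE = re.compile(r"^(?P<host>\S+)\s+(?P<label>\d+):\s+(?P<msg>.*)$")


class LogEntry(NamedTuple):
    host: str
    label: int  # numeric prefix; assumed to identify the task
    msg: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["host"], int(m["label"]), m["msg"])


def labels_matching(lines: Iterable[str], needle: str) -> List[int]:
    """Sorted, de-duplicated task labels whose message contains `needle`."""
    entries = (parse_line(line) for line in lines)
    return sorted({e.label for e in entries if e is not None and needle in e.msg})


# The three lines quoted in this issue, copied verbatim.
SAMPLE = [
    "dec1750.hsn.de.hpc.ucar.edu 1420: /var/run/palsd/6168e1d3-7c4a-4f68-a794-b7584cb5e3d1/files/cesm.exe() [0xcc5e2c]",
    "dec1753.hsn.de.hpc.ucar.edu 1536: /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/cray-mpich/8.1.27/oneapi/2023.2.1/zyhu/lib/libpiof.so(piodarray_mp_write_darray_1d_double_+0xe6) [0x1466c892b846]",
    "dec1730.hsn.de.hpc.ucar.edu 128: MPICH ERROR [Rank 128] [job id 6168e1d3-7c4a-4f68-a794-b7584cb5e3d1] [Tue Dec 12 01:01:30 2023] [dec1730] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128",
]

print("MPI_Abort:", labels_matching(SAMPLE, "MPI_Abort"))
print("write_darray:", labels_matching(SAMPLE, "write_darray"))
```

On a full log, the same two searches would show whether the abort and the PIO write frames come from the same tasks or from many different ones.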

ekluzek commented 6 months ago

Note that this test passed on Cheyenne in release-cesm2.2.02:

PASS ERP_D_Ld5.f19_g17.I2000Clm50BgcCruGs.cheyenne_intel.clm-fire_emis RUN time=222

On Cheyenne, CTSM ran 1440 tasks on 40 nodes; on Derecho it runs 1536 tasks on 12 nodes.
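The per-node task density implied by those figures can be worked out directly. The sketch below only does the arithmetic on the numbers quoted above; `tasks_per_node` is a hypothetical helper, and nothing here establishes that the layout difference is the cause of the failure.

```python
def tasks_per_node(tasks: int, nodes: int) -> float:
    """Average number of tasks placed on each node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return tasks / nodes


# (tasks, nodes) as quoted in this issue
layouts = {
    "cheyenne": (1440, 40),
    "derecho": (1536, 12),
}

density = {name: tasks_per_node(t, n) for name, (t, n) in layouts.items()}
for name, value in density.items():
    print(f"{name}: {value:g} tasks per node")

# Ratio of Derecho's per-node density to Cheyenne's
print(f"ratio: {density['derecho'] / density['cheyenne']:.2f}")
```

That works out to 36 tasks per node on Cheyenne and 128 on Derecho, so each Derecho node carries roughly 3.6 times as many tasks.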

ekluzek commented 6 months ago

Note that a similar fire_emis test (on the f10 grid with the GNU compiler) passes in ctsm5.1.dev159 on the CTSM main development line:

PASS ERP_D_Ld5.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-fire_emis RUN time=155

The same test also passes in release-clm5.0.37:

ERP_D_Ld5.f19_g17.I2000Clm50BgcCruGs.derecho_intel.clm-fire_emis