ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Not enough error checking for adding new restart variables #2616

Open ekluzek opened 6 days ago

ekluzek commented 6 days ago

Brief summary of bug

I added a new restart variable and used a dim1name of "patch" instead of "pft". The model aborts, but the error message does not make the cause clear.

General bug information

CTSM version you are using: branch_tags/dustemisdev.n05_ctsm5.1.dev166-5-gf48830977

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: When adding new variables to the restart file

Details of bug

This is an example of how poor error messaging makes problems in the code hard to find. I found the problem by running the case in DDT, and realized what was wrong when the failure came up in the define step rather than the write step. Before that, I thought it might have been caused by bad data in the array being written, or by the interpinic_flag.

Important details of your setup / configuration so we can reproduce the bug

    call restartvar(ncid=ncid, flag=flag, varname='OBU', xtype=ncd_double,  &
         dim1name='patch', &
         long_name='Monin-Obukhov length', units='m', &
         interpinic_flag='skip', readvar=readvar, data=this%obu_patch)

Important output or errors that show the problem

The cesm.log does point to the error, but it is buried in so much other output that it's hard to see.

 /glade/work/erik/ctsm_worktrees/dust_dev/share/src/shr_file_mod.F90         912 This routine is depricated - use shr_log_setLogUnit instead         -12
 /glade/work/erik/ctsm_worktrees/dust_dev/share/src/shr_file_mod.F90         912 This routine is depricated - use shr_log_setLogUnit instead         -13
 /glade/work/erik/ctsm_worktrees/dust_dev/share/src/shr_file_mod.F90         912 This routine is depricated - use shr_log_setLogUnit instead         -12
 /glade/work/erik/ctsm_worktrees/dust_dev/share/src/shr_file_mod.F90         912 This routine is depricated - use shr_log_setLogUnit instead         -13
Abort with message NetCDF: Invalid dimension ID or name in file /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/clib/pio_nc.c at line 812
Obtained 10 stack frames.
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpioc.so(print_trace+0x32) [0x14c46a7a228c]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpioc.so(piodie+0x77) [0x14c46a7a2399]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpioc.so(check_netcdf2+0x242) [0x14c46a7a272d]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpioc.so(check_netcdf+0x34) [0x14c46a7a24e9]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpioc.so(PIOc_inq_dimid+0x3a0) [0x14c46a7c3801]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpiof.so(__pio_nf_MOD_inq_dimid_id+0xb1) [0x14c46aa138cc]
/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/gcc-12.2.0/parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/lib/libpiof.so(__pio_nf_MOD_inq_dimid_desc+0x3d) [0x14c46aa13994]
/glade/derecho/scratch/erik/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold.20240624_131200_uzttv8/bld/cesm.exe() [0x5af763]
/glade/derecho/scratch/erik/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold.20240624_131200_uzttv8/bld/cesm.exe() [0x5af871]
/glade/derecho/scratch/erik/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold.20240624_131200_uzttv8/bld/cesm.exe() [0x71c18c]

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x14c4616efd4f in ???
    at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#1  0x14c4616efcbb in __GI_raise
    at ../sysdeps/unix/sysv/linux/raise.c:51
#2  0x14c4616f1354 in __GI_abort
    at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/stdlib/abort.c:79
#3  0x14c46a7a239d in piodie
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/clib/pioc_support.c:561
#4  0x14c46a7a272c in check_netcdf2
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/clib/pioc_support.c:683
#5  0x14c46a7a24e8 in check_netcdf
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/clib/pioc_support.c:632
#6  0x14c46a7c3800 in PIOc_inq_dimid
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/clib/pio_nc.c:812
#7  0x14c46aa138cb in __pio_nf_MOD_inq_dimid_id
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/flib/pio_nf.F90:519
#8  0x14c46aa13993 in __pio_nf_MOD_inq_dimid_desc
    at /glade/derecho/scratch/jedwards/tmp/spack-stage/spack-stage-parallelio-2.6.2-x3vfh2bjkpjsumev4h7myd7wf3jvvjub/spack-src/src/flib/pio_nf.F90:448
#9  0x5af762 in __ncdio_pio_MOD_ncd_inqdid
    at /glade/work/erik/ctsm_worktrees/dust_dev/src/main/ncdio_pio.F90.in:469
#10  0x5af870 in __ncdio_pio_MOD_ncd_defvar_bygrid
    at /glade/work/erik/ctsm_worktrees/dust_dev/src/main/ncdio_pio.F90.in:1257
#11  0x71c18b in __restutilmod_MOD_restartvar_1d_double
    at /glade/work/erik/ctsm_worktrees/dust_dev/src/utils/restUtilMod.F90.in:325
#12  0xa3cf54 in __frictionvelocitymod_MOD_restart
    at /glade/work/erik/ctsm_worktrees/dust_dev/src/biogeophys/FrictionVelocityMod.F90:443

ekluzek commented 6 days ago

This is in the same vein as #1913 and #144.

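Fixing this would just mean passing the dimexist option to the ncd_inqdid calls and checking the result, so that a missing dimension produces a clear error from CTSM instead of an abort inside PIO.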
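
To make the proposed check concrete, here is a minimal sketch of the logic. It is written in Python as a stand-in model, not in CTSM's Fortran, and every name in it (`inquire_dim`, `define_restart_var`, `DimensionError`, the `dims` list) is hypothetical and chosen for illustration only. It assumes the dimension lookup can report "not found" to the caller instead of aborting, which is the role the dimexist option would play.

```python
class DimensionError(Exception):
    """Raised when a variable names a dimension the file does not define."""


def inquire_dim(defined_dims, name):
    """Model of a dimension lookup that reports existence instead of aborting.

    Returns (dimid, exists); dimid is -1 when the name is unknown.
    """
    if name in defined_dims:
        return defined_dims.index(name), True
    return -1, False


def define_restart_var(defined_dims, varname, dim1name):
    """Define a variable only after confirming its dimension exists."""
    dimid, exists = inquire_dim(defined_dims, dim1name)
    if not exists:
        # Name the variable and the bad dimension so the message is
        # actionable without a debugger or a backtrace.
        raise DimensionError(
            f"cannot define restart variable '{varname}': dimension "
            f"'{dim1name}' is not defined on the file "
            f"(defined: {', '.join(defined_dims)})"
        )
    return dimid


# Illustrative dimension list; not the actual restart file contents.
dims = ["gridcell", "landunit", "column", "pft"]

define_restart_var(dims, "OBU", "pft")  # succeeds

try:
    define_restart_var(dims, "OBU", "patch")
except DimensionError as err:
    print(err)
```

The point of the design is that the caller, which knows the variable name, builds the error message. In the backtrace above, the abort happens several layers down in the I/O library, where only the dimension lookup is visible.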

This is something that should be done on b4b-dev. It's also the type of thing that simple I/O testing would help with, so the functional test framework would be a good place to test it.