ESCOMP / CDEPS

Community Data Models for Earth Prediction Systems
https://escomp.github.io/CDEPS/versions/master/html/index.html

Running a case with files with zero size doesn't give a graceful error #254

Open ekluzek opened 10 months ago

ekluzek commented 10 months ago

I ran a case in which some forcing files had zero size. Obviously that is a problem, and I shouldn't expect the case to run. The trouble is that no error message tells me what went wrong: there's nothing in the atm.log file, and the cesm.log contains nothing informative. Since most files were fine, spot-checking a few of them didn't reveal the issue. It would be helpful if, when the model encounters a zero-size file, it gave a useful error message, so that I know what the problem is, which file it is, and how to fix it.

The test is:

SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default

Where many of the files had zero size, such as...

Model datm missing file file203 = '/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/atm_forcing.datm7.NLDAS2.0.125d.v1/Precip/ctsmforc.NLDAS2.0.125d.v1.Prec.1996-11.nc'
Model datm missing file file204 = '/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/atm_forcing.datm7.NLDAS2.0.125d.v1/Precip/ctsmforc.NLDAS2.0.125d.v1.Prec.1996-12.nc'
Model datm missing file file205 = '/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/atm_forcing.datm7.NLDAS2.0.125d.v1/Precip/ctsmforc.NLDAS2.0.125d.v1.Prec.1997-01.nc'
Model datm missing file file206 = '/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/atm_forcing.datm7.NLDAS2.0.125d.v1/Precip/ctsmforc.NLDAS2.0.125d.v1.Prec.1997-02.nc'
Model datm missing file file207 = '/glade/campaign/cesm/cesmdata/inputdata/atm/datm7/atm_forcing.datm7.NLDAS2.0.125d.v1/Precip/ctsmforc.NLDAS2.0.125d.v1.Prec.1997-03.nc'

cesm.log:

0: (t_initf)       profile_single_file=      F
0: (t_initf)       profile_global_stats=     T
0: (t_initf)       profile_ovhd_measurement= F
0: (t_initf)       profile_add_detail=       F
0: (t_initf)       profile_papi_enable=      F
0: /glade/work/erik/ctsm_worktrees/main_dev/share/src/shr_file_mod.F90         912 This routine is depricated - use shr_log_setLogUnit instead         -13
0:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
1:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
21:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
33:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
34:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
35:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
2:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
3:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
4:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
5:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
7:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
8:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
9:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
10:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
12:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
13:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
14:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
15:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
16:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
17:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
18:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
19:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
20:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
22:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
23:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
24:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
24:Obtained 10 stack frames.
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x1219b8c]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x1219c90]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x121a02b]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x121fbdc]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe(PIOc_openfile+0x7f) [0x12182b6]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x11c3e0e]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x106ac07]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x107b3a4]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x105e19c]
24:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x563b85]
25:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
26:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
26:Obtained 10 stack frames.
26:/glade/scratch/erik/SMS_D_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.cheyenne_gnu.clm-default.GC.ctsm51d154redochlist/bld/cesm.exe() [0x1219b8c]
27:Abort with message Unknown error in file operation in file /glade/scratch/vanderwb/hpci-stack/221214-1434/23959/pio-2.5.10/src/clib/pioc_support.c at line 2832
27:Obtained 10 stack frames.

The only other file with content is the med.log file, which doesn't have anything remarkable.

Turning on the PET files I see this:

0 : ATM-TO-MED-State
20231126 201433.298 INFO             PET000   ESM0001: InitializeIPDv02p3 intro: local peCount=  1
20231126 201433.298 INFO             PET000   ESM0001: InitializeIPDv02p3 intro: local accDeviceCount=  0
20231126 201433.298 INFO             PET000   ESM0001: InitializeIPDv02p3 intro: ssiLocalPetCount= 36
20231126 201433.298 INFO             PET000   >>>ESM0001: entered Initialize (phase=IPDv02p3) with current time: 2000  1  1  0  0  0   0
20231126 201433.298 INFO             PET000   (atm_comp_nuopc):(InitializeRealize)  called
20231126 201433.772 INFO             PET000   (shr_strdata_init_from_config) called

So we can see that the failure is in datm, while it is handling the stream forcing files, but we can't tell which specific file(s) it's failing on.
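
The cesm.log excerpt above repeats the same abort message on nearly every rank, which makes it hard to see whether any rank reported something different. As a rough aid for reading such a log, here is a minimal Python sketch that groups the "Abort with message" lines by message text and lists the ranks that printed each one. The function name `summarize_aborts` and the regular expression are assumptions made for illustration, based only on the line format shown in this issue; it is not part of CDEPS, CIME, or PIO.

```python
import re
from collections import defaultdict

# Matches lines such as "24:Abort with message <text>": an MPI rank prefix,
# a colon, and then the abort text. The format is inferred from the log
# excerpt in this issue and may differ on other systems.
ABORT_RE = re.compile(r"^\s*(\d+):\s*Abort with message (.*)$")


def summarize_aborts(lines):
    """Group identical abort messages and collect the ranks that printed them."""
    ranks_by_message = defaultdict(set)
    for line in lines:
        match = ABORT_RE.match(line)
        if match:
            rank = int(match.group(1))
            message = match.group(2).strip()
            ranks_by_message[message].add(rank)
    # Return plain dict with sorted rank lists for stable, readable output.
    return {msg: sorted(ranks) for msg, ranks in ranks_by_message.items()}
```

Passing the lines of a cesm.log file (for example `summarize_aborts(open("cesm.log"))`) returns a dictionary with one entry per distinct message. For the log above that would confirm that every reporting rank hit the same PIO error, but it still would not identify the file.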

billsacks commented 10 months ago

It looks like this error is generated from PIO's check_netcdf2 function. I'm wondering if we can modify that function to write out the file name responsible for the error. @jedwards4b does that seem reasonable to you? If so, this could be something that I take on, tied in with your hope that other people will contribute to PIO development. (From some initial investigation, I think I'll have some questions for you, but I'd be happy to take a stab at the implementation if the idea seems reasonable.)

jedwards4b commented 10 months ago

I think that it might be better to add this as a feature to check_input_data so that the error is captured before the model starts. The unknown file error comes from the underlying netcdf layer, and I'm not sure how you would change it without a significant effect on performance.
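
To make the pre-run idea concrete, here is a minimal Python sketch of a check that flags missing or zero-size files in a list of input paths before the model starts. It is not the check_input_data implementation; the function name `find_unusable_files` and the reason strings are hypothetical, and the sketch assumes the list of forcing files has already been collected from the stream configuration.

```python
import os


def find_unusable_files(paths):
    """Return (path, reason) pairs for files that are missing or have zero size."""
    problems = []
    for path in paths:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            problems.append((path, "missing"))
            continue
        except OSError as err:
            # Covers permission problems and similar failures to stat the file.
            problems.append((path, f"cannot stat: {err.strerror}"))
            continue
        if size == 0:
            problems.append((path, "zero size"))
    return problems
```

A caller could print each `(path, reason)` pair and stop before submitting the job if the returned list is not empty. Because this runs once, outside the model, the cost of the `os.stat` calls does not affect run-time performance.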

billsacks commented 10 months ago

I like the idea of catching this particular error in check_input_data. At the same time, I think it would be helpful for PIO to output as much information as it can when it hits an error like this. From my quick look through the PIO code, it looks like PIO knows the file name at the point where this error is generated (in this case, from PIOc_openfile_retry), so it seems like PIO could pass the filename into check_netcdf2 where it could be appended to the error message that comes out of the NetCDF layer. @jedwards4b I'm curious what you think about that approach. If it would be easier to discuss this over Zoom, we can do that.
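
The pattern being proposed, attaching the file name to whatever error the lower layer reports, can be illustrated in a few lines. PIO itself is written in C, so the Python below is only a sketch of the idea and not PIO code: `FileOpenError` and `open_with_context` are hypothetical names, and `opener` stands in for the underlying library call that fails.

```python
class FileOpenError(OSError):
    """Hypothetical error type whose message includes the offending file name."""


def open_with_context(path, opener=open):
    """Call opener(path); on failure, re-raise with the file name in the message."""
    try:
        return opener(path)
    except OSError as err:
        # The message is only built on the failure path, so a successful
        # open pays nothing extra beyond the try block itself.
        raise FileOpenError(f"error opening '{path}': {err}") from err
```

With a wrapper like this, a low-level message such as "Unknown error in file operation" would be reported together with the path that triggered it, without any attempt to detect zero-length files up front.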

jedwards4b commented 10 months ago

Adding the file name to the error message would be a good addition and I support that. Trying to determine if I am opening a 0 length file would, I think, be expensive.

billsacks commented 10 months ago

See https://github.com/NCAR/ParallelIO/issues/1979