COSIMA / cice5

Clone of The Los Alamos sea ice model (CICE) with ACCESS drivers. See https://github.com/CICE-Consortium/CICE-svn-trunk/tree/cice-5.1.2

Crash in thickness_changes (ice_therm_vertical.f90) #3

Closed aidanheerdegen closed 6 years ago

aidanheerdegen commented 6 years ago

Turns out the crash I thought was MATM (https://github.com/OceansAus/matm/issues/4) is in cice.

This is not a CICE issue, but I thought it important to document in case someone else has the same problem.

To recap, this is the ACCESS-OM-1deg JRA55 RYF config, but with a new (KDS50) vertical level scheme. I have interpolated the initial conditions but as far as I know nothing else depends on the ocean vertical grid.

The crash is a divide by zero, but the initial traceback has no information:

Image              PC                Routine            Line        Source             
cice_auscom_360x3  000000000092C391  Unknown               Unknown  Unknown
cice_auscom_360x3  000000000092A4CB  Unknown               Unknown  Unknown
cice_auscom_360x3  00000000008D9274  Unknown               Unknown  Unknown
cice_auscom_360x3  00000000008D9086  Unknown               Unknown  Unknown
cice_auscom_360x3  0000000000857A49  Unknown               Unknown  Unknown
cice_auscom_360x3  00000000008623F9  Unknown               Unknown  Unknown
libpthread-2.12.s  00002B9E5C3CE7E0  Unknown               Unknown  Unknown
cice_auscom_360x3  0000000000624132  Unknown               Unknown  Unknown
cice_auscom_360x3  0000000000621D5C  Unknown               Unknown  Unknown
cice_auscom_360x3  00000000005F92C6  Unknown               Unknown  Unknown
cice_auscom_360x3  000000000040E75C  Unknown               Unknown  Unknown
cice_auscom_360x3  000000000040C47D  Unknown               Unknown  Unknown
cice_auscom_360x3  000000000040C41E  Unknown               Unknown  Unknown
libc-2.12.so       00002B9E5C5FAD1D  __libc_start_main     Unknown  Unknown
cice_auscom_360x3  000000000040C329  Unknown               Unknown  Unknown

even though I recompiled cice with -g. If I load the core dump with gdb, I get this info:

(gdb) where
#0  0x00002b9e5c60e495 in raise () from /lib64/libc.so.6
#1  0x00002b9e5c60fc75 in abort () from /lib64/libc.so.6
#2  0x0000000000861d4c in for__signal_handler ()
#3  <signal handler called>
#4  ice_therm_vertical::thickness_changes (nx_block=Cannot access memory at address 0x1
) at ice_therm_vertical.f90:1556
#5  0x0000000000621d5c in ice_therm_vertical::thermo_vertical (nx_block=Cannot access memory at address 0x1
) at ice_therm_vertical.f90:421
#6  0x00000000005f92c6 in ice_step_mod::step_therm1 (dt=Cannot access memory at address 0x1
) at ice_step_mod.f90:481
#7  0x000000000040e75c in ice_step () at CICE_RunMod.f90:323
#8  cice_runmod::cice_run () at CICE_RunMod.f90:180
#9  0x000000000040c47d in icemodel () at CICE.f90:57
#10 0x000000000040c41e in main ()
#11 0x00002b9e5c5fad1d in __libc_start_main () from /lib64/libc.so.6
#12 0x000000000040c329 in _start ()
(gdb) bt full                                                                                                                                  
#0  0x00002b9e5c60e495 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00002b9e5c60fc75 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000861d4c in for__signal_handler ()
No symbol table info available.
#3  <signal handler called>
No symbol table info available.
#4  ice_therm_vertical::thickness_changes (nx_block=Cannot access memory at address 0x1
) at ice_therm_vertical.f90:1556
        phi_i_mushy = 0.84999999999999998
        qbot0 = 0
        qbotp = 0
        qbotm = 0
        hstot = 0
        wk1 = 0
        qbot = 0
        ts = 0
        ti = 0
        tmlts = 0
        ij = 30936576
        j = 21206080
        i = 33728
        dzi = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
#5  0x0000000000621d5c in ice_therm_vertical::thermo_vertical (nx_block=Cannot access memory at address 0x1
) at ice_therm_vertical.f90:421
        my_task = 7
        dhi = 0
        ij = 30936576
        fadvocn = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
        iage = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)

The line number is probably not reliable (the -O2 flag is still on), so I think the crash is here:

https://github.com/OceansAus/cice5/blob/49b36d4bfb97328e818d428d0b8438144dbd69a1/source/ice_therm_vertical.F90#L1554

as qbotp = 0.

I'm guessing there is an issue with the ice initial conditions, but I don't know why changing the ocean vertical grid would impact the ice. Any ideas?

russfiedler commented 6 years ago

@aidanheerdegen Those values of i, j and ij are ludicrous and are way out of bounds. I think something else is going on here.

aidanheerdegen commented 6 years ago

I recompiled with debug on (so -O0) and bt full using gdb and the core dump gave this:

at ice_therm_vertical.f90:1556
        phi_i_mushy = 0.84999999999999998
        qbot0 = -9.2559631349317831e+61
        qbotp = -9.2559631349317831e+61
        qbotm = -9.2559631349317831e+61
        hstot = 0
        zqsnew = -9.2559631349317831e+61
        wk1 = -797388.44339828182
        hqtot = -4024092.2779668011
        qsub = -9.2559631349317831e+61
        qbot = -279501620.48217207
        ts = -9.2559631349317831e+61
        ti = -9.2559631349317831e+61
        tmlts = -0.17222207969971798
        dhs = 4.5467548571190606e-05
        dhi = -0
        k = 5
        ij = 199
        j = 248
        i = 9
        qmlt = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
        qm = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
        dzs = <error reading variable dzs (Cannot access memory at address 0x7ffc964dcd70)>
        dzi = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
        zs2 = <error reading variable zs2 (Cannot access memory at address 0x7ffc964d2000)>
        zs1 = <error reading variable zs1 (Cannot access memory at address 0x7ffc964ce630)>
        zi2 = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)
        zi1 = (( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...) ...)

In particular, it is problematic that these all have dodgy values:

        qbot0 = -9.2559631349317831e+61
        qbotp = -9.2559631349317831e+61
        qbotm = -9.2559631349317831e+61
        hstot = 0
        zqsnew = -9.2559631349317831e+61
        wk1 = -797388.44339828182
        hqtot = -4024092.2779668011
        qsub = -9.2559631349317831e+61
        qbot = -279501620.48217207
        ts = -9.2559631349317831e+61
        ti = -9.2559631349317831e+61
        tmlts = -0.17222207969971798
        dhs = 4.5467548571190606e-05
        dhi = -0
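
As a hedged aside, the value -9.2559631349317831e+61 that repeats across qbot0, qbotp, qbotm, zqsnew, qsub, ts and ti is worth decoding. The sketch below (Python, standard library only) reinterprets that double as raw bytes. It comes out as eight 0xCC bytes, which is consistent with memory that was filled with a pattern and never assigned. It does not show where the fill came from or why the variables were left unset.

```python
import struct

# Value reported by gdb for qbot0, qbotp, qbotm, zqsnew, qsub, ts and ti.
suspect = -9.2559631349317831e+61

# Reinterpret the 8-byte real as raw little-endian bytes and print them in hex.
raw = struct.pack("<d", suspect)
print(raw.hex())

# Going the other way: eight 0xCC bytes read back as a double.
filled = struct.unpack("<d", b"\xcc" * 8)[0]
print(filled == suspect)
```

The same check works for any suspicious value from a core dump: a number whose bytes are all identical is more likely a fill pattern than a computed result.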

aidanheerdegen commented 6 years ago

@russfiedler the unoptimised version gives more sensible numbers for the indices (i,j,ij).

russfiedler commented 6 years ago

Looks like a grid problem. (i,j)=(9,248) should be land (88.5E,65.2N). About 500km inland in fact...

aidanheerdegen commented 6 years ago

Well spotted @russfiedler, thanks. You're right, so it shouldn't be surprising that there are garbage values there, but why is it even trying to calculate anything there at all? I use the same model inputs (only changing the vertical grid and topog (partial cells)), and it works fine. It uses the same kmt.nc file, which is where it seems to be getting its ocean mask from. I'm a little confused.

russfiedler commented 6 years ago

Hang on, I might be wrong on the location there. I think i,j might just refer to a local i,j rather than global. What tile are we on?

aidanheerdegen commented 6 years ago

Not sure. I tried to wring that info out of the core dump with no success.

I have to run it through a debugger, but ran out of time today. Thanks for the feedback @russfiedler

aidanheerdegen commented 6 years ago

Forgot to ping @nicjhan

aidanheerdegen commented 6 years ago

Thanks to @nicjhan I have discovered it was a land/sea mask mismatch. I didn't use the most recent topography with the changes to the Bering Strait that Nic made.
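
For anyone hitting the same thing, a check along these lines would flag the mismatch before a run. This is only a sketch under stated assumptions: the masks are plain in-memory 2-D lists, any nonzero entry counts as ocean, and `mask_mismatches` is a hypothetical helper, not part of CICE or MOM. Reading the real masks from kmt.nc and the topography file is not shown.

```python
def mask_mismatches(ice_mask, ocean_mask):
    """Return (j, i) index pairs where two 2-D land/sea masks disagree.

    Both masks are lists of rows; any nonzero entry counts as ocean.
    """
    if len(ice_mask) != len(ocean_mask):
        raise ValueError("masks have different numbers of rows")
    diffs = []
    for j, (ice_row, ocn_row) in enumerate(zip(ice_mask, ocean_mask)):
        if len(ice_row) != len(ocn_row):
            raise ValueError(f"row {j} has different lengths")
        for i, (a, b) in enumerate(zip(ice_row, ocn_row)):
            if bool(a) != bool(b):
                diffs.append((j, i))
    return diffs


# Toy 3x4 masks: the ocean side treats cell (1, 2) as wet, the ice side as land.
ice = [[0, 1, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
ocean = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
print(mask_mismatches(ice, ocean))  # [(1, 2)]
```

An empty list means the two masks agree everywhere. Any entries point at cells where one component expects ocean and the other expects land.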

russfiedler commented 6 years ago

@aidanheerdegen Was the location at about (i,j)=(113,248), (x,y)=(-167,65.6)? I worked out 5 possible coordinates from the crash info above. Just checking whether I was right or barking up the wrong tree.