E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Fail with SMS.T62_oQU120_ais20.MPAS_LISIO_TEST when trying newer version of gcc9 #4495

Closed ndkeen closed 2 years ago

ndkeen commented 3 years ago

On cori-knl, we currently use gcc/8.3.0. Both SMS.T62_oQU120_ais20.MPAS_LISIO_TEST and SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST complete with gnu using gcc/8.3.0 (though both show a 10% memory growth in the first 5 days, which may be unrelated; the Intel compiler shows even more memory growth).

I tried updating to gcc/9.3.0 and found an error with SMS.T62_oQU120_ais20.MPAS_LISIO_TEST; however, SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST still completes. Note the odd filenames:

-rw-rw-r--  1 ndk ndk     9961 Aug 26 13:13  atm.log.46099248.210826-131221
-rw-rw-r--  1 ndk ndk   229513 Aug 26 13:13 'log.seaice.0125.d****.err'
-rw-rw-r--  1 ndk ndk   238495 Aug 26 13:13 'log.seaice.0120.d****.err'
-rw-rw-r--  1 ndk ndk   516380 Aug 26 13:13 'log.seaice.0119.d****.err'
-rw-rw-r--  1 ndk ndk   572879 Aug 26 13:13 'log.seaice.0079.d****.err'
-rw-rw-r--  1 ndk ndk    65859 Aug 26 13:13 'log.seaice.0078.d****.err'
-rw-rw-r--  1 ndk ndk   488992 Aug 26 13:13 'log.seaice.0076.d****.err'
-rw-rw-r--  1 ndk ndk   197716 Aug 26 13:13 'log.seaice.0063.d****.err'
-rw-rw-r--  1 ndk ndk   371334 Aug 26 13:13  log.seaice.0062.d0480.err
-rw-rw-r--  1 ndk ndk    16877 Aug 26 13:13 'log.seaice.0054.d****.err'
-rw-rw-r--  1 ndk ndk   235309 Aug 26 13:13 'log.seaice.0049.d****.err'
-rw-rw-r--  1 ndk ndk   460980 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_55.nc
-rw-rw-r--  1 ndk ndk   498144 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_51.nc
-rw-rw-r--  1 ndk ndk   447236 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_123.nc
-rw-rw-r--  1 ndk ndk   504648 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_121.nc
-rw-rw-r--  1 ndk ndk   498144 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_124.nc
-rw-rw-r--  1 ndk ndk  1042966 Aug 26 13:13 'log.seaice.0124.d****.err'
-rw-rw-r--  1 ndk ndk   650986 Aug 26 13:13 'log.seaice.0123.d****.err'
-rw-rw-r--  1 ndk ndk   616144 Aug 26 13:13 'log.seaice.0121.d****.err'
-rw-rw-r--  1 ndk ndk   416271 Aug 26 13:13 'log.seaice.0055.d****.err'
-rw-rw-r--  1 ndk ndk   192515 Aug 26 13:13  log.seaice.0051.d0480.err
-rw-rw-r--  1 ndk ndk  2346468 Aug 26 13:13 'log.seaice.0050.d****.err'
-rw-rw-r--  1 ndk ndk   489960 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00_block_50.nc
-rw-rw-r--  1 ndk ndk 37319108 Aug 26 13:13  abort_seaice_0001-01-01_00.00.00.nc
drwxrwxr-x  3 ndk ndk     4096 Aug 26 13:13  ./
-rw-rw-r--  1 ndk ndk    23623 Aug 26 13:13  e3sm.log.46099248.210826-131221
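
A minimal sketch for separating the well-formed error logs from the ones with asterisks in their names, so the affected tasks can be listed quickly. The regular expression is inferred only from the names in the listing above, and `classify` is a hypothetical helper, not part of E3SM or MPAS. Runs of asterisks are what Fortran formatted output produces when a value does not fit its field width; whether that is the cause here is not established in this thread.

```python
import re

# Assumed pattern, inferred from the listing: log.seaice.<task>[.d<field>].err
LOG_NAME = re.compile(
    r"^log\.seaice\.(?P<task>\d{4})(?:\.d(?P<field>\d+|\*+))?\.err$"
)

def classify(names):
    """Split file names into well-formed logs, garbled logs, and everything else."""
    ok, garbled, other = [], [], []
    for name in names:
        match = LOG_NAME.match(name)
        if match is None:
            other.append(name)          # not a seaice error log at all
        elif match.group("field") and "*" in match.group("field"):
            garbled.append(name)        # the d-field was written as asterisks
        else:
            ok.append(name)
    return ok, garbled, other

# A few names copied from the listing above.
names = [
    "atm.log.46099248.210826-131221",
    "log.seaice.0125.d****.err",
    "log.seaice.0062.d0480.err",
    "log.seaice.0051.d0480.err",
    "log.seaice.0050.d****.err",
    "abort_seaice_0001-01-01_00.00.00_block_55.nc",
]

ok, garbled, other = classify(names)
print("well-formed:", ok)
print("garbled:    ", garbled)
```

In a real run directory, `names` could come from `os.listdir(run_dir)` instead of the hard-coded list.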

Fails during init.

Looking inside one of the log.seaice*err files:

[THREAD 0001]ERROR:  lhcoef:      54296.582095907586
[THREAD 0000]ERROR:  -------------------------------------
[THREAD 0001]ERROR:  qpond:       0.0000000000000000
[THREAD 0000]ERROR: 
[THREAD 0001]ERROR:  qocn:       -7773474.9656276256
[THREAD 0000]ERROR:  picard convergence failed!
[THREAD 0001]ERROR:  Spond:       0.0000000000000000
[THREAD 0000]ERROR:  ==========================
[THREAD 0001]ERROR:  sss:         33.852275464508651
[THREAD 0000]ERROR: 
[THREAD 0001]ERROR:  w:           0.0000000000000000
[THREAD 0000]ERROR:  Surface: Tsf0, Tsf
[THREAD 0001]ERROR:  flwoutn:    -283.68549129504186
[THREAD 0000]ERROR:            0  -5.4520369009510725       -6.8676374492563470
[THREAD 0001]ERROR:  fsensn:      158.75579219638533
[THREAD 0000]ERROR: 
[THREAD 0001]ERROR:  flatn:       50.234602563188375
[THREAD 0000]ERROR:  Snow: zTsn0(k), zTsn(k), zqsn0(k), ks(k), Sswabs(k)
[THREAD 0001]ERROR:  fsurfn:      206.11199869258780
[THREAD 0000]ERROR:            1   0.0000000000000000        0.0000000000000000       -110121000.00000000        2.4199364581990490E-315  0.43414662542892668
[THREAD 0001]ERROR:  fcondtop:    206.11199869258789
[THREAD 0000]ERROR:            2   0.0000000000000000        0.0000000000000000       -110121000.00000000        6.3240402667679558E-322  0.79372940882439558
[THREAD 0001]ERROR:  fcondbot:   -73.065524370972966
[THREAD 0000]ERROR:            3   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000       0.80608108835141035
[THREAD 0001]ERROR:  fadvheat:    0.0000000000000000
[THREAD 0000]ERROR:            4   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000       0.66691269675390585
[THREAD 0001]ERROR: 
[THREAD 0000]ERROR:            5   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000       0.54871535007760019
[THREAD 0001]ERROR:  -------------------------------------
[THREAD 0000]ERROR: 
[THREAD 0001]ERROR:  temperature_changes_salinity: Picard solver non-convergence (no snow)
[THREAD 0000]ERROR:  Ice: zTin0(k), zTin(k), zSin0(k), zSin(k), phi(k), zqin0(k), km(k), Iswabs(k), dSdt(k)
[THREAD 0000]ERROR:            1  -16.018981433548070       -9.0917464307194447       0.29632209767504536       0.29632209767504536        1.5986555158525879E-003  -336670418.42955548        2.2971823696533096        41.046411482476906       -0.0000000000000000
[THREAD 0001]ERROR: column_vertical_thermodynamics: ice: Vertical thermo error: picard_solver: Picard solver non-convergence
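
The two threads write to the same file, so their messages are interleaved line by line in the excerpt above. A minimal sketch of regrouping such a log by thread tag follows; the `[THREAD nnnn]ERROR:` prefix is taken from the excerpt, while `split_by_thread` is a hypothetical helper and the handling of untagged continuation lines (appended to the most recent tagged line) is an assumption.

```python
import re
from collections import defaultdict

# Prefix format as it appears in the excerpt above.
THREAD_LINE = re.compile(r"^\[THREAD (\d+)\]ERROR:(.*)$")

def split_by_thread(lines):
    """Group interleaved log lines by thread number, preserving order per thread."""
    per_thread = defaultdict(list)
    last = None  # thread that owns the most recent tagged line
    for line in lines:
        match = THREAD_LINE.match(line)
        if match:
            last = int(match.group(1))
            per_thread[last].append(match.group(2).strip())
        elif last is not None and per_thread[last]:
            # Assumption: an untagged line continues the previous tagged line.
            per_thread[last][-1] += " " + line.strip()
    return dict(per_thread)

# Four lines copied from the excerpt above.
sample = [
    "[THREAD 0001]ERROR:  qocn:       -7773474.9656276256",
    "[THREAD 0000]ERROR:  picard convergence failed!",
    "[THREAD 0001]ERROR:  Spond:       0.0000000000000000",
    "[THREAD 0000]ERROR:  ==========================",
]

threads = split_by_thread(sample)
for thread_id in sorted(threads):
    print(f"--- thread {thread_id} ---")
    for text in threads[thread_id]:
        print(text)
```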

The backtrace is not useful:
121: Program received signal SIGABRT: Process abort signal.
121: 
121: Backtrace for this error:
121: #0  0x154b1bf in ???
121:    at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
121: #1  0x19eea20 in raise
121:    at ../sysdeps/unix/sysv/linux/raise.c:51
121: #2  0x1aa6940 in abort
121:    at /home/abuild/rpmbuild/BUILD/glibc-2.26/stdlib/abort.c:79
121: #3  0x1329c81 in ???
121: #4  0x12f817a in ???
ndkeen commented 3 years ago

On cori-haswell, the same test also fails when I try using gcc/9.3.0. However, at least the filenames aren't weird. Here is a message from one of the log.seaice*err files:

----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task      52 of      64
    Opened at 2021/08/26 14:32:51
----------------------------------------------------------------------

ERROR:  -------------------------------------
ERROR: 
ERROR:  picard convergence failed!
ERROR:  ==========================
ERROR: 
ERROR:  Surface: Tsf0, Tsf
ERROR:            0  -3.9155511566211860       -8.1992955785862041
ERROR: 
ERROR:  Snow: zTsn0(k), zTsn(k), zqsn0(k), ks(k), Sswabs(k)
ERROR:            1   0.0000000000000000        0.0000000000000000       -110121000.00000000        2.8855013936690457E-315   6.0813540619779087E-003
ERROR:            2   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.2205973969978724E-002
ERROR:            3   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.4184852943987516E-002
ERROR:            4   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.4172771222996426E-002
ERROR:            5   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.4176870291731538E-002
ERROR: 
ERROR:  Ice: zTin0(k), zTin(k), zSin0(k), zSin(k), phi(k), zqin0(k), km(k), Iswabs(k), dSdt(k)
ERROR:            1  -15.943060879665545       -9.8200207555060697       0.29457371164102458       0.29457371164102458        1.5936914836161479E-003  -336524102.12608898        2.2971911187601264        59.372514414292624       -0.0000000000000000
ERROR:            2  -14.783856749613904       -11.130389257538972        1.3902275385133005        1.3902275385133005        7.8624412443636972E-003  -332551250.66417003        2.2861424473068088        9.0199110846378012       -0.0000000000000000
ERROR:            3  -12.851706376301790       -10.730683325412805        2.2997965722540088        2.2997965722540088        1.4100777439189024E-002  -327031215.22958064        2.2751473797634292        5.3488797815183728       -0.0000000000000000
ERROR:            4  -10.535219881415051       -9.3230185284908842        2.8162619364877655        2.8162619364877655        1.9285407108742349E-002  -320994551.22147530        2.2660094699708413        3.3436685698739375       -0.0000000000000000
ERROR:            5  -7.9997675009115170       -7.3347754312024422        3.0636821823713998        3.0636821823713998        2.4212126760918416E-002  -314543815.93350124        2.2573261265838811        2.2229553594109261       -0.0000000000000000
ERROR:            6  -4.7610297889587452       -5.0141592526030827        3.6087679131069565        3.6087679131069565        4.4625030703673858E-002  -302049792.14898217        2.2213483833847745        1.5721067256749337       -0.0000000000000000
ERROR:            7  -2.3437151243269518       -2.7311241160277553        6.9764388293998980        6.9710500369156359       0.16805101643073556       -259979309.73695552        2.0038100835408286        21.144585959321155       -1.4968868011838672E-006
ERROR: 
ERROR:  Ice boundary: q(k)
ERROR:            0   0.0000000000000000
ERROR:            1   0.0000000000000000
ERROR:            2   0.0000000000000000
ERROR:            3   0.0000000000000000
ERROR:            4   0.0000000000000000
ERROR:            5   0.0000000000000000
ERROR:            6   0.0000000000000000
ERROR:            7   0.0000000000000000
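
In the snow rows of the log above, the ks(k) column for layer 1 holds 2.8855013936690457E-315, which is below the smallest normal double-precision value. A minimal sketch for flagging such subnormal entries when scanning these dumps is given here; the column position of ks(k) is read off the header line in the log, and `is_subnormal` and `flag_subnormal_ks` are hypothetical helpers. Whether these values play any role in the failure is not established in this thread.

```python
import sys

def is_subnormal(x):
    """True for nonzero doubles smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

def flag_subnormal_ks(rows):
    """Return (layer, ks) pairs whose ks value is subnormal.

    Each row is 'k zTsn0 zTsn zqsn0 ks Sswabs', following the header in the log.
    """
    flagged = []
    for row in rows:
        fields = row.split()
        layer = int(fields[0])
        ks = float(fields[4])
        if is_subnormal(ks):
            flagged.append((layer, ks))
    return flagged

# Snow rows 1 and 2 copied from the log above (without the ERROR: prefix).
snow_rows = [
    "1   0.0000000000000000        0.0000000000000000       -110121000.00000000"
    "        2.8855013936690457E-315   6.0813540619779087E-003",
    "2   0.0000000000000000        0.0000000000000000       -110121000.00000000"
    "        0.0000000000000000        1.2205973969978724E-002",
]

flagged = flag_subnormal_ks(snow_rows)
print(flagged)
```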
ndkeen commented 3 years ago

Same behaviour on chrysalis, but you can reproduce it without any changes, i.e., it fails with the current GNU version (9.2.0): SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chrysalis_gnu

minxu74 commented 3 years ago

The same error happened on Summit with gcc/9.1.0. The other tests that failed with the same errors on Summit are:

ndkeen commented 3 years ago

Can reproduce this error on cori-knl with the branch ndk/machinefiles/cori-test-gnu9, which adds gnu9 and gnu10 as compiler options.

dqwu commented 3 years ago

This issue can also be reproduced with GCC 9.1.0 on Summit.

Merge the minxu74/summit/upd_softenv_compilers branch and then run an ne4 F case (--compset F2010 --res ne4_oQU240) with the gnu compiler.

Output files contain: abort_seaice_0001-01-01_00.00.00.nc abort_seaice_0001-01-01_00.00.00_block_13.nc 'log.seaice.0001.d.err' 'log.seaice.0007.d.err' 'log.seaice.0013.d****.err'

jonbob commented 3 years ago

I ran some tests and can confirm it fails SMS_Ld3.T62_oQU120.DTESTM.cori-knl_gnu10, in which only seaice is active. It passes SMS_Ld3_D.T62_oQU120.DTESTM.cori-knl_gnu10, so it appears to have something to do with optimization. Also, SMS_Ld3.T62_oQU120.CMPASO-IAF.cori-knl_gnu10 runs fine, so apparently it's not the MPAS components or framework but something specific to mpassi.
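
The three test names above differ only in the `_D` modifier and the compset, which is what points at optimization and at mpassi. A minimal sketch that splits the names into their parts makes that comparison explicit. It assumes the usual CIME layout `TYPE[_MODIFIERS].GRID.COMPSET[.MACHINE_COMPILER]`, with `_D` marking a debug build; `parse_test_name` is a hypothetical helper, not a CIME function, and the pass/fail values are the ones reported in the comment above.

```python
def parse_test_name(name):
    """Split a CIME-style test name into its dot-separated parts (assumed layout)."""
    parts = name.split(".")
    test_type, *modifiers = parts[0].split("_")
    info = {
        "type": test_type,
        "modifiers": modifiers,
        "debug": "D" in modifiers,   # _D is assumed to mean a debug build
        "grid": parts[1],
        "compset": parts[2],
    }
    if len(parts) > 3:
        # The machine name may contain '-', so split on the last underscore only.
        info["machine"], info["compiler"] = parts[3].rsplit("_", 1)
    return info

# Outcomes as reported in the comment above.
results = {
    "SMS_Ld3.T62_oQU120.DTESTM.cori-knl_gnu10": "FAIL",
    "SMS_Ld3_D.T62_oQU120.DTESTM.cori-knl_gnu10": "PASS",
    "SMS_Ld3.T62_oQU120.CMPASO-IAF.cori-knl_gnu10": "PASS",
}

for name, outcome in results.items():
    info = parse_test_name(name)
    print(f"{outcome}  compset={info['compset']:<11} debug={info['debug']}")
```

Only the optimized (non-debug) build of the seaice-only compset fails, which matches the conclusion drawn in the comment.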

ndkeen commented 3 years ago

I also see "ERROR: picard convergence failed!" with a simple F case (using the new compset/grid combo ne30pg2_EC30to60E2r2.F2010) when using gnu version 9.

Same with new hires F case compset/grid ne120pg2_r05_EC30to60E2r2.F2010

Note that the changes in the branch azamat/benching/update-v2-grids-compsets are currently needed to launch this compset/grid.

Should we make a different issue to document the odd filenames being written when it fails?

-rw-rw-r--  1 ndk ndk   1321201 Sep 28 14:14  log.seaice.0978.err
-rw-rw-r--  1 ndk ndk    114293 Sep 28 14:14 'log.seaice.0976.d****.err'
-rw-rw-r--  1 ndk ndk    412916 Sep 28 14:14 'log.seaice.0962.d****.err'
-rw-rw-r--  1 ndk ndk   1305170 Sep 28 14:14 'log.seaice.0960.d****.err'
jonbob commented 3 years ago

Yes, I think anything that uses mpas-seaice is going to have trouble until we figure this out. I've engaged the developers and they are working on it.

ndkeen commented 3 years ago

OK. I also verified that running ne30pg2_EC30to60E2r2.F2010 with the branch azamat/benching/update-v2-grids-compsets will fail on chrysalis using gnu compiler (which uses v9.1).

rljacob commented 3 years ago

Does this happen with gnu10?

jonbob commented 3 years ago

Yes, it also fails with gnu10.

ndkeen commented 2 years ago

Closing, as this was the same issue as #4584, which is closed with PR #4672.