Closed ndkeen closed 2 years ago
On cori-haswell, the same test also fails when I try using gcc/9.3.0. However, at least the filenames aren't weird. Here is a message from one of the log.seaice*err files:
----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task 52 of 64
Opened at 2021/08/26 14:32:51
----------------------------------------------------------------------
ERROR: -------------------------------------
ERROR:
ERROR: picard convergence failed!
ERROR: ==========================
ERROR:
ERROR: Surface: Tsf0, Tsf
ERROR: 0 -3.9155511566211860 -8.1992955785862041
ERROR:
ERROR: Snow: zTsn0(k), zTsn(k), zqsn0(k), ks(k), Sswabs(k)
ERROR: 1 0.0000000000000000 0.0000000000000000 -110121000.00000000 2.8855013936690457E-315 6.0813540619779087E-003
ERROR: 2 0.0000000000000000 0.0000000000000000 -110121000.00000000 0.0000000000000000 1.2205973969978724E-002
ERROR: 3 0.0000000000000000 0.0000000000000000 -110121000.00000000 0.0000000000000000 1.4184852943987516E-002
ERROR: 4 0.0000000000000000 0.0000000000000000 -110121000.00000000 0.0000000000000000 1.4172771222996426E-002
ERROR: 5 0.0000000000000000 0.0000000000000000 -110121000.00000000 0.0000000000000000 1.4176870291731538E-002
ERROR:
ERROR: Ice: zTin0(k), zTin(k), zSin0(k), zSin(k), phi(k), zqin0(k), km(k), Iswabs(k), dSdt(k)
ERROR: 1 -15.943060879665545 -9.8200207555060697 0.29457371164102458 0.29457371164102458 1.5936914836161479E-003 -336524102.12608898 2.2971911187601264 59.372514414292624 -0.0000000000000000
ERROR: 2 -14.783856749613904 -11.130389257538972 1.3902275385133005 1.3902275385133005 7.8624412443636972E-003 -332551250.66417003 2.2861424473068088 9.0199110846378012 -0.0000000000000000
ERROR: 3 -12.851706376301790 -10.730683325412805 2.2997965722540088 2.2997965722540088 1.4100777439189024E-002 -327031215.22958064 2.2751473797634292 5.3488797815183728 -0.0000000000000000
ERROR: 4 -10.535219881415051 -9.3230185284908842 2.8162619364877655 2.8162619364877655 1.9285407108742349E-002 -320994551.22147530 2.2660094699708413 3.3436685698739375 -0.0000000000000000
ERROR: 5 -7.9997675009115170 -7.3347754312024422 3.0636821823713998 3.0636821823713998 2.4212126760918416E-002 -314543815.93350124 2.2573261265838811 2.2229553594109261 -0.0000000000000000
ERROR: 6 -4.7610297889587452 -5.0141592526030827 3.6087679131069565 3.6087679131069565 4.4625030703673858E-002 -302049792.14898217 2.2213483833847745 1.5721067256749337 -0.0000000000000000
ERROR: 7 -2.3437151243269518 -2.7311241160277553 6.9764388293998980 6.9710500369156359 0.16805101643073556 -259979309.73695552 2.0038100835408286 21.144585959321155 -1.4968868011838672E-006
ERROR:
ERROR: Ice boundary: q(k)
ERROR: 0 0.0000000000000000
ERROR: 1 0.0000000000000000
ERROR: 2 0.0000000000000000
ERROR: 3 0.0000000000000000
ERROR: 4 0.0000000000000000
ERROR: 5 0.0000000000000000
ERROR: 6 0.0000000000000000
ERROR: 7 0.0000000000000000
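Since each MPI task writes its own error log, it can help to see at a glance which tasks hit the failure. The following is a minimal sketch, not part of MPAS or CIME: it assumes the header line and the error marker look exactly as in the excerpt above, and the function names and the run directory argument are hypothetical.

```python
import re
from pathlib import Path

# Header line format taken from the log excerpt above.
HEADER_RE = re.compile(r"Error Log File for task\s+(\d+)\s+of\s+(\d+)")
# Failure marker as it appears in the log excerpt above.
MARKER = "picard convergence failed!"


def scan_log_text(text):
    """Return (task, ntasks, failed) for the contents of one error log."""
    match = HEADER_RE.search(text)
    if match:
        task, ntasks = int(match.group(1)), int(match.group(2))
    else:
        # No recognizable header; the task number is unknown.
        task, ntasks = None, None
    return task, ntasks, MARKER in text


def scan_run_dir(run_dir):
    """Map each log.seaice*err file name in run_dir to its scan result.

    run_dir is a hypothetical path to the case run directory.
    """
    results = {}
    for path in sorted(Path(run_dir).glob("log.seaice*err")):
        results[path.name] = scan_log_text(path.read_text(errors="replace"))
    return results
```

Calling `scan_run_dir` on the run directory returns one entry per error log, so the failing task numbers can be compared across machines and compiler versions.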
Same behaviour on chrysalis, but you can reproduce it without any changes -- i.e. it fails with the current GNU version (9.2.0).
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chrysalis_gnu
The same error happened on Summit with gcc/9.1.0. The other tests that failed with the same errors on Summit are:
I can reproduce this error on cori-knl with the branch ndk/machinefiles/cori-test-gnu9, which adds gnu9 and gnu10 as compiler options.
This issue can also be reproduced with GCC 9.1.0 on Summit.
Merge the minxu74/summit/upd_softenv_compilers branch and then run an ne4 F case (--compset F2010 --res ne4_oQU240) with the gnu compiler.
Output files contain: abort_seaice_0001-01-01_00.00.00.nc abort_seaice_0001-01-01_00.00.00_block_13.nc 'log.seaice.0001.d.err' 'log.seaice.0007.d.err' 'log.seaice.0013.d****.err'
I ran some tests and can confirm that SMS_Ld3.T62_oQU120.DTESTM.cori-knl_gnu10, in which only seaice is active, fails. SMS_Ld3_D.T62_oQU120.DTESTM.cori-knl_gnu10 passes, so it appears to be related to optimization. Also, SMS_Ld3.T62_oQU120.CMPASO-IAF.cori-knl_gnu10 runs fine, so apparently it's not the MPAS components or framework but something specific to mpassi.
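To keep this kind of comparison straight, here is a small sketch that splits the test names into their parts and tabulates the three results reported in this comment. It is only an illustration: it assumes each name has exactly four dot-separated fields (test type with modifiers, grid, compset, machine_compiler) and that the `_D` modifier marks a debug build. `CaseName`, `parse_test_name` and `summarize` are hypothetical helpers, not CIME functions.

```python
from typing import NamedTuple, Tuple


class CaseName(NamedTuple):
    test_type: str
    modifiers: Tuple[str, ...]
    grid: str
    compset: str
    machine: str
    compiler: str

    @property
    def debug(self):
        # Assumption: the "D" modifier marks a debug (unoptimized) build.
        return "D" in self.modifiers


def parse_test_name(name):
    """Split a name such as SMS_Ld3_D.T62_oQU120.DTESTM.cori-knl_gnu10."""
    case, grid, compset, target = name.split(".")
    test_type, *modifiers = case.split("_")
    # The compiler is whatever follows the last underscore of the final field.
    machine, compiler = target.rsplit("_", 1)
    return CaseName(test_type, tuple(modifiers), grid, compset, machine, compiler)


# Results exactly as reported in this comment.
REPORTED = {
    "SMS_Ld3.T62_oQU120.DTESTM.cori-knl_gnu10": "FAIL",
    "SMS_Ld3_D.T62_oQU120.DTESTM.cori-knl_gnu10": "PASS",
    "SMS_Ld3.T62_oQU120.CMPASO-IAF.cori-knl_gnu10": "PASS",
}


def summarize(results):
    """Return (compset, build, status) rows in the order the results were given."""
    rows = []
    for name, status in results.items():
        parsed = parse_test_name(name)
        build = "debug" if parsed.debug else "optimized"
        rows.append((parsed.compset, build, status))
    return rows
```

With the three results above, `summarize(REPORTED)` shows the only failure is the optimized DTESTM build, which is the pattern that points at optimization in the sea-ice code.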
I also see the ERROR: picard convergence failed! message using a simple F case (one that uses the new compset/grid combo ne30pg2_EC30to60E2r2.F2010) when using gnu version 9.
Same with the new hires F case compset/grid ne120pg2_r05_EC30to60E2r2.F2010.
Note that the changes in the branch azamat/benching/update-v2-grids-compsets are currently needed to launch this compset/grid.
Should we make a different issue to document the odd filenames being written when it fails?
-rw-rw-r-- 1 ndk ndk 1321201 Sep 28 14:14 log.seaice.0978.err
-rw-rw-r-- 1 ndk ndk 114293 Sep 28 14:14 'log.seaice.0976.d****.err'
-rw-rw-r-- 1 ndk ndk 412916 Sep 28 14:14 'log.seaice.0962.d****.err'
-rw-rw-r-- 1 ndk ndk 1305170 Sep 28 14:14 'log.seaice.0960.d****.err'
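The single quotes in the listing above are presumably added by `ls` because of the asterisks, which also make these files awkward to handle in a shell. As a sketch for picking them out of a directory listing (the helper names are hypothetical, and treating any name containing `*` as "odd" is an assumption based only on the examples in this thread):

```python
import shlex


def odd_log_names(names):
    """Return the file names that contain a literal asterisk."""
    return [name for name in names if "*" in name]


def shell_safe(name):
    """Quote a file name so a POSIX shell will not expand the asterisks."""
    return shlex.quote(name)


# File names copied from the listing above.
LISTED = [
    "log.seaice.0978.err",
    "log.seaice.0976.d****.err",
    "log.seaice.0962.d****.err",
    "log.seaice.0960.d****.err",
]

for name in odd_log_names(LISTED):
    print(shell_safe(name))
```

Passing the quoted form to commands such as `cat` or `rm` avoids the shell treating `****` as a glob pattern.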
Yes, I think anything that uses mpas-seaice is going to have trouble until we figure this out. I've engaged the developers and they are working on it.
OK. I also verified that running ne30pg2_EC30to60E2r2.F2010 with the branch azamat/benching/update-v2-grids-compsets will fail on chrysalis using the gnu compiler (which uses v9.1).
Does this happen with gnu10?
Yes, it also fails with gnu10.
Closing, as this was the same issue as #4584, which is closed with PR #4672.
On cori-knl, we currently use gcc/8.3.0. Both SMS.T62_oQU120_ais20.MPAS_LISIO_TEST and SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST complete with gnu using gcc/8.3.0 (though both show a 10% memory growth in the first 5 days, which may be unrelated -- there is even more memory growth with the Intel compiler).
I tried updating to gcc/9.3.0 and found an error with SMS.T62_oQU120_ais20.MPAS_LISIO_TEST; however, SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST still completes. Note the odd filenames: