mpassi picard convergence failed in F-case runs

wlin7 commented 2 years ago

The failure is captured in scream F-case tests on mappy that are not run in debug mode: e.g.,

ERP_Ln22.ne4pg2_oQU480.F2010-SCREAMv1.mappy_gnu9.atmlndactive-rtm_off
ERS_Ln22.ne30pg2_EC30to60E2r2.F2010-SCREAMv1.mappy_gnu9.atmlndactive-rtm_off
PEM_Ln90.ne30pg2_EC30to60E2r2.F2010-SCREAMv1.mappy_gnu9

The runs with DEBUG mode are OK (e.g., ERS_D_Ln22.ne4pg2_oQU480.F2010-SCREAMv1.mappy_gnu9.atmlndactive-rtm_off).

The problem appears to be limited to mappy_gnu9. The same tests run OK on cori-knl_intel and ascent_gnugpu (the ERP and PEM tests can run there, but they have separate issues).

The first few lines of the error messages are

ERROR:  -------------------------------------
ERROR:
ERROR:  picard convergence failed!
ERROR:  ==========================
ERROR:
ERROR:  Surface: Tsf0, Tsf
ERROR:            0  -1.5518241246345625E-002  -1.5518241246345625E-002
ERROR:
ERROR:  Snow: zTsn0(k), zTsn(k), zqsn0(k), ks(k), Sswabs(k)
ERROR:            1   0.0000000000000000        0.0000000000000000       -110121000.00000000        5.0748471383885514E-315   1.4242374018559045E-004
ERROR:            2   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        2.8507717591509711E-004
ERROR:            3   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        2.8506068425131106E-004
ERROR:            4   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        2.8504419328220386E-004
ERROR:            5   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        2.8503306846455615E-004
ERROR:
ERROR:  Ice: zTin0(k), zTin(k), zSin0(k), zSin(k), phi(k), zqin0(k), km(k), Iswabs(k), dSdt(k)
ERROR:            1  -1.5669014377443815       -1.1483111027943425       0.28669488070650007       0.28669488070650007        1.0187631769471800E-002  -305962105.53206354        2.2820442990063059        39.671971386944676       -0.0000000000000000
ERROR:            2  -1.8969937920754070       -1.8217913690068226        1.3635776715090122        1.3635776715090122        4.0260210251142084E-002  -297531082.94632715        2.2290413794323620        2.6795040427153185       -0.0000000000000000
ERROR:            3  -1.9170322212976172       -1.8998275638189550        2.2729906124362880        2.2729906124362880        6.6433272399213275E-002  -289670991.45763475        2.1829113573963865        1.8657709229851689       -0.0000000000000000
ERROR:            4  -1.9239646293470931       -1.9145641701978748        2.7976235394494950        2.7976235394494950        8.1482310802327998E-002  -285142888.34896457        2.1563874272108969        1.4803309560091980       -0.0000000000000000
ERROR:            5  -1.9300147751069721       -1.9228302713083494        3.0534043034900389        3.0534043034900389        8.8662855449247471E-002  -282988343.21798432        2.1437317172707013        1.1763423061030871       -0.0000000000000000
ERROR:            6  -1.9358092128448650       -1.9284727307531435        3.1617892630245010        3.1617892630245010        9.1544720521249276E-002  -282130884.04962820        2.1386524300812981       0.91333060486698547       -0.0000000000000000
ERROR:            7  -1.9396863755549438       -1.9119262337646825        3.1967991129058810        3.1967991129058810        9.2379755267141359E-002  -281887150.63035828        2.1371806813416634        5.9601763175954110       -0.0000000000000000
ERROR:
ERROR:  Ice boundary: q(k)
ERROR:            0   0.0000000000000000
ERROR:            1   0.0000000000000000
ERROR:            2   0.0000000000000000
ERROR:            3   0.0000000000000000
ERROR:            4   0.0000000000000000
ERROR:            5   0.0000000000000000
ERROR:            6   0.0000000000000000
ERROR:            7   0.0000000000000000
ERROR:
ERROR:  dt:          1800.0000000000000
ERROR:  hilyr:      0.14285714285714285
ERROR:  hslyr:       2.2395386215650115E-008
ERROR:  Tbot:       -1.9452042139289627
ERROR:  fswint:      53.748709176081938
ERROR:  fswsfc:     0.95745738885343168
ERROR:  rhoa:        1.2007377975844544
ERROR:  flw:         315.67065769398960
ERROR:  potT:        279.08200226855087
ERROR:  Qa:          3.7165813980616479E-003
ERROR:  shcoef:      39.246328372928367
ERROR:  lhcoef:      107565.90315871530
ERROR:  qpond:       0.0000000000000000
ERROR:  qocn:       -7975134.9758704985
ERROR:  Spond:       0.0000000000000000
ERROR:  sss:         34.700000000000003
ERROR:  w:           0.0000000000000000
ERROR:  flwoutn:    -315.56525722151355
ERROR:  fsensn:      233.41834293222894
ERROR:  flatn:      -37.105662405730875
ERROR:  fsurfn:      197.37553838782753
ERROR:  fcondtop:    36.191168883109036
ERROR:  fcondbot:   0.99569478849638959
ERROR:  fadvheat:    0.0000000000000000
ERROR:
ERROR:  -------------------------------------
ERROR:  temperature_changes_salinity: Picard solver non-convergence (no snow)
ERROR: column_vertical_thermodynamics: ice: Vertical thermo error: picard_solver: Picard solver non-convergence
ERROR: iCell: 210611
ERROR: config_dt: 1800.00000000000
ERROR: nCategories: 5
ERROR: nIceLayers: 7
ERROR: nSnowLayers: 5
ERROR: nAerosols: 0
ERROR: openWaterArea: 0.994771899963043
ERROR: iceAreaCategoryInitial: 0.00000000000000 0.522810003695731E-02 0.00000000000000 0.00000000000000 0.00000000000000
ERROR: iceVolumeCategoryInitial: 0.00000000000000 0.522810003695731E-02 0.00000000000000 0.00000000000000 0.00000000000000
ERROR: snowVolumeCategoryInitial: 0.00000000000000 0.585426597508568E-09 0.00000000000000 0.00000000000000 0.00000000000000
ERROR: iceAreaCell: 0.522810003695731E-02

wlin7 commented 2 years ago

Hi @jonbob , @akturner , any idea what we can do about this? The error occurs after the atm model runs 2 and 3 steps for ne30pg2 and ne4pg2, respectively. The problem appears to be machine (mappy) and compiler (gnu) specific.

The problem may not be reproducible with the E3SM model (note: here it runs with EAMxx from the scream code base). A similar test with E3SM master on mappy runs fine: ERS_Ld3.ne4pg2_oQU480.F2010.mappy_gnu.eam-thetahy_sl_pg2

wlin7 commented 2 years ago

Note: scream repo's wlin/atm/fcase_with_mosart_mpassi branch can be used to reproduce this issue now that PR #1861 is reverted.

ambrad commented 2 years ago

Separate issue but for this same effort: From https://my.cdash.org/test/60628244,

svn export failed with output:  and errput svn: E170000: URL 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ice/mpas-cice/EC30to60E2r2/mpas-seaice.graph.info.200908.part.8' doesn't exist

May we run metis on Chrysalis and add this file? Or is it more complicated than that?

jonbob commented 2 years ago

@ambrad - you can run metis to generate that file -- or I can if it would help

ambrad commented 2 years ago

@jonbob Great. I'll do it. Just didn't want to write to that directory without confirmation that it's OK.

jonbob commented 2 years ago

@ambrad -- no problem with that. I think people often need new partition files and should not hesitate to make them.

wlin7 commented 2 years ago

For the failed test with mappy_gnu9, the actual fatal message appears to be the following. It could be triggered by the picard convergence failure, because the fatal message was reported when trying to create the 'abort_seaice' file for debugging.

PIO: FATAL ERROR: Aborting... An error occured, Creating file (abort_seaice_0001-01-01_00.00.00.nc) failed. Invalid iotype (PIO_IOTYPE_PNETCDF:1) specified. Available iotypes are : PIO_IOTYPE_NETCDF (2), PIO_IOTYPE_NETCDF4C (3), PIO_IOTYPE_NETCDF4P (4). Bad IO type (err=-500).

ndkeen commented 2 years ago

On pm-cpu, I was able to repeat the error noted above a few days ago, but PM scratch is still down. Using CFS today, I'm able to reproduce the error. Example of trouble case:

/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-cpu/wlinmpassi/SMS_P96.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off.20220825_144814_ocra36

Just like with https://github.com/E3SM-Project/E3SM/issues/4584, I see the odd filenames:

-rw-rw-r--  1 ndk e3sm   19874 Aug 25 14:56 'log.seaice.0081.d****.err'
-rw-rw-r--  1 ndk e3sm   76776 Aug 25 14:56  log.seaice.0083.d0544.err

These log.seaice files contain the picard error message.

The issue may be happening with threads, and only when the total number of threads is more than the number of elements.

These pass:
SMS_P16x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P32x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P42x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P72x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P96x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off

SMS_P16x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P32x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P42x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off

These fail:
SMS_P64x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P72x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P84x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P96x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off

The picard error is only present in the 96x2 case. The 64x2 case has a different error, while the 72x2 and 84x2 cases appear to be hanging. The 64x2 case fails with "Departure point is outside of halo:", the same error that is shown below for cori-knl.
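To make the threads-versus-elements pattern easier to check against the pass/fail lists above, here is a minimal Python sketch. It assumes the element count is that of a cubed sphere with ne elements per cube edge (6 * ne^2, so 96 for ne4) and that "total number of threads" means tasks times threads from the `_PNxM` part of the test name. The helper names are hypothetical and not part of CIME or SCREAM.

```python
import re

# Hypothetical helper, not part of CIME: captures tasks, threads and ne from
# a test name such as "SMS_P64x2_Ln9.ne4pg2_oQU480...".
_TEST_RE = re.compile(r"_P(\d+)x(\d+)_.*?\.ne(\d+)pg")


def cubed_sphere_elements(ne: int) -> int:
    """Element count of a cubed-sphere grid with ne elements per cube edge."""
    return 6 * ne * ne


def total_threads(test_name: str) -> int:
    """Tasks times threads, as encoded in the _PNxM part of the test name."""
    match = _TEST_RE.search(test_name)
    if match is None:
        raise ValueError(f"no _PNxM / neN fields found in {test_name!r}")
    tasks, threads, _ = map(int, match.groups())
    return tasks * threads


def oversubscribed(test_name: str) -> bool:
    """True when the test asks for more threads than there are elements."""
    match = _TEST_RE.search(test_name)
    if match is None:
        raise ValueError(f"no _PNxM / neN fields found in {test_name!r}")
    ne = int(match.group(3))
    return total_threads(test_name) > cubed_sphere_elements(ne)


if __name__ == "__main__":
    suffix = "_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off"
    for layout in ("P96x1", "P42x2", "P64x2", "P96x2"):
        name = f"SMS_{layout}{suffix}"
        print(layout, total_threads(name), oversubscribed(name))
```

With this rule, every test in the passing list has at most 96 total threads and every test in the failing list has more than 96. That is consistent with the hypothesis, though it does not by itself show that oversubscription is the cause.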

On cori-knl, I tried a few cases with GNU (uses v10, but v11 is also installed). The pass/fail pattern is similar to the one above, but the failures are all "Departure point is outside of halo:".

All cases with 1 thread are OK
SMS_P96x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off

As well as those with threads that total .le. 96
SMS_P48x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off

and these fail:
SMS_P64x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
SMS_P72x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
SMS_P96x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off

The failure on cori-knl:

 0: Atmosphere step = 1
 0:   model time = 0001-01-01 01:00:00
...
24:   what():  /global/cfs/cdirs/e3sm/ndk/wlin_atm_fcase_with_mosart_mpassi/components/homme/src/share/compose/compose_slmm_islmpi_adp.cpp:56: The condition:
24: true
24: led to the exception
24: Departure point is outside of halo:
24:   nearest point permitted: 1
24:   elem LID 0 elem GID 61 (lev, k) (71, 0) v -0.38263693981193037 0.92376728955889942 0.015587399794883159
24:   tgt_idx 3 local mesh:
24:   (mesh nnode 84 nelem 21
24:   (elem 0 ( 0 1 2 3)
24:      (p -0.28108463771482067 -0.67859834454584689 0.67859834454584711)
24:      (nml -0.92387953251128674 5.9071136065740615e-16 -0.38268343236508967)
24:      (p -0.35740674433659342 -0.35740674433659331 0.86285620946101682)
24:      (nml -3.2777276054175821e-16 -0.92387953251128663 -0.38268343236508989)
24:      (p -0.67859834454584711 -0.28108463771482017 0.67859834454584711)
24:      (nml 0.70710678118654757 6.023748705727792e-18 0.70710678118654768)
24:      (p -0.57735026918962584 -0.57735026918962562 0.57735026918962584)
24:      (nml 1.2820685881933754e-16 0.70710678118654768 0.70710678118654746)))
24:   (elem 1 ( 4 5 6 7)

jonbob commented 2 years ago

@wlin7 - I looked at one of Noel's runs on perlmutter and am curious if this is always happening at the very beginning of the run. On the case I looked at, the seaice had erred before the atm had even run, so it was reacting to whatever the cpl was sending it initially? I may not understand how SCREAM is initializing compared to EAM, but there may be some field that is not getting set before getting sent to the seaice? If that's true, can we try to debug it by turning INFO_DEBUG to 3 to try and capture what is getting passed? And maybe turning on coupler history files for every step?

wlin7 commented 2 years ago

Sounds good, @jonbob . I am going to use an existing case that ran well on chrysalis to do that: with info_debug = 3 and hist_n = 1. I will share the path of the output with you shortly.

If it is because of some missing fields, then the issue could be how the variables are initialized in non-DEBUG mode. The same test was OK on several machines.
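One cheap way to look for initialization problems in the diagnostic dump itself is to scan for subnormal doubles, which physical fields rarely produce but stray bit patterns often do. The sketch below is only a heuristic under that assumption, and the function name is hypothetical. It flags any nonzero value smaller in magnitude than the smallest normal double; a hit is a hint worth following up, not proof of uninitialized memory.

```python
import re
import sys
from typing import List

# Matches the floating-point fields printed in the dump (e.g. "5.07E-315").
# The integer layer index at the start of each row has no decimal point and
# is therefore skipped.
_FLOAT_RE = re.compile(r"[-+]?\d+\.\d+(?:[eE][-+]?\d+)?")


def subnormal_values(line: str) -> List[float]:
    """Return the nonzero values on a log line that are subnormal doubles."""
    values = (float(token) for token in _FLOAT_RE.findall(line))
    return [v for v in values if v != 0.0 and abs(v) < sys.float_info.min]


if __name__ == "__main__":
    # First snow-layer row from the error message at the top of this issue.
    snow_row = (
        "ERROR:            1   0.0000000000000000        0.0000000000000000"
        "       -110121000.00000000        5.0748471383885514E-315"
        "   1.4242374018559045E-004"
    )
    print(subnormal_values(snow_row))
```

Run over the dump at the top of this issue, the only value flagged in the snow rows is ks(1) = 5.0748471383885514E-315, while ks is exactly zero in the other four layers.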

ambrad commented 2 years ago

In the mappy nightlies, these two tests show this error:

PEM_Ln90.ne30pg2_EC30to60E2r2.F2010-SCREAMv1-MPASSI.mappy_gnu9
ERP_Ln22.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.mappy_gnu9.atmlndactive-rtm_off

AaronDonahue commented 1 year ago

@PeterCaldwell and/or @wlin7 , my understanding is that we are not focused on using MPASSI in the near future. Is this still an issue we should be tracking?

PeterCaldwell commented 1 year ago

I tasked @singhbalwinder with running SCREAMv1 in coupled mode, which will require this fix. So I guess we should leave it open.

E3SM-Project / scream

mpassi picard convergence failed in F-case runs #1878