Open wlin7 opened 2 years ago
Hi @jonbob , @akturner , any idea what we can do about this? The error occurs after the atm model runs 2 and 3 steps for ne30pg2 and ne4pg2, respectively. The problem appears to be machine (mappy) and compiler (gnu) specific.
The problem may not be reproducible with the E3SM model (note: here it runs with EAMxx from the scream code base). A similar test with E3SM master on mappy runs fine: ERS_Ld3.ne4pg2_oQU480.F2010.mappy_gnu.eam-thetahy_sl_pg2
Note: scream repo's wlin/atm/fcase_with_mosart_mpassi branch can be used to reproduce this issue now that PR #1861 is reverted.
Separate issue but for this same effort: From https://my.cdash.org/test/60628244,
svn export failed with output: and errput svn: E170000: URL 'https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ice/mpas-cice/EC30to60E2r2/mpas-seaice.graph.info.200908.part.8' doesn't exist
May we run metis on Chrysalis and add this file? Or is it more complicated than that?
@ambrad - you can run metis to generate that file -- or I can if it would help
@jonbob Great. I'll do it. Just didn't want to write to that directory without confirmation that it's OK.
@ambrad -- no problem with that. I think people often need new partition files and should not hesitate to make them.
For the failed test with mappy_gnu9, the actual fatal message appears to be the following. It could have been triggered by the picard convergence failure, because the fatal message was reported when trying to create the 'abort_seaice' file for debugging.
PIO: FATAL ERROR: Aborting... An error occured, Creating file (abort_seaice_0001-01-01_00.00.00.nc) failed. Invalid iotype (PIO_IOTYPE_PNETCDF:1) specified. Available iotypes are : PIO_IOTYPE_NETCDF (2), PIO_IOTYPE_NETCDF4C (3), PIO_IOTYPE_NETCDF4P (4). Bad IO type (err=-500).
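To make the mismatch in that message explicit, here is a minimal Python sketch that pulls the requested iotype and the list of available iotypes out of the error text and checks whether the requested one is offered. The function name `parse_iotypes` is made up for illustration and is not part of PIO. A requested type missing from the list would be consistent with a PIO build that lacks PnetCDF support, although nothing in this thread confirms that.

```python
import re

# Error text copied from the PIO message above.
PIO_MESSAGE = (
    "Invalid iotype (PIO_IOTYPE_PNETCDF:1) specified. Available iotypes are : "
    "PIO_IOTYPE_NETCDF (2), PIO_IOTYPE_NETCDF4C (3), PIO_IOTYPE_NETCDF4P (4)."
)


def parse_iotypes(message):
    """Return (requested_name, {available_name: id}) from a PIO iotype error.

    Hypothetical helper for illustration only; it relies on the message
    layout shown in this thread.
    """
    requested = re.search(r"Invalid iotype \((\w+):(\d+)\)", message)
    if requested is None:
        raise ValueError("message does not contain an 'Invalid iotype' clause")
    # Available entries look like "PIO_IOTYPE_NETCDF (2)".
    available = {
        name: int(num)
        for name, num in re.findall(r"(PIO_IOTYPE_\w+) \((\d+)\)", message)
    }
    return requested.group(1), available


requested, available = parse_iotypes(PIO_MESSAGE)
print(f"requested: {requested}")
print(f"available: {sorted(available)}")
print(f"requested type is available: {requested in available}")
```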
On pm-cpu, I was able to repeat the error noted above a few days ago, but PM scratch is still down. Using CFS today, I'm able to reproduce the error. Example of trouble case:
/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-cpu/wlinmpassi/SMS_P96.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off.20220825_144814_ocra36
Just like with https://github.com/E3SM-Project/E3SM/issues/4584, I see the odd filenames:
-rw-rw-r-- 1 ndk e3sm 19874 Aug 25 14:56 'log.seaice.0081.d****.err'
-rw-rw-r-- 1 ndk e3sm 76776 Aug 25 14:56 log.seaice.0083.d0544.err
These log.seaice files contain the picard error message.
The issue may be happening with threads, and only when the total number of threads is more than the number of elements.
These pass:
SMS_P16x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P32x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P42x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P72x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P96x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P16x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P32x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P42x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
These fail:
SMS_P64x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P72x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P84x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
SMS_P96x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off
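The pattern in the two lists can be checked mechanically. The sketch below is only an illustration: it reads the `_P<tasks>x<threads>` field and the `ne<N>` grid from each test name, assumes a cubed-sphere grid has 6 * ne * ne elements (96 for ne4), and assumes one thread when no `x<threads>` is given. The helper names are hypothetical, not CIME functions.

```python
import re


def pe_layout(test_name):
    """Return (tasks, threads) from the _P<tasks>x<threads> field."""
    match = re.search(r"_P(\d+)(?:x(\d+))?", test_name)
    if match is None:
        raise ValueError(f"no _P<tasks>x<threads> field in {test_name!r}")
    # Assumption: a missing x<threads> part means one thread per task.
    return int(match.group(1)), int(match.group(2) or 1)


def num_elements(test_name):
    """Return the element count of the ne<N> cubed-sphere grid (6 * N * N)."""
    match = re.search(r"\.ne(\d+)", test_name)
    if match is None:
        raise ValueError(f"no ne<N> grid in {test_name!r}")
    ne = int(match.group(1))
    return 6 * ne * ne


def exceeds_elements(test_name):
    """True when tasks * threads is larger than the number of elements."""
    tasks, threads = pe_layout(test_name)
    return tasks * threads > num_elements(test_name)


SUFFIX = "_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu.atmlndactive-rtm_off"
PASSING = [f"SMS_P{p}{SUFFIX}" for p in
           ("16x1", "32x1", "42x1", "72x1", "96x1", "16x2", "32x2", "42x2")]
FAILING = [f"SMS_P{p}{SUFFIX}" for p in ("64x2", "72x2", "84x2", "96x2")]

for name in PASSING + FAILING:
    tasks, threads = pe_layout(name)
    status = "pass" if name in PASSING else "FAIL"
    print(f"{tasks:3d} x {threads} = {tasks * threads:3d} threads, "
          f"{num_elements(name)} elements, {status}")
```

With the names listed above, every passing case has tasks * threads at or below 96 and every failing case is above 96, which matches the hypothesis but does not by itself prove it.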
The picard error is only present in the 96x2 case. The 64x2 case has a different error, while the 72x2 and 84x2 cases appear to hang. The 64x2 case fails with "Departure point is outside of halo", which is noted below as it also happens on cori-knl.
On cori-knl, I tried a few cases with GNU (it uses v10, but v11 is also installed). The pass/fail pattern is similar to the above, but the failures are all "Departure point is outside of halo".
All cases with 1 thread are OK
SMS_P96x1_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
as well as those whose total thread count is .le. 96:
SMS_P48x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
and these fail:
SMS_P64x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
SMS_P72x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
SMS_P96x2_Ln9.ne4pg2_oQU480.F2010-SCREAMv1.cori-knl_gnu.atmlndactive-rtm_off
The failure on cori-knl:
0: Atmosphere step = 1
0: model time = 0001-01-01 01:00:00
...
24: what(): /global/cfs/cdirs/e3sm/ndk/wlin_atm_fcase_with_mosart_mpassi/components/homme/src/share/compose/compose_slmm_islmpi_adp.cpp:56: The condition:
24: true
24: led to the exception
24: Departure point is outside of halo:
24: nearest point permitted: 1
24: elem LID 0 elem GID 61 (lev, k) (71, 0) v -0.38263693981193037 0.92376728955889942 0.015587399794883159
24: tgt_idx 3 local mesh:
24: (mesh nnode 84 nelem 21
24: (elem 0 ( 0 1 2 3)
24: (p -0.28108463771482067 -0.67859834454584689 0.67859834454584711)
24: (nml -0.92387953251128674 5.9071136065740615e-16 -0.38268343236508967)
24: (p -0.35740674433659342 -0.35740674433659331 0.86285620946101682)
24: (nml -3.2777276054175821e-16 -0.92387953251128663 -0.38268343236508989)
24: (p -0.67859834454584711 -0.28108463771482017 0.67859834454584711)
24: (nml 0.70710678118654757 6.023748705727792e-18 0.70710678118654768)
24: (p -0.57735026918962584 -0.57735026918962562 0.57735026918962584)
24: (nml 1.2820685881933754e-16 0.70710678118654768 0.70710678118654746)))
24: (elem 1 ( 4 5 6 7)
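For anyone comparing several of these failures, a small Python sketch like the following can pull the key fields out of the "elem LID ... v ..." line. The regular expression is written against the one line shown above and is an assumption about the format, not something taken from the compose code. The numeric prefix is presumably the MPI rank added by the launcher. The sketch also checks that the reported point `v` has unit length, which is what you would expect if it is a Cartesian point on the unit sphere.

```python
import math
import re

# Line copied from the cori-knl log above.
LOG_LINE = ("24: elem LID 0 elem GID 61 (lev, k) (71, 0) v "
            "-0.38263693981193037 0.92376728955889942 0.015587399794883159")

# Assumed layout, inferred from the single example line in this thread.
PATTERN = re.compile(
    r"^(?P<prefix>\d+):\s+elem LID (?P<lid>\d+) elem GID (?P<gid>\d+) "
    r"\(lev, k\) \((?P<lev>\d+), (?P<k>\d+)\) v "
    r"(?P<x>\S+) (?P<y>\S+) (?P<z>\S+)$"
)


def parse_departure_line(line):
    """Return a dict with the integer fields and the point v as a tuple."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError("line does not match the assumed format")
    fields = {key: int(match.group(key))
              for key in ("prefix", "lid", "gid", "lev", "k")}
    fields["v"] = tuple(float(match.group(axis)) for axis in "xyz")
    return fields


info = parse_departure_line(LOG_LINE)
norm = math.sqrt(sum(c * c for c in info["v"]))
print(f"prefix {info['prefix']}, elem GID {info['gid']}, lev {info['lev']}")
print(f"|v| = {norm:.9f}")
```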
@ndkeen
@wlin7 - I looked at one of Noel's runs on perlmutter and am curious if this is always happening at the very beginning of the run. On the case I looked at, the seaice had erred before the atm had even run, so it was reacting to whatever the cpl was sending it initially? I may not understand how SCREAM is initializing compared to EAM, but there may be some field that is not getting set before getting sent to the seaice? If that's true, can we try to debug it by turning INFO_DEBUG to 3 to try and capture what is getting passed? And maybe turning on coupler history files for every step?
Sounds good, @jonbob . I am going to use an existing case that ran well on chrysalis to do that: with info_debug = 3 and hist_n = 1. I will share the path of the output with you shortly.
If it is because of some missing fields, then the issue could be how the variables are initialized in non-DEBUG mode. The same test was OK on several machines.
In the mappy nightlies, these two tests show this error:
@PeterCaldwell and/or @wlin7 , my understanding is that we are not focused on using MPASSI in the near future. Is this still an issue we should be tracking?
I tasked @singhbalwinder with running SCREAMv1 in coupled mode, which will require this fix. So I guess we should leave it open.
The failure is captured in scream F-case tests on mappy that are not run in debug mode: e.g.,
The runs in DEBUG mode are OK (e.g., ERS_D_Ln22.ne4pg2_oQU480.F2010-SCREAMv1.mappy_gnu9.atmlndactive-rtm_off)
The problem appears to be limited to mappy_gnu9. Same tests run ok on cori-knl_intel, ascent_gnugpu (ERP and PEM tests can run, but there are separate issues with ERP and PEM tests)
The first few lines of the error messages are