E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
352 stars 364 forks source link

Several tests in e3sm_developer now fail in MPAS-SEAICE when using GNU v9 or higher #4584

Closed ndkeen closed 2 years ago

ndkeen commented 3 years ago

Already noted here https://github.com/E3SM-Project/E3SM/issues/4495

I just wanted to show that all 9 of these tests in e3sm_developer are failing in the same way using GNU compiler v9 or higher. This is default version on chrysalis and many new machines becoming available use gnu9 or higher.

There is little info in the e3sm.log file for the tests. Need to look in run/ dir to spot the log.seaice.* files.

ERP_Ld3.ne4_oQU240.F2010
ERS_Ld3.ne4pg2_oQU480.F2010
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST
SMS.ne4_oQU240.F2010.machine_gnu.eam-cosplite
SMS_Ln5.ne4pg2_oQU480.F2010.machine_gnu.eam-thetahy_pg2
SMS_Ln5.ne4pg2_oQU480.F2010.machine_gnu.eam-thetahy_sl_pg2
SMS_Ln5.ne4pg2_oQU480.F2010
SMS_Ln9.ne4_oQU240.F2010.machine_gnu.eam-outfrq9s
SMS_P12x2.ne4_oQU240.WCYCL1850NS.machine_gnu.allactive-mach_mods

With error message found in a file such as run/log.seaice.0039.d0001.err:

ERROR:  -------------------------------------
ERROR: 
ERROR:  picard convergence failed!
ERROR:  ==========================
ERROR: 
ERROR:  Surface: Tsf0, Tsf
ERROR:            0  -4.8157786256928752       -4.9276324350521730
ERROR: 
ERROR:  Snow: zTsn0(k), zTsn(k), zqsn0(k), ks(k), Sswabs(k)
ERROR:            1   0.0000000000000000        0.0000000000000000       -110121000.00000000        2.2950296274355850E-315   5.9330250652307008E-003
ERROR:            2   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.1977949299187558E-002
ERROR:            3   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.1963449907972840E-002
ERROR:            4   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.1948969626128606E-002
ERROR:            5   0.0000000000000000        0.0000000000000000       -110121000.00000000        0.0000000000000000        1.2107473148328565E-002
ERROR: 
ERROR:  Ice: zTin0(k), zTin(k), zSin0(k), zSin(k), phi(k), zqin0(k), km(k), Iswabs(k), dSdt(k)
ERROR:            1  -2.6506533800702092       -2.5554668737942401       0.27346241415658046       0.27346241415658046        5.8561424412998074E-003  -309390747.31284755        2.2896785489472089  
      74.758674308262414       -0.0000000000000000
ERROR:            2  -1.9620739953749862       -2.0079121203735455        1.3545060281494217        1.3545060281494217        3.8710783577716679E-002  -298130661.20091468        2.2317722439442740  
ndkeen commented 3 years ago

I am still working to understand what's happening -- so not sure if my experiments make sense yet.

rljacob commented 3 years ago

Do all those cases have the same error in the sea ice log? Because some of them use the new prescribed sea ice and others are full sea ice.

ndkeen commented 3 years ago

Yes for all of those above tests, I see ERROR: picard convergence failed! in at least one of the log.seaice* files.

proteanplanet commented 3 years ago

The "picard convergence fail" means that vertical thermodynamics is not working. Quite possibly not a problem with the sea ice model, but with the atmosphere. Check the minimum surface temperature in EAM. The last time this failure of convergence was a problem, minimum temperatures less than -270K were being produced by the atmosphere. Another possibility is in sea ice advection. I can check that. Please can you point me to the location of these tests?

jonbob commented 3 years ago

@proteanplanet - because this is occurring in F-cases, I don't think it's advection. Please note this is completely compiler-dependent and does not occur in debug mode, so it's likely some particular optimization is causing the issue. @akturner and I were working on it, but he's on vacation right now

jonbob commented 3 years ago

@proteanplanet - I'll try a D-case today and see if we run into the same problem when seaice is the only active component. Otherwise, you're right that it could be something from the atm, which is also active in all these test failures

jonbob commented 3 years ago

The D-cases fail as well. I had forgotten that @akturner and I had run them as well before digging in, but I also tried another just to be sure. So the problem is in mpassi and has to do with optimization in the gnu compiler

ndkeen commented 3 years ago

I have been slowly iterating on this and have found that I can turn off a specific optimization flag and these tests will pass. At least on chrysalis and perlmutter (which both use gnu version 9). I just add -fno-tree-pta to the fortran non-debug flags for all sources. I don't see any serious performance concerns -- would need to run longer to know. Interestingly, I can also turn off a different flag and the tests will also work: -fno-tree-dsa.

I know that we only need to add this flag to the sources in mpas-seaice, however, it may be easier to add it globally.

From the GNU documentation, these flags do:

-ftree-pta
    Perform function-local points-to analysis on trees.
    This flag is enabled by default at -O and higher.

-ftree-dse
    Perform dead store elimination (DSE) on trees. A dead store is a store into a memory location
    that is later overwritten by another store without any intervening loads. In this case the earlier
    store can be deleted. This flag is enabled by default at -O1 and higher.

I also tried -O0 -ftree-pta (ie, no optimization, except explicitly turn this one on) and the one test I was looking at would run just fine (SMS.T62_oQU120_ais20.MPAS_LISIO_TEST). In fact, I even added all of the optimization flags that would be given by -O1 on top of -O0 and it ran. The same with -O2. Which might mean there is some issue with several combinations of optimization flags. To me, these optimization do not sound like they are impacting floating-point computations.

I do not think we can say it is a compiler bug as the ultimate issue is the solver fails to converge. Or at least, it may not be easy to create a small reproducer to demonstrate this as writing out intermediate floating-point values may not show something obvious (I did try this a little bit). It seems plausible that the solver could be extra sensitive.

With the test SMS_P128x2.ne30pg2_ne30pg2.F2010-CICE.perlmutter_gnu, I ran with and without adding the -fno-tree-pta and I see 1.52 SYPD for both (which is only 5 days on 2 nodes).

jonbob commented 3 years ago

That's great work, @ndkeen. I looked around a little and there seems to be a known bug with -fno-tree-pta? At least I found this: Bug #96522... But I couldn't find much documentation of what that flag actually does, beyond the part you posted. I think adding that to the non-debug fortran flags seems like an excellent solution.

ndkeen commented 3 years ago

Huh -- I also did some google searching, but I missed that. So I would think that anything they found would be included in gcc10, right? Because I still see the solver fails with gcc10 -- I can verify if adding this compiler flag also works with gcc10.

I found a document about general "points-to analysis" here: http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/ghiya-pldi01.pdf

jonbob commented 3 years ago

It looks like that bug is in gcc10? It says it's known to fail for 10.2.0

ndkeen commented 3 years ago

I haven't yet tried gnu11. On PM, I see gcc/11.2.0

Just tried it (which involves a couple of minor changes needed to build with gnu10+) and I still see the error.

Now trying with -fno-tree-pta -- and it seems to be OK. At least for SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.

jonbob commented 3 years ago

@ndkeen - what do you think about adding the -fno-tree-pta to the gnu9/10 non-debug flags? It would be good to get these issues closed

ndkeen commented 3 years ago

We can do better by only throwing this flag for mpas-seaice/src/column/ice_shortwave.F90

jonbob commented 3 years ago

Did you figure out a way to do that for an mpas file?

ndkeen commented 3 years ago

You can just rebuild that one file.

jonbob commented 3 years ago

Automatically? Or just for testing?

ndkeen commented 3 years ago

Just for testing

akturner commented 3 years ago

Interesting that that's not the file where the error is occurring

jonbob commented 3 years ago

@ndkeen - I'm not sure anyone has the time to go through that code when it works fine using other compilers and gnu in debug mode. It might make more sense to add that flag to gnu9/10 so that these cases will work on platforms that require newer versions of gnu

akturner commented 3 years ago

@ndkeen: I guess its unclear to me how this could be NOT a compiler issue?

akturner commented 3 years ago

I'm happy to make fixes to the code, however, should definitive evidence emerge that there's an issue with the code not the compilers.

jonbob commented 3 years ago

Thanks @ndkeen for figuring out which file and which flag were causing the problem. I tested it with a hack into the Depends file and it works!

ndkeen commented 3 years ago

I made this branch and am testing. ndk/machinefiles/add-compiler-flag-for-one-mpas-source-with-gnu

This would alter compilation on that one file for all GNU versions (on all machines), but I verified it was OK on cori-knl with gnu8 (and gnu9), so maybe this is path of least resistance?

jonbob commented 3 years ago

That sounds great. If there is a performance hit at all on gnu8, I don't think it matters much since we don't really use gnu for production runs. And it's necessary for gnu9+. Thanks again for tracking this down

ndkeen commented 3 years ago

Also, I was able to rule out large sections of the code in that source file. It appears to be in the routine run_dedd. If I make a new source file, with a new module containing only that routine, it will fail unless I add the -fno-tree-pta flag.

jonbob commented 3 years ago

Sounds like you did a lot of work. I looked briefly at that routine and don't see anything that jumps out, but I really don't know that part of the code. Thanks again

jonbob commented 3 years ago

@ndkeen - how is the testing of this going? I think your fix is perfect and would be good to get in

ndkeen commented 3 years ago

I'm not confident this is the root issue, but it would take work to investigate. If we do what's easiest and add this flag to all GNU builds, there may be a fair amount of testing as not sure it will be BFB? So I'm not sure what is best path.

jonbob commented 3 years ago

We can just add it for gnu9 and gnu10 -- but the automated testing will catch non-BFB behavior if you want to try it for all gnu compilers

ndkeen commented 3 years ago

Actually, this might be a good use of the new feature of cmake macros files (right @jgfouca ?) -- we could add it for GNU major versions 9 and above (only for this file).

dqwu commented 3 years ago

I have tested ndk/machinefiles/add-compiler-flag-for-one-mpas-source-with-gnu branch on ANL GCE with GCC 11. It has fixed all e3sm_developer tests that previously failed with "picard convergence failed!" error.

rljacob commented 3 years ago

There are no production runs in progress with gnu so a bfb change would be ok.

ndkeen commented 3 years ago

Thanks for testing @dqwu -- ah, yes that's another good point Rob. Would you be OK with adding this flag (for this one file) for all GNU builds?

rljacob commented 3 years ago

Yes but using the Cmake macros feature to limit this to versions >= 9 would also work.

jgfouca commented 3 years ago

@ndkeen , @rljacob , the new cmake macro system will allow you to do version checking in the macros. In the previous system, if you needed different flags for gnu8 and gnu9, you'd have to make two entire compiler block definitions in configcompilers.xml. With the new system, you can call the compiler the same name, (gnu in this case), and add version checking to either the gnu.cmake or `gnu$machine.cmake`. If you only want these flag changes for specific files, you'll have to use the Depends file system. I think version checking should work fine there as well but we haven't tested that.

An example of a version check:

if (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER_EQUAL 10)
   string(APPEND FFLAGS " -fallow-argument-mismatch  -fallow-invalid-boz ")
 endif() 
ndkeen commented 3 years ago

Thanks Jim. I now have:

# For optimized GNU builds that use v9 or higher, remove an optimization on one file
if (NOT DEBUG)
  if (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER_EQUAL 9)
    foreach(ITEM IN LISTS MPAS_ICE_SHORTWAVE)
      e3sm_add_flags("${ITEM}" "-fno-tree-pta") # avoids an error that shows up in solver with gnu9 and higher
    endforeach()
  endif()
endif()

And I tested with current GNU (v8) and v9. It seems to work. I updated the branch.

jgfouca commented 3 years ago

@ndkeen that is great news, thanks!