Closed ndkeen closed 2 years ago
I am still working to understand what's happening -- so not sure if my experiments make sense yet.
Do all those cases have the same error in the sea ice log? Because some of them use the new prescribed sea ice and others are full sea ice.
Yes, for all of the tests above, I see `ERROR: picard convergence failed!` in at least one of the log.seaice* files.
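For anyone reproducing this, a minimal sketch of how the affected logs could be located. It assumes the logs sit directly in a case's run directory and are named `log.seaice*`, as described in this thread; the function name `find_picard_failures` is hypothetical and not part of E3SM or CIME.

```python
from pathlib import Path

# Lower-cased substring of the error message quoted in this thread.
ERROR_TEXT = "picard convergence failed"


def find_picard_failures(run_dir):
    """Return the names of log.seaice* files in run_dir that contain the error."""
    hits = []
    for path in sorted(Path(run_dir).glob("log.seaice*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        text = path.read_text(errors="replace")
        if ERROR_TEXT in text.lower():
            hits.append(path.name)
    return hits
```

Calling `find_picard_failures("run")` from a case directory would list only the sea ice logs that hit the failure, which is quicker than reading `e3sm.log`.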
The "picard convergence fail" means that vertical thermodynamics is not working. Quite possibly not a problem with the sea ice model, but with the atmosphere. Check the minimum surface temperature in EAM. The last time this failure of convergence was a problem, minimum temperatures less than -270K were being produced by the atmosphere. Another possibility is in sea ice advection. I can check that. Please can you point me to the location of these tests?
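For readers unfamiliar with the term: a Picard iteration is a fixed-point iteration that repeatedly applies an update `x = g(x)` until successive iterates stop changing. The sketch below is a generic scalar version, not the MPAS-Seaice thermodynamics solver; `picard_solve`, its tolerance, and its iteration limit are all illustrative assumptions. It shows why bad inputs or a miscompiled update both surface as the same "failed to converge" symptom.

```python
import math


def picard_solve(g, x0, tol=1e-10, max_iter=100):
    """Iterate x = g(x) until two successive iterates differ by at most tol.

    Returns (solution, iterations_used). Raises RuntimeError if the
    iteration produces a non-finite value or runs out of iterations.
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        x_new = g(x)
        if not math.isfinite(x_new):
            raise RuntimeError("picard convergence failed: non-finite iterate")
        if abs(x_new - x) <= tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("picard convergence failed: iteration limit reached")
```

For example, `picard_solve(math.cos, 1.0)` converges, while an update such as `lambda x: 2.0 * x + 1.0` never settles and raises the error.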
@proteanplanet - because this is occurring in F-cases, I don't think it's advection. Please note this is completely compiler-dependent and does not occur in debug mode, so it's likely some particular optimization is causing the issue. @akturner and I were working on it, but he's on vacation right now
@proteanplanet - I'll try a D-case today and see if we run into the same problem when seaice is the only active component. Otherwise, you're right that it could be something from the atm, which is also active in all these test failures
The D-cases fail as well. I had forgotten that @akturner and I had run them as well before digging in, but I also tried another just to be sure. So the problem is in mpassi and has to do with optimization in the gnu compiler
I have been slowly iterating on this and have found that I can turn off a specific optimization flag and these tests will pass, at least on chrysalis and perlmutter (which both use gnu version 9). I just add `-fno-tree-pta` to the Fortran non-debug flags for all sources. I don't see any serious performance concerns -- would need to run longer to know. Interestingly, I can also turn off a different flag and the tests will also work: `-fno-tree-dse`.
I know that we only need to add this flag to the sources in mpas-seaice; however, it may be easier to add it globally.
From the GNU documentation, these flags do:
-ftree-pta
Perform function-local points-to analysis on trees.
This flag is enabled by default at -O and higher.
-ftree-dse
Perform dead store elimination (DSE) on trees. A dead store is a store into a memory location
that is later overwritten by another store without any intervening loads. In this case the earlier
store can be deleted. This flag is enabled by default at -O1 and higher.
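To make the dead-store definition above concrete, here is a toy sketch that applies it to a recorded sequence of memory operations. It is only an illustration of the definition, not how GCC implements the pass; the trace format and the name `find_dead_stores` are hypothetical. It also assumes every location name is distinct and never aliased, which a real compiler has to establish separately.

```python
def find_dead_stores(trace):
    """Return indices of stores that are overwritten before being loaded.

    trace is a list of (op, location) pairs, where op is "store" or "load".
    A store is dead if a later store hits the same location with no load
    of that location in between.
    """
    pending = {}  # location -> index of the latest store not yet loaded
    dead = []
    for index, (op, location) in enumerate(trace):
        if op == "load":
            # The pending store (if any) has been observed, so it is live.
            pending.pop(location, None)
        elif op == "store":
            if location in pending:
                dead.append(pending[location])
            pending[location] = index
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return dead
```

In the trace `[("store", "a"), ("store", "a"), ("load", "a")]` only the first store (index 0) is dead, because the second store overwrites it before anything reads `a`.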
I also tried `-O0 -ftree-pta` (i.e., no optimization, except explicitly turning this one on) and the one test I was looking at would run just fine (`SMS.T62_oQU120_ais20.MPAS_LISIO_TEST`). In fact, I even added all of the optimization flags that would be given by `-O1` on top of `-O0` and it ran. The same with `-O2`. This might mean there is some issue with several combinations of optimization flags. To me, these optimizations do not sound like they are impacting floating-point computations.
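One caveat on the `-O0` experiments: the GCC documentation states that most optimizations are completely disabled at `-O0`, even if individual optimization flags are specified, so passing `-O0` plus individual `-f` flags may not actually run those passes. A hedged way to see which flags differ between two option sets is to compare the reports printed by `-Q --help=optimizers`. The sketch below assumes each report line looks like `  -ftree-pta   [enabled]`; the helper names are hypothetical, and `optimizer_report` needs a GNU compiler on the path.

```python
import re
import subprocess

# Matches report lines such as "  -ftree-pta        [enabled]" (assumed format).
_FLAG_LINE = re.compile(r"^\s*(-f[\w-]+)\s+\[(enabled|disabled)\]\s*$")


def parse_optimizer_report(report):
    """Map each on/off optimization flag in the report to True or False."""
    states = {}
    for line in report.splitlines():
        match = _FLAG_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2) == "enabled"
    return states


def changed_flags(report_a, report_b):
    """Return the sorted flags whose enabled state differs between two reports."""
    a = parse_optimizer_report(report_a)
    b = parse_optimizer_report(report_b)
    return sorted(flag for flag in a.keys() & b.keys() if a[flag] != b[flag])


def optimizer_report(compiler, *flags):
    """Run e.g. `gfortran -Q --help=optimizers -O1` and return its text output."""
    result = subprocess.run(
        [compiler, "-Q", "--help=optimizers", *flags],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For instance, `changed_flags(optimizer_report("gfortran", "-O0"), optimizer_report("gfortran", "-O1"))` would list the flags that `-O1` switches on. Note that the report only reflects flag settings, not whether a pass actually runs.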
I do not think we can say it is a compiler bug as the ultimate issue is the solver fails to converge. Or at least, it may not be easy to create a small reproducer to demonstrate this as writing out intermediate floating-point values may not show something obvious (I did try this a little bit). It seems plausible that the solver could be extra sensitive.
With the test `SMS_P128x2.ne30pg2_ne30pg2.F2010-CICE.perlmutter_gnu`, I ran with and without adding `-fno-tree-pta` and I see 1.52 SYPD for both (which is only 5 days on 2 nodes).
That's great work, @ndkeen. I looked around a little and there seems to be a known bug with -fno-tree-pta? At least I found this: Bug #96522... But I couldn't find much documentation of what that flag actually does, beyond the part you posted. I think adding that to the non-debug fortran flags seems like an excellent solution.
Huh -- I also did some google searching, but I missed that. So I would think that anything they found would be included in gcc10, right? Because I still see the solver fails with gcc10 -- I can verify if adding this compiler flag also works with gcc10.
I found a document about general "points-to analysis" here: http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/ghiya-pldi01.pdf
It looks like that bug is in gcc10? It says it's known to fail for 10.2.0
I haven't yet tried gnu11. On PM, I see gcc/11.2.0
Just tried it (which involves a couple of minor changes needed to build with gnu10+) and I still see the error.
Now trying with `-fno-tree-pta` -- and it seems to be OK. At least for `SMS.T62_oQU120_ais20.MPAS_LISIO_TEST`.
@ndkeen - what do you think about adding the -fno-tree-pta to the gnu9/10 non-debug flags? It would be good to get these issues closed
We can do better by only throwing this flag for mpas-seaice/src/column/ice_shortwave.F90
Did you figure out a way to do that for an mpas file?
You can just rebuild that one file.
Automatically? Or just for testing?
Just for testing
Interesting that that's not the file where the error is occurring
@ndkeen - I'm not sure anyone has the time to go through that code when it works fine using other compilers and gnu in debug mode. It might make more sense to add that flag to gnu9/10 so that these cases will work on platforms that require newer versions of gnu
@ndkeen: I guess it's unclear to me how this could NOT be a compiler issue?
I'm happy to make fixes to the code, however, should definitive evidence emerge that there's an issue with the code rather than the compilers.
Thanks @ndkeen for figuring out which file and which flag were causing the problem. I tested it with a hack into the Depends file and it works!
I made this branch and am testing: `ndk/machinefiles/add-compiler-flag-for-one-mpas-source-with-gnu`
This would alter compilation of that one file for all GNU versions (on all machines), but I verified it was OK on cori-knl with gnu8 (and gnu9), so maybe this is the path of least resistance?
That sounds great. If there is a performance hit at all on gnu8, I don't think it matters much since we don't really use gnu for production runs. And it's necessary for gnu9+. Thanks again for tracking this down
Also, I was able to rule out large sections of the code in that source file. It appears to be in the routine run_dedd. If I make a new source file, with a new module containing only that routine, it will fail unless I add the `-fno-tree-pta` flag.
Sounds like you did a lot of work. I looked briefly at that routine and don't see anything that jumps out, but I really don't know that part of the code. Thanks again
@ndkeen - how is the testing of this going? I think your fix is perfect and would be good to get in
I'm not confident this is the root issue, but it would take work to investigate. If we do what's easiest and add this flag to all GNU builds, there may be a fair amount of testing needed, as I'm not sure it will be BFB. So I'm not sure what the best path is.
We can just add it for gnu9 and gnu10 -- but the automated testing will catch non-BFB behavior if you want to try it for all gnu compilers
Actually, this might be a good use of the new feature of cmake macros files (right @jgfouca ?) -- we could add it for GNU major versions 9 and above (only for this file).
I have tested ndk/machinefiles/add-compiler-flag-for-one-mpas-source-with-gnu branch on ANL GCE with GCC 11. It has fixed all e3sm_developer tests that previously failed with "picard convergence failed!" error.
There are no production runs in progress with gnu so a bfb change would be ok.
Thanks for testing @dqwu -- ah, yes that's another good point Rob. Would you be OK with adding this flag (for this one file) for all GNU builds?
Yes but using the Cmake macros feature to limit this to versions >= 9 would also work.
@ndkeen, @rljacob, the new cmake macro system will allow you to do version checking in the macros. In the previous system, if you needed different flags for gnu8 and gnu9, you'd have to make two entire compiler block definitions in config_compilers.xml. With the new system, you can call the compiler the same name (`gnu` in this case) and add version checking to either the `gnu.cmake` or `gnu$machine.cmake`. If you only want these flag changes for specific files, you'll have to use the Depends file system. I think version checking should work fine there as well, but we haven't tested that.
An example of a version check:
if (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER_EQUAL 10)
  string(APPEND FFLAGS " -fallow-argument-mismatch -fallow-invalid-boz ")
endif()
Thanks Jim. I now have:
# For optimized GNU builds that use v9 or higher, remove an optimization on one file
if (NOT DEBUG)
  if (CMAKE_Fortran_COMPILER_VERSION VERSION_GREATER_EQUAL 9)
    foreach(ITEM IN LISTS MPAS_ICE_SHORTWAVE)
      e3sm_add_flags("${ITEM}" "-fno-tree-pta") # avoids an error that shows up in solver with gnu9 and higher
    endforeach()
  endif()
endif()
And I tested with current GNU (v8) and v9. It seems to work. I updated the branch.
@ndkeen that is great news, thanks!
Already noted here https://github.com/E3SM-Project/E3SM/issues/4495
I just wanted to show that all 9 of these tests in e3sm_developer are failing in the same way using GNU compiler v9 or higher. This is the default version on chrysalis, and many new machines becoming available use gnu9 or higher.
There is little info in the e3sm.log file for the tests. You need to look in the run/ dir to spot the `log.seaice.*` files, with the error message found in a file such as `run/log.seaice.0039.d0001.err`: