Trying to run the ne120 problem with different layouts to get the best performance on cori-knl. With 424 nodes, I successfully used 7200 MPI ranks with 1, 8, and 16 threads (all components). But going to 14400 MPI ranks and 8 threads, I hit this strange error (the "ERROR: ice: Vertical thermo error" shown in the comments below).
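As a point of reference, a hypothetical sketch of how such a layout change is made from the case directory; the NTASKS/NTHRDS shorthands are assumptions here, and the exact xmlchange variables depend on the CIME version. The full clean rebuild discussed below is still needed afterwards:

# hypothetical: set the new task/thread layout for all components
./xmlchange NTASKS=14400,NTHRDS=8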
Is this run with CICE or MPAS_CICE? Is this a clean build? CICE needs to know the layout at compile time.
As it is an F case, I thought there was no ice? It was a clean build.
If you look at the F compsets, you will see that most of them use CICE and CLM (the exceptions are the aquaplanet and ideal compsets). For CICE builds, if you change the number of tasks, you have to completely rebuild the case (./case.build --clean-all; rm -rf bld; ./case.setup --clean; ./case.setup; ./case.build); the sequence is spelled out below. Did you do that when you switched to 14400 tasks?
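Step by step, that clean-rebuild sequence (run from the case directory) is:

./case.build --clean-all
rm -rf bld
./case.setup --clean
./case.setup
./case.build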
Ah OK, so there is ice. This is FC5AV1C-04P. Yes, in most cases (including this one), I start with a new case directory and a clean build. If this is a common error, I can try again. I did try again with 4 threads (starting from a new case) and hit a different segfault. It may be that I'm near a memory limit.
This is an issue we've encountered on a regular basis with both the high-resolution (0.1 degree POP/CICE) v0 B cases and the RRS18to6km MPAS G cases. We're still not sure what causes the problem, but small changes in the setup can make it go away. For example, the first time we saw this was on Mira; we moved the run to Titan and the problem went away. You can change the optimization level and it will often go away, or change the dynamics subcycling in the sea ice model (see the sketch below). We should probably figure out the root cause at some point.
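As an illustration of the subcycling workaround, a minimal sketch of a namelist override from the case directory; the CICE namelist variable ndte (the EVP dynamics subcycling count) and the value shown are assumptions, not a tested fix:

# hypothetical: override the sea-ice dynamics subcycling count via the
# case's CICE namelist modifications file (value is illustrative only)
cat >> user_nl_cice <<'EOF'
 ndte = 240
EOF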
Sorry -- hit the wrong button.
Thanks @maltrud . That's why I posted it -- maybe someone has seen it before, or maybe it will help someone verify it's "real". Not causing me a problem at the moment. I could run in some other ways if it helps to investigate.
The first thing to note is that failures in other parts of the system are often caught in the sea ice vertical thermodynamics since the thermodynamics is iterative with a convergence criterion. Unphysical values generated elsewhere will often propagate until they cause the sea ice vertical thermodynamics not to converge. Here we see a sea ice surface temperature of -1000 C, which suggests a problem with the atmospheric fluxes. Jon has found places in the atmosphere model where after a failure the model continues after adding -999 to fields (https://github.com/ACME-Climate/ACME/issues/1292). This may have happened here.
I have not seen this in a while, and I have been running various ne120 F cases.
Well, I just saw this again, using the beta release of Intel v18 on cori-knl. This time it was with two tests in acme_dev: ERS_IOP.f45_g37_rx1.DTEST and ERS.f45_g37_rx1.DTEST
5: Lat, Lon: 75.0526957285091 113.813909090145
5: ERROR: ice: Vertical thermo error
5: Image PC Routine Line Source
5: acme.exe 0000000020EC8408 Unknown Unknown Unknown
5: acme.exe 00000000204C686C Unknown Unknown Unknown
5: acme.exe (deleted 0000000020209BC0 Unknown Unknown Unknown
5: acme.exe 00000000202086D9 cice_runmod_mp_st 114 CICE_RunMod.F90
5: acme.exe 000000002010026B Unknown Unknown Unknown
5: acme.exe (deleted 00000000200270C6 Unknown Unknown Unknown
5: acme.exe 000000002000D58D Unknown Unknown Unknown
5: acme.exe 0000000020026DD2 MAIN__ 68 cesm_driver.F90
5: acme.exe 000000002000B1DE Unknown Unknown Unknown
5: acme.exe 0000000020FB1B60 Unknown Unknown Unknown
5: acme.exe 000000002000B0C7 Unknown Unknown Unknown
5: Rank 5 [Thu Jul 27 19:46:19 2017] [c5-1c2s7n3] application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 5
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mmf-i18impib-again/ERS_IOP.f45_g37_rx1.DTEST.cori-knl_intel18.20170727_175124_yx4jhi
Is this error reproducible? If so, we should try to find its root cause. We might not be using CICE in the future, but, as @mt5555 mentioned, the error might be coming from somewhere else.
We can see if debug mode reveals more info about this error. @ndkeen: can you please try ERS_D.f45_g37_rx1.DTEST to see if it also fails?
I realize I'm using a beta version of the compiler, but since I stumbled upon this with an acme_dev test, I thought it might make the problem easier to track down. I started:
create_test --machine=cori-knl --compiler=intel18 ERS_D_IOP.f45_g37_rx1.DTEST ERS_D.f45_g37_rx1.DTEST
Note: I'm using a branch where I've added this intel18 option for cori-knl. I started from master as of yesterday. I'm ready to make a PR to get it into the repo, as it only adds an option. https://github.com/ACME-Climate/ACME/pull/1685
Those two _D tests did pass.
OK. If it passes in debug mode, it seems like some kind of memory issue, though it could be something else as well (a compiler bug?). Unless there is a better alternative, I think we can proceed as follows:
If the error is reproducible in a non-debug run, it might be useful to compile the code again with only the -g flag added (to produce debugging information) and attach a debugger. But first we need to make sure that the code still fails when we add only -g to the compiler options.
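A minimal sketch of that check, assuming the case's Fortran flags live in a Macros file in the case directory (the file name and the sed pattern are assumptions; adjust to the actual build setup):

# hypothetical: append -g to the Fortran flags, then force a clean rebuild
sed -i 's/^\(FFLAGS.*\)/\1 -g/' Macros
./case.build --clean-all
./case.build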
If debugging takes too much time, we should first evaluate whether it is worth debugging at all. @mt5555, @philrasch, and @rljacob: any thoughts on this?
This error is preventing me from using intel18 with a high-res G case. Is there anything we can do to debug it with the small test I noted above?
For example, I just ran this again with a recent master:
./create_test ERS.f45_g37_rx1.DTEST --machine=cori-knl --compiler=intel18
And I get the following:
7: Thermo iteration does not converge,istep1, my_task, i, j: 1
7: 7 23 26
7: Ice thickness: 0.312148206064179
7: Snow thickness: 2.413462149830149E-002
7: dTsf, Tsf_errmax: -1.842970220877760E-014 5.000000000000000E-004
7: Tsf: -1.76631202746488
7: fsurf: -20.2628561160800
7: fcondtop, fcondbot, fswint -20.2628561160800 -23.5797515784896
7: 0.000000000000000E+000
7: fswsfc, fswthrun 0.000000000000000E+000 0.000000000000000E+000
7: Flux conservation error = 79033.8875786383
7: Internal snow absorption:
7: 0.000000000000000E+000
7: Internal ice absorption:
7: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
7: 0.000000000000000E+000
7: Initial snow temperatures:
7: 0.000000000000000E+000
7: Initial ice temperatures:
7: -1.72147865098908 -3.75992429390085 -2.28765319722964
7: -2.38488837169471
7: Final snow temperatures:
7: -0.951251422737992
7: Final ice temperatures:
7: -3.505690024561692E-002 -2.98627214735190 -2.34651937273828
7: -2.29567453348333
7: istep1, my_task, iblk = 1 7 1
7: Global block: 12
7: Global i and j: 97 83
7: Lat, Lon: 47.4023336155539 -50.2956950320106
7: ERROR: ice: Vertical thermo error
@ndkeen - these issues are unrelated. The error in the DTEST compset is from CICE, while the high-res G case should be MPAS-CICE. They are different models, though they share some code. Can you point me at the high-res issue separately?
OK, I see the difference. I was using the test (./create_test ERS.f45_g37_rx1.DTEST --machine=cori-knl --compiler=intel18) as a proxy, since it was hitting the same error message and was easier to run. I see now that this does not even test MPAS. I have tried a few G cases with the latest intel18 (and earlier runs with intel17) and have not run into this problem again. My original comment was about an F-compset run, which I can try again, but it probably makes sense to separate the issues as you suggest: one issue is with CICE in the test above; the other is a sporadic issue that sometimes happens (or maybe no longer?) with MPAS-CICE.
@ndkeen - is this still an issue? If not, can you please close it? Thanks
I'm fine with closing this, as I don't have an easy way to reproduce it and have not seen it (when using MPAS) in a while.