E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Vertical thermo error in ICE #1194

Closed ndkeen closed 6 years ago

ndkeen commented 7 years ago

Trying to run the ne120 problem with different layouts to get the best performance on cori-knl. With 424 nodes, I successfully ran with 7200 MPI ranks and 1, 8, and 16 threads (all components). But going to 14400 MPI ranks and 8 threads, I hit this strange error:

09446:  Thermo iteration does not converge,istep1, my_task, i, j:         417
09446:         9446           2           2
09446:  Ice thickness:   2.00000000000000
09446:  Snow thickness:  9.540449954461521E-003
09446:  dTsf, Tsf_errmax:  -94.0848214215804       5.000000000000000E-004
09446:  Tsf:  -1112.14653413169
09446:  fsurf:  -12270.6566596381
09446:  fcondtop, fcondbot, fswint  -12270.1168873794      -0.514900436911363
09446:   -19847.0600869532
09446:  fswsfc, fswthrun  -21779.3519269064       -2548.51468152651
09446:  Flux conservation error =   90.5036874039033
09446:  Internal snow absorption:
09446:   -822.081992523233
09446:  Internal ice absorption:
09446:   -14723.3575388669       -2243.81715461096       -1345.42168168170
09446:   -712.381719270388
09446:  Initial snow temperatures:
09446:  -0.880099655950668
09446:  Initial ice temperatures:
09446:   -1.46222162681964       -1.74324215122892       -1.79381145381079
09446:   -1.79869012897443
09446:  Final snow temperatures:
09446:   -917.042473965963
09446:  Final ice temperatures:
09446:   -17.3607235819758       -2.06867766777213       -1.94224931647653
09446:   -1.87153405513137
09446:  istep1, my_task, iblk =         417        9446           7
09446:  Global block:       95847
09446:  Global i and j:      670923           1
09446:  Lat, Lon:   59.3207611224500        17.8243937597401
09446:  ERROR: ice: Vertical thermo error
09446: Image              PC                Routine            Line        Source
09446: cesm.exe           00000000241C5FDD  Unknown               Unknown  Unknown
09446: cesm.exe           0000000022A034AC  shr_sys_mod_mp_sh         230  shr_sys_mod.F90
09446: cesm.exe           0000000021F14F95  cice_runmod_mp_st         412  CICE_RunMod.F90
09446: cesm.exe           0000000021F12F84  cice_runmod_mp_st         114  CICE_RunMod.F90
09446: cesm.exe           00000000243182D3  Unknown               Unknown  Unknown
09446: cesm.exe           00000000242D0F40  Unknown               Unknown  Unknown
09446: cesm.exe           00000000242D0205  Unknown               Unknown  Unknown
09446: cesm.exe           00000000243186C1  Unknown               Unknown  Unknown
09446: cesm.exe           0000000023EAB3A4  Unknown               Unknown  Unknown
09446: cesm.exe           0000000024430D99  Unknown               Unknown  Unknown
09446: Rank 9446 [Wed Dec 21 03:19:55 2016] [c1-5c0s9n1] application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 9446
gold2718 commented 7 years ago

Is this run with CICE or MPAS_CICE? Is this a clean build? CICE needs to know the layout at compile time.

ndkeen commented 7 years ago

As it is an F case, I thought there was no ice? It was a clean build.

gold2718 commented 7 years ago

If you look at the F compsets, you will see that most of them use CICE and CLM (the exceptions are the aquaplanet and ideal compsets). For CICE builds, if you change the number of tasks, you have to completely rebuild the case (./case.build --clean-all; rm -rf bld; ./case.setup --clean; ./case.setup; ./case.build). Did you do that when you switched to 14400 tasks?
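For reference, the clean-rebuild sequence above can be laid out as a small script. This is only a dry-run sketch: the echo prefixes print the commands instead of running them, since they only make sense inside an actual case directory. The commands themselves come straight from the comment above.

```shell
# Dry-run sketch of the full clean rebuild needed after changing the task
# layout for a CICE build. Drop the 'echo' prefixes to actually run it
# inside a case directory.
rebuild_case() {
  echo "./case.build --clean-all"   # scrub previous build artifacts
  echo "rm -rf bld"                 # remove the build directory entirely
  echo "./case.setup --clean"       # reset the case configuration
  echo "./case.setup"               # regenerate it with the new task layout
  echo "./case.build"               # rebuild from scratch
}
rebuild_case
```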

ndkeen commented 7 years ago

Ah OK, so there is ice. This is FC5AV1C-04P. Yes, in most cases (and this one), I start with a new case directory and a clean build. If this is a common error, I can try again. I did try again with 4 threads (starting with a new case) and hit a different segfault. It may be that I'm near a memory limit.

maltrud commented 7 years ago

This is an issue we've encountered on a regular basis with both the high-resolution (0.1 degree POP/CICE) v0 B-cases and the RRS18to6km MPAS G-cases. We're still not sure what causes the problem, but small changes in the setup can make it go away. For example, the first time we saw this was on Mira; we moved the run to Titan and the problem went away. Changing the optimization level will often make it go away, as will changing the dynamics subcycling in the sea ice model. We should probably figure out the root cause at some point.

maltrud commented 7 years ago

sorry--hit the wrong button....

ndkeen commented 7 years ago

Thanks @maltrud . That's why I posted it -- maybe someone has seen it before, or maybe it will help someone verify it's "real". Not causing me a problem at the moment. I could run in some other ways if it helps to investigate.

akturner commented 7 years ago

The first thing to note is that failures in other parts of the system are often caught in the sea ice vertical thermodynamics, since the thermodynamics is iterative with a convergence criterion. Unphysical values generated elsewhere will often propagate until they cause the sea ice vertical thermodynamics not to converge. Here we see a sea ice surface temperature of around -1100 C, which suggests a problem with the atmospheric fluxes. Jon has found places in the atmosphere model where, after a failure, the model continues running with -999 added to fields (https://github.com/ACME-Climate/ACME/issues/1292). This may have happened here.
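The mechanism described above can be sketched with a toy solver. To be clear, this is not CICE's actual scheme: the flux linearization, the per-step temperature cap, and all numerical values below are invented for illustration. The point is that an iterative surface-energy balance with a tight convergence tolerance converges quickly for physical forcing, while a flux polluted by a huge fill value pushes the balance point hundreds of degrees away, the iteration budget runs out, and the model aborts. That is how a problem born elsewhere can surface as a "Vertical thermo error".

```python
# Toy illustration of an iterative sea ice surface-energy balance with a
# convergence tolerance. NOT CICE's actual scheme; all names and numbers
# here are invented for illustration.

def solve_tsf(flux0, dflux_dt, k_over_h, t_ice,
              tsf0=-1.0, tsf_errmax=5e-4, nitermax=50):
    """Iterate on surface temperature tsf until the atmospheric surface flux
    balances conduction into the top ice layer, or give up after nitermax."""
    tsf = tsf0
    for _ in range(nitermax):
        fsurf = flux0 + dflux_dt * (tsf - tsf0)  # linearized atmospheric flux
        fcondtop = k_over_h * (tsf - t_ice)      # conduction into the ice
        # Newton step on the residual fsurf - fcondtop
        dtsf = -(fsurf - fcondtop) / (dflux_dt - k_over_h)
        dtsf = max(-1.0, min(1.0, dtsf))         # invented per-step cap (K)
        tsf += dtsf
        if abs(dtsf) < tsf_errmax:               # converged
            return tsf
    # Mirrors the model's behavior: report the state and abort the run
    raise RuntimeError(f"Thermo iteration does not converge, Tsf = {tsf:.2f}")

# Physical forcing: converges in a couple of iterations
print(solve_tsf(flux0=-5.0, dflux_dt=-20.0, k_over_h=10.0, t_ice=-1.8))

# Forcing polluted by a huge fill value (the kind a -999-contaminated field
# can produce): the balance point is hundreds of degrees away, so the
# iteration cap is hit and the "model" aborts
try:
    solve_tsf(flux0=-12000.0, dflux_dt=-20.0, k_over_h=10.0, t_ice=-1.8)
except RuntimeError as err:
    print(err)
```

The capped step size stands in for the safeguards a real solver carries; it is what turns a wildly wrong root into a non-convergence abort rather than a silent jump to an unphysical temperature.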

ndkeen commented 7 years ago

I have not seen this in a while, and I have been running various ne120 F cases.

ndkeen commented 7 years ago

Well, I just saw this again, using the beta release of Intel v18 on cori-knl, this time with two tests in acme_dev: ERS_IOP.f45_g37_rx1.DTEST and ERS.f45_g37_rx1.DTEST

5:  Lat, Lon:   75.0526957285091        113.813909090145     
5:  ERROR: ice: Vertical thermo error
5: Image              PC                Routine            Line        Source             
5: acme.exe           0000000020EC8408  Unknown               Unknown  Unknown
5: acme.exe           00000000204C686C  Unknown               Unknown  Unknown
5: acme.exe (deleted  0000000020209BC0  Unknown               Unknown  Unknown
5: acme.exe           00000000202086D9  cice_runmod_mp_st         114  CICE_RunMod.F90
5: acme.exe           000000002010026B  Unknown               Unknown  Unknown
5: acme.exe (deleted  00000000200270C6  Unknown               Unknown  Unknown
5: acme.exe           000000002000D58D  Unknown               Unknown  Unknown
5: acme.exe           0000000020026DD2  MAIN__                     68  cesm_driver.F90
5: acme.exe           000000002000B1DE  Unknown               Unknown  Unknown
5: acme.exe           0000000020FB1B60  Unknown               Unknown  Unknown
5: acme.exe           000000002000B0C7  Unknown               Unknown  Unknown
5: Rank 5 [Thu Jul 27 19:46:19 2017] [c5-1c2s7n3] application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 5

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mmf-i18impib-again/ERS_IOP.f45_g37_rx1.DTEST.cori-knl_intel18.20170727_175124_yx4jhi

singhbalwinder commented 7 years ago

Is this error reproducible? If yes, we should try to find its root cause. We might not be using CICE in the future but, as @mt5555 mentioned, the error might be coming from somewhere else.

We can see if debug mode reveals more info about this error. @ndkeen: can you please try ERS_D.f45_g37_rx1.DTEST to see if it also fails?

ndkeen commented 7 years ago

I realize I'm using a beta version of the compiler, but because I stumbled upon this with an acme_dev test, I thought it might make the problem easier to find. I started: create_test --machine=cori-knl --compiler=intel18 ERS_D_IOP.f45_g37_rx1.DTEST ERS_D.f45_g37_rx1.DTEST

Note: I'm using a branch where I've added this intel18 option for cori-knl. I started with master as of yesterday. I'm ready to make a PR to get it into the repo, as it only adds an option. https://github.com/ACME-Climate/ACME/pull/1685

ndkeen commented 7 years ago

Those two _D tests did pass.

singhbalwinder commented 7 years ago

OK. It seems like some kind of memory issue if it passes in debug mode, but it could also be something else (a compiler bug?). Unless there is a better alternative, I think we can proceed as follows:

If the error is reproducible in a non-debug run, it might be useful to compile the code again with only the -g flag (to produce debugging information) and attach a debugger. But first we need to make sure that the code still fails when only -g is added to the compiler options.

If debugging is taking too much time, we should first evaluate whether it is worth debugging at all. @mt5555, @philrasch and @rljacob: any thoughts on this?

ndkeen commented 7 years ago

This error is preventing me from using intel18 with a high-res G case. Is there anything we can do to debug it with the small test that I noted above?

For example, I just ran this again with a recent master: ./create_test ERS.f45_g37_rx1.DTEST --machine=cori-knl --compiler=intel18

And I get the following:

7:  Thermo iteration does not converge,istep1, my_task, i, j:           1
7:            7          23          26
7:  Ice thickness:  0.312148206064179     
7:  Snow thickness:  2.413462149830149E-002
7:  dTsf, Tsf_errmax: -1.842970220877760E-014  5.000000000000000E-004
7:  Tsf:  -1.76631202746488     
7:  fsurf:  -20.2628561160800     
7:  fcondtop, fcondbot, fswint  -20.2628561160800       -23.5797515784896     
7:   0.000000000000000E+000
7:  fswsfc, fswthrun  0.000000000000000E+000  0.000000000000000E+000
7:  Flux conservation error =   79033.8875786383     
7:  Internal snow absorption:
7:   0.000000000000000E+000
7:  Internal ice absorption:
7:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
7:   0.000000000000000E+000
7:  Initial snow temperatures:
7:   0.000000000000000E+000
7:  Initial ice temperatures:
7:   -1.72147865098908       -3.75992429390085       -2.28765319722964     
7:   -2.38488837169471     
7:  Final snow temperatures:
7:  -0.951251422737992     
7:  Final ice temperatures:
7:  -3.505690024561692E-002  -2.98627214735190       -2.34651937273828     
7:   -2.29567453348333     
7:  istep1, my_task, iblk =           1           7           1
7:  Global block:          12
7:  Global i and j:          97          83
7:  Lat, Lon:   47.4023336155539       -50.2956950320106     
7:  ERROR: ice: Vertical thermo error
jonbob commented 7 years ago

@ndkeen - these issues are unrelated. The error in the DTEST compset is from CICE, while the high-res G-case should be MPAS-CICE. They are different models, though they share some code. Can you point me at the high-res issue separately?

ndkeen commented 6 years ago

OK, I see the difference. I was using the test (./create_test ERS.f45_g37_rx1.DTEST --machine=cori-knl --compiler=intel18) as a proxy, since it was producing the "same error" (same message) and was easier to run. I see that this is not even testing MPAS. I tried a few G cases with the latest intel18 (and previous runs with intel17) and have not run into this problem again. My original report was from an F-compset run, which I can try again, but it probably makes sense to separate the issues as you suggest. One issue is with CICE in the above test; the other is a sporadic issue that sometimes happens (or maybe no longer?) with MPAS-CICE.

jonbob commented 6 years ago

@ndkeen - is this still an issue? If not, can you please close it? Thanks

ndkeen commented 6 years ago

I'm fine closing this as I don't have an easy way to reproduce it and have not seen it (when using MPAS) in a while.