E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Restart problems on cori with WCYCL case. #774

Closed rljacob closed 8 years ago

rljacob commented 8 years ago

Report from @jonbob: I ran some further tests, and we appear to be OK on edison, meaning the exact restart tests pass. But the behavior is very different on cori, which I assume is due to the newer version of the intel compiler. Specifically, our active-ocn-only tests fail, so any test or run with the ocean model will hit this restart issue. I'll try to test on cori with the intel15 compiler to make sure that's the root of the problem. But in the meantime, I'm not sure we should be using cori.

Even with intel 15, the A_WCYCL2000 exact restart test fails on cori. So we should avoid running there until we figure out what's going on.

rljacob commented 8 years ago

Does any compset restart correctly on cori? Asking @tangq or @bishtgautam

jonbob commented 8 years ago

@rljacob : I tried A_WCYCL2000 and some active-MPAS-only compsets and nothing has passed yet, especially using the default machine configuration. I also ran A_WCYCL2000 with intel 15 instead of the default intel 16 and it failed the exact restart test as well

jonbob commented 8 years ago

@rljacob exact restart tests fail for A_WCYCL2000 as well as active-ocn-only, but pass for active-ice-only. There may be trouble in other components as well, but definitely in ocn. I've tried to change the environment to be more similar to edison's, but so far that has had no impact. I'll also try the F case and see if the issue can be isolated to the ocean model.

jonbob commented 8 years ago

@rljacob the issue seems to be isolated in mpas-o. The FC5AC1C exact restart test passes, as does the active-cice-only. I'll talk to @douglasjacobsen

tangq commented 8 years ago

@rljacob , I somehow missed your question (sorry). I haven't run jobs on Cori, so don't know the answer to your question. But I confirmed that the FC5AV1C-01 and the corresponding A_WCYCL2000 compsets restart correctly on edison.

rljacob commented 8 years ago

That's ok @tangq . @jonbob checked an F case on cori and it works. In general, restart ability should be checked with all our machine-compiler combinations.

douglasjacobsen commented 8 years ago

Just a small update. I was testing with: ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel

I'm not actually convinced that it's a problem with the ocean core.

If I turn off all of the dynamics I still see the issue, but also if I run the same exact setup multiple times, I get different differences.

I'll keep looking into it, but maybe someone else has some ideas of things to try.

worleyph commented 8 years ago

I haven't had a chance to chase this yet, but I also believe that I have seen exactly the same job producing different results from run to run (on Cori at least), and this was in the atmosphere. I'll try to verify that my instance is repeatable. (My evidence is buried in amongst my other recent experiments, and I am not sure that this is actually the case.)

douglasjacobsen commented 8 years ago

Thanks @worleyph. In my example case, the two fields that differ are: o2x_So_u and o2x_So_dhdx. So there is a coincidence that both of these are reconstructed fields in the same direction. I'll look through the ocean code to see if I can convince myself that there is an issue, but in case other people see effectively random results on cori, it might be good to post here or let other people know.

worleyph commented 8 years ago

So, I am not sure what I am doing, nor what the behavior should be. Probably need to have this defined more carefully, and then to assign multiple people to look at this, one for each platform.

In my latest experiments, running A_WCYCL2000 with resolution ne30_oEC on Cori with nested OpenMP disabled (no -DCOLUMN_OPENMP), with BUILD_THREADED == TRUE, and a PE layout without OpenMP, multiple runs with identical executables and other runtime settings show divergence in atm.log output at nstep 5. I also tried running with BFBFLAG == TRUE, and two runs with this setting also diverged at nstep = 5, and differed from the BFBFLAG == FALSE run at nstep = 2 (though the latter is probably expected).

Most of the details described here are probably irrelevant, but this is the case I was working in when I decided to try this. I think that I am getting reproducible results on Titan, though jobs are taking forever to schedule there. (Jobs with different thread settings are getting identical results, which should be sufficient. Still need to run more tests.) I have not had the chance to do much on Edison - my jobs there are also taking forever to schedule.

Perhaps the F case would be sufficient to look at this in the atmosphere. I will not be able to continue this study myself, so encourage others to take over ( @amametjanov ? @ndkeen ? for Edison, and to verify my observations on Cori? @mrnorman to verify on Titan?) I'm guessing that working exact restart tests are sufficient to establish this, but getting something to compare Cori with would still be useful?

douglasjacobsen commented 8 years ago

Something interesting I found out this morning.

Results: ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel - FAILED (Not BFB) ERS_Ln3_D.T62_mpas120.CMPASO-NYF.corip1_intel - PASSED (BFB)

I'm not sure why yet, but I'll continue to look into it as I get time.
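For readers unfamiliar with the test names quoted above, here is a minimal sketch of how they break down, assuming the usual `TYPE[_MODIFIER...].GRID.COMPSET.MACHINE_COMPILER` layout. The helper `parse_test_name` is hypothetical and not part of the test scripts; it only makes it easier to see that the two tests differ in a single modifier (`D`, which is assumed here to select a debug build).

```python
def parse_test_name(name):
    """Split a test name into its dot-separated parts.

    Assumes the layout TYPE[_MODIFIER...].GRID.COMPSET.MACHINE_COMPILER.
    This helper is illustrative only.
    """
    test_part, grid, compset, mach_comp = name.split(".")
    test_type, *modifiers = test_part.split("_")
    machine, compiler = mach_comp.rsplit("_", 1)
    return {
        "type": test_type,
        "modifiers": modifiers,
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
    }


failed = parse_test_name("ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel")
passed = parse_test_name("ERS_Ln3_D.T62_mpas120.CMPASO-NYF.corip1_intel")

# Report which parts differ between the failing and the passing test.
differing = {key for key in failed if failed[key] != passed[key]}
print(differing)
```

Since only the modifier list differs, the comparison points at the build options rather than the grid, compset, or machine.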

douglasjacobsen commented 8 years ago

Another note. I used ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel where I changed -O2 to -O0 in the Macros file, and it passed the ERS test as well.

rljacob commented 8 years ago

I think we're going to have to back-off the intel v16 compiler. Is v15 supported on cori?

rljacob commented 8 years ago

Nevermind, I forgot @jonbob already tested v15 and the problem doesn't go away.

singhbalwinder commented 8 years ago

Does this problem only exist on Cori? That is, do these tests work fine on other machines?

worleyph commented 8 years ago

So, github just crashed? I was trying to make the following comment:

It might be worthwhile trying an experiment with

    -fp-model precise

I see that this is set for the C flags, but not the Fortran flags. Not arguing that we should require this level of numerical precision (though we do require -Kieee on Titan currently), but it might help resolve the source of the problem. If exact restarts become enabled again, we can then look for which routines are sensitive to this.

Pat


douglasjacobsen commented 8 years ago

@worleyph Yes. I tried with that flag, and that's one of the ones that fixed it.

@singhbalwinder Yes, so far this only exists on cori.

Cori's scheduler is down right now too, so I can't submit any more jobs to test until it comes back up.

douglasjacobsen commented 8 years ago

@worleyph Also, based on the diff, I have two ideas for routines that might be sensitive to it, but the weird thing is there are multiple fields that are computed using these same routines and most of them don't have a diff.

singhbalwinder commented 8 years ago

I think, in the atm model, we have this flag ( -fp-model precise or -fp-model strict) always turned on. Is there a huge performance penalty in using this flag?

In the recent past, NCAR team has found that omitting this flag with -O3 optimization, intel v15.0 produces weird answers on Yellowstone machine. The model runs fine but the answers are wrong.

mt5555 commented 8 years ago

FYI:

worleyph commented 8 years ago

Looking at a case on cori, the model settings are:

    CFLAGS:= -O2 -fp-model precise
    FFLAGS:=  -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs -fp-model source
    FFLAGS += -O2

worleyph commented 8 years ago

The same settings are used on Edison (as on Cori).

worleyph commented 8 years ago

@singhbalwinder , please note

 I think, in the atm model, we have this flag ( -fp-model precise or -fp-model strict) always turned on. 

that this is not true in ACME. If you expected that it was on, you need to add it (back) in.

douglasjacobsen commented 8 years ago

Turns out this needs -fp-model precise to be BFB across restarts.
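As background on why a floating-point model flag can matter for bit-for-bit (BFB) results, here is a minimal sketch of the general mechanism: floating-point addition is not associative, so an optimizer that is allowed to regroup operations can change the answer. The values are made up for illustration, and this is not a reproduction of the actual difference seen in the ocean model.

```python
# Four values whose exact sum is 2.0.
values = [1.0e17, 1.0, -1.0e17, 1.0]

# Source order: the first 1.0 is lost when added to 1.0e17,
# because it is smaller than the spacing between doubles at that magnitude.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# Regrouped order, as a value-unsafe optimization might produce:
# the large terms cancel first, so both 1.0 terms survive.
regrouped = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right, regrouped)
print("bit-for-bit identical:", left_to_right == regrouped)
```

Two builds, or two code paths in one build, that evaluate the same expression in different orders can therefore disagree in the last bits, which is enough to fail an exact restart comparison.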

singhbalwinder commented 8 years ago

Thanks @worleyph . I was working with pre-cime version of ACME when I noticed this flag turned on by default (at least on Yellowstone). It might have changed post-cime.

@douglasjacobsen : That is an interesting find, as the same flag combination with a similarish environment works fine on Edison but not on Cori (if I am getting it right).

douglasjacobsen commented 8 years ago

Yes, that's correct, and I agree it's very weird. I'm testing now with removing -fp-model source, since it seems like kind of a weird flag to be using (anyone know why we have this in the default?).

worleyph commented 8 years ago

For the most part, ACME settings were preserved in ACME post-cime, which is why we have machines-acme, for example. Probably not worth trying to determine the history of this setting though.

The current settings in CESM for the intel Fortran compiler are:

  -no-opt-dynamic-align  -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs -fp-model source 

so still not 'precise'. Other flags though.

douglasjacobsen commented 8 years ago

I guess what I mean is more: why would one want to use -fp-model source? The man page describes it as something that (to me) seems kind of weird to want to use:

precise        Disables optimizations that are not value-safe on floating-point data and rounds intermediate results to source-defined precision.

fast[=1|2]     Enables more aggressive optimizations on floating-point data.

strict         Enables precise and except, disables contractions, and enables the property that allows modification of the floating-point environment.

source         Rounds intermediate results to source-defined precision.

douglasjacobsen commented 8 years ago

Well, it seems having it on does do some good.... turning it off gives the following diff in a 3 step ERS test (ocean only still):

    RMS o2x_So_t                         1.5348E-14            NORMALIZED  5.2686E-17
    RMS o2x_So_s                         6.8537E-15            NORMALIZED  1.9794E-16
    RMS o2x_So_u                         5.2180E-14            NORMALIZED  4.9374E-13
    RMS o2x_So_v                         1.3290E-14            NORMALIZED  1.8000E-13
    RMS o2x_So_dhdx                      8.1301E-23            NORMALIZED  4.9394E-15
    RMS o2x_So_dhdy                      8.6966E-23            NORMALIZED  1.9792E-15
    RMS o2x_Fioo_meltp                   3.9106E-10            NORMALIZED  1.7182E-15
    RMS x2oacc_So_duu10n                 8.4069E-17            NORMALIZED  1.1706E-18
    RMS x2oacc_Foxx_sen                  3.5161E-16            NORMALIZED  1.4057E-17
    RMS xaoo_So_u10                      5.2543E-18            NORMALIZED  6.7102E-19
    RMS xaoo_So_duu10n                   8.4069E-17            NORMALIZED  1.1706E-18
    RMS xaoo_Faox_sen                    4.8607E-16            NORMALIZED  1.1910E-17
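
To make the distinction between "small RMS difference" and "bit-for-bit" concrete, here is a minimal sketch of a field comparison. The function `compare_fields` is hypothetical, and the normalization used (RMS difference divided by the RMS of the first field) is an assumption for illustration; the comparison tool that produced the table above may normalize differently.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def compare_fields(a, b):
    """Return (rms_diff, normalized_rms_diff, is_bit_for_bit).

    The normalization by rms(a) is an illustrative assumption.
    """
    rms_diff = rms([x - y for x, y in zip(a, b)])
    reference = rms(a)
    normalized = rms_diff / reference if reference > 0.0 else 0.0
    bit_for_bit = all(x == y for x, y in zip(a, b))
    return rms_diff, normalized, bit_for_bit


# Made-up data: one value perturbed near roundoff level.
before_restart = [3.0, 4.0]
after_restart = [3.0, 4.0 + 1.0e-12]

print(compare_fields(before_restart, before_restart))
print(compare_fields(before_restart, after_restart))
```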

worleyph commented 8 years ago

Hi @douglasjacobsen , I was responding to the first comment by @singhbalwinder . Sorry if there was any confusion. I have no information on reasons to use, or not, -fp-model source.

douglasjacobsen commented 8 years ago

@worleyph No problem!

singhbalwinder commented 8 years ago

Thanks @worleyph. You are right, we always used "fp-model source" in FFLAGS. I got confused between FFLAGS and CFLAGS where we use "-fp-model precise" for CAM.

jonbob commented 8 years ago

Thanks @worleyph for the idea. I just completed an ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel test that passed. I had modified the Macros file to be more consistent with what CESM uses by changing -fp-model to precise and adding -no-opt-dynamic-align. Let me see if we can get away with not changing -fp-model -- that test is building right now. But if necessary, the two changes appear to fix the problem with reproducibility.

jonbob commented 8 years ago

My latest ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel passed as well. It just added the -no-opt-dynamic-align option to FFLAGS in the Macros file, but did not change the fp-model setting. So I believe that is all we need to do to make cori a functional platform for us. @worleyph - thanks for the idea, and could you please test it as well? Thanks

amametjanov commented 8 years ago

Checking on ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel with the latest master to confirm

jonbob commented 8 years ago

Thanks @amametjanov

amametjanov commented 8 years ago

ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel succeeded (with 1.6 SYPD). Thanks @jonbob.

jonbob commented 8 years ago

@amametjanov thanks for testing -- and I'm glad it worked

jonbob commented 8 years ago

@amametjanov - and thanks for getting the change committed!

amametjanov commented 8 years ago

Do you happen to have SYPD on ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel without this flag?

jonbob commented 8 years ago

I'll check

jonbob commented 8 years ago

I have numbers from a couple of smoke tests: on 1024 pes: 1.65, 1.62, 1.66, 1.72 Does that help?

amametjanov commented 8 years ago

Yes. I also checked on the ERS test without the flag and got 1.51 SYPD.

jonbob commented 8 years ago

@amametjanov How about with the fix? Did you see much performance degradation?

amametjanov commented 8 years ago

It's actually faster with the flag -- 1.6 SYPD: https://github.com/ACME-Climate/ACME/issues/774#issuecomment-215154772
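
For scale, the two throughput numbers quoted in this thread (1.51 SYPD without the flag, 1.6 SYPD with it) can be compared with a quick calculation. This is only arithmetic on those two single-run values; the variable names are illustrative, and with smoke-test numbers ranging from 1.62 to 1.72 a difference of this size may be within run-to-run variability.

```python
sypd_without_flag = 1.51  # ERS test without -no-opt-dynamic-align
sypd_with_flag = 1.6      # ERS test with -no-opt-dynamic-align

# Relative change in simulated years per day.
relative_change = (sypd_with_flag - sypd_without_flag) / sypd_without_flag
print(f"relative change: {relative_change:.1%}")
```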

jonbob commented 8 years ago

major score then!

jonbob commented 8 years ago

closing this then