Closed rljacob closed 8 years ago
Does any compset restart correctly on cori? Asking @tangq or @bishtgautam
@rljacob : I tried A_WCYCL2000 and some active-MPAS-only compsets and nothing has passed yet, especially using the default machine configuration. I also ran A_WCYCL2000 with intel 15 instead of the default intel 16 and it failed the exact restart test as well
@rljacob exact restart tests fail for A_WCYCL2000 as well as active-ocn-only, but pass for active-ice-only. There may be trouble in other components as well, but definitely in ocn. I've tried to change the environment to be more similar to edison's, but so far that has had no impact. I'll also try the F case and see if the issue can be isolated to the ocean model.
@rljacob the issue seems to be isolated in mpas-o. The FC5AC1C exact restart test passes, as does the active-cice-only. I'll talk to @douglasjacobsen
@rljacob , I somehow missed your question (sorry). I haven't run jobs on Cori, so don't know the answer to your question. But I confirmed that the FC5AV1C-01 and the corresponding A_WCYCL2000 compsets restart correctly on edison.
That's ok @tangq . @jonbob checked an F case on cori and it works. In general, restart ability should be checked with all our machine-compiler combinations.
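For readers unfamiliar with what an exact restart test checks, here is a minimal sketch in Python. It is a toy, not the CIME ERS implementation: the `step` function, the state dictionary, and the `save` hook are all hypothetical stand-ins. The idea is only that an uninterrupted run and a run that stops, writes a restart, reloads, and continues must end in bit-for-bit identical state.

```python
import copy


def step(state):
    """Advance a toy model by one step (hypothetical stand-in for a real component)."""
    return {"t": state["t"] + 1, "x": state["x"] * 0.5 + 0.1 * state["t"]}


def run(state, nsteps):
    """Apply `step` nsteps times and return the final state."""
    for _ in range(nsteps):
        state = step(state)
    return state


def exact_restart_test(initial, nsteps, restart_at, save=copy.deepcopy):
    """Return True if a restarted run matches an uninterrupted run exactly.

    `save` models writing and re-reading a restart file; the default keeps
    every field, so the toy model should always pass with it.
    """
    reference = run(initial, nsteps)                 # one uninterrupted run
    checkpoint = save(run(initial, restart_at))      # stop and "write a restart"
    restarted = run(checkpoint, nsteps - restart_at)  # reload and finish
    return reference == restarted                    # exact equality, no tolerance
```

Passing a `save` function that drops or alters part of the state (for example, one that resets the step counter) makes the test fail, which is the kind of error an exact restart test is meant to catch.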
Just a small update. I was testing with: ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel
I'm not actually convinced that it's a problem with the ocean core.
If I turn off all of the dynamics I still see the issue, but also if I run the same exact setup multiple times, I get different differences.
I'll keep looking into it, but maybe someone else has some ideas of things to try.
I haven't had a chance to chase this yet, but I also believe that I have seen exactly the same job producing different results from run to run (on Cori at least), and this was in the atmosphere. I'll try to verify that my instance is repeatable. (My evidence is buried in amongst my other recent experiments, and I am not sure that this is actually the case.)
Thanks @worleyph. In my example case, the two fields that differ are: o2x_So_u and o2x_So_dhdx. So there is a coincidence that both of these are reconstructed fields in the same direction. I'll look through the ocean code to see if I can convince myself that there is an issue, but in case other people see effectively random results on cori, it might be good to post here or let other people know.
So, I am not sure what I am doing, nor what the behavior should be. Probably need to have this defined more carefully, and then to assign multiple people to look at this, one for each platform.
In my latest experiments, running A_WCYCL2000 with resolution ne30_oEC on Cori with nested OpenMP disabled (no -DCOLUMN_OPENMP), with BUILD_THREADED == TRUE, and a PE layout without OpenMP, multiple runs with identical executables and other runtime settings show divergence in atm.log output at nstep 5. I also tried running with BFBFLAG == TRUE, and two runs with this setting also diverged at nstep = 5, and differed from the BFBFLAG == FALSE run at nstep = 2 (though the latter is probably expected).
Most of the details described here are probably irrelevant, but this is the case I was working in when I decided to try this. I think that I am getting reproducible results on Titan, though jobs are taking forever to schedule there. (Jobs with different thread settings are getting identical results, which should be sufficient. Still need to run more tests.) I have not had the chance to do much on Edison - my jobs there are also taking forever to schedule.
Perhaps the F case would be sufficient to look at this in the atmosphere. I will not be able to continue this study myself, so encourage others to take over ( @amametjanov ? @ndkeen ? for Edison, and to verify my observations on Cori? @mrnorman to verify on Titan?) I'm guessing that working exact restart tests are sufficient to establish this, but getting something to compare Cori with would still be useful?
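The run-to-run check described above (two identical jobs whose atm.log output diverges at some step) can be sketched as a simple line-by-line comparison. This is a minimal sketch; `first_divergence` is a hypothetical helper, and the log lines in the usage example are invented, not real atm.log output.

```python
from itertools import zip_longest


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None.

    A missing line (one log shorter than the other) counts as a difference
    and is reported as None on the shorter side.
    """
    for i, (a, b) in enumerate(zip_longest(log_a, log_b)):
        if a != b:
            return i, a, b
    return None


# Invented example lines; a real log would be read with open(path).readlines().
run1 = ["nstep 1 0.125", "nstep 2 0.250", "nstep 3 0.375"]
run2 = ["nstep 1 0.125", "nstep 2 0.250", "nstep 3 0.376"]
result = first_divergence(run1, run2)
```

If the function returns None for repeated runs of the same executable, the runs are reproducible at the level of what the log prints; a non-None result gives the first place to look.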
Something interesting I found out this morning.
Results:
ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel - FAILED (Not BFB)
ERS_Ln3_D.T62_mpas120.CMPASO-NYF.corip1_intel - PASSED (BFB)
I'm not sure why yet, but I'll continue to look into it as I get time.
Another note. I used ERS_Ln3.T62_mpas120.CMPASO-NYF.corip1_intel where I changed -O2 to -O0 in the Macros file, and it passed the ERS test as well.
I think we're going to have to back-off the intel v16 compiler. Is v15 supported on cori?
Nevermind, I forgot @jonbob already tested v15 and the problem doesn't go away.
Does this problem exist only on Cori? That is, do these tests work fine on other machines?
So, github just crashed? I was trying to make the following comment:
It might be worthwhile trying an experiment with -fp-model precise.
I see that this is set for the C flags, but not the Fortran flags. Not arguing that we should require this level of numerical precision (though we do require -Kieee on Titan currently), but it might help resolve the source of the problem. If exact restarts become enabled again, we can then look for which routines are sensitive to this.
Pat
@worleyph Yes. I tried with that flag, and that's one of the ones that fixed it.
@singhbalwinder Yes, so far this only exists on cori.
Cori's scheduler is down right now too, so I can't submit anymore jobs to test until that comes back up.
@worleyph Also, based on the diff, I have two ideas for routines that might be sensitive to it, but the weird thing is there are multiple fields that are computed using these same routines and most of them don't have a diff.
I think, in the atm model, we have this flag ( -fp-model precise or -fp-model strict) always turned on. Is there a huge performance penalty in using this flag?
In the recent past, the NCAR team found that when this flag is omitted with -O3 optimization, intel v15.0 produces weird answers on the Yellowstone machine. The model runs fine but the answers are wrong.
FYI:
Looking at a case on cori, the model settings are:
CFLAGS:= -O2 -fp-model precise
FFLAGS:= -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs -fp-model source
FFLAGS += -O2
The same settings are used on Edison (as on Cori).
@singhbalwinder , please note that "I think, in the atm model, we have this flag ( -fp-model precise or -fp-model strict) always turned on." is not true in ACME. If you expected that it was on, you need to add it (back) in.
Turns out this needs -fp-model precise to be BFB across restarts.
Thanks @worleyph . I was working with pre-cime version of ACME when I noticed this flag turned on by default (at least on Yellowstone). It might have changed post-cime.
@douglasjacobsen : That is an interesting find as same flags combination with similarish environment works fine on Edison but not on Cori (if I am getting it right).
Yes, that's correct, and I agree it's very weird. I'm testing now with removing -fp-model source, since it seems like kind of a weird flag to be using (anyone know why we have this in the default?).
For the most part, ACME settings were preserved in ACME post-cime, which is why we have machines-acme, for example. Probably not worth trying to determine the history of this setting though.
The current settings in CESM for the intel Fortran compiler are:
-no-opt-dynamic-align -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs -fp-model source
so still not 'precise'. Other flags though.
I guess more what I mean is why would one want to use -fp-model source? The man page describes it as something that (to me) seems kind of weird to want to use:
precise Disables optimizations that are not value-safe on floating-point data and rounds intermediate results to source-defined precision.
fast[=1|2] Enables more aggressive optimizations on floating-point data.
strict Enables precise and except, disables contractions, and enables the property that allows modification of the floating-point environment.
source Rounds intermediate results to source-defined precision.
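As background on what "not value-safe" can mean, here is a small Python sketch showing that floating-point addition is not associative, so an optimization that reorders a sum can change the answer. This illustrates floating-point arithmetic in general; it is not a claim about which transformation the Intel compiler applied in this case, and `sum_in_order` is a hypothetical helper.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as written."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two orderings.
# At magnitude 1e16 the spacing between doubles is 2, so adding 1.0 is lost.
forward = sum_in_order([1.0e16, 1.0, -1.0e16])    # (1e16 + 1) - 1e16
reordered = sum_in_order([1.0e16, -1.0e16, 1.0])  # (1e16 - 1e16) + 1

print(forward, reordered)  # prints: 0.0 1.0
```

Differences of this kind are usually tiny relative to the field values, but any nonzero difference is enough to fail a bit-for-bit comparison.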
Well, it seems having it on does do some good.... turning it off gives the following diff in a 3 step ERS test (ocean only still):
RMS o2x_So_t 1.5348E-14 NORMALIZED 5.2686E-17
RMS o2x_So_s 6.8537E-15 NORMALIZED 1.9794E-16
RMS o2x_So_u 5.2180E-14 NORMALIZED 4.9374E-13
RMS o2x_So_v 1.3290E-14 NORMALIZED 1.8000E-13
RMS o2x_So_dhdx 8.1301E-23 NORMALIZED 4.9394E-15
RMS o2x_So_dhdy 8.6966E-23 NORMALIZED 1.9792E-15
RMS o2x_Fioo_meltp 3.9106E-10 NORMALIZED 1.7182E-15
RMS x2oacc_So_duu10n 8.4069E-17 NORMALIZED 1.1706E-18
RMS x2oacc_Foxx_sen 3.5161E-16 NORMALIZED 1.4057E-17
RMS xaoo_So_u10 5.2543E-18 NORMALIZED 6.7102E-19
RMS xaoo_So_duu10n 8.4069E-17 NORMALIZED 1.1706E-18
RMS xaoo_Faox_sen 4.8607E-16 NORMALIZED 1.1910E-17
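For reference, an RMS difference like the ones listed above can be computed as follows. This is a minimal sketch: the function names are hypothetical, and the normalization (dividing by the mean absolute value of the first field) is an assumption chosen for illustration, not necessarily the definition the comparison tool uses for its NORMALIZED column.

```python
import math


def rms_diff(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("fields must have the same length")
    if not a:
        raise ValueError("fields must not be empty")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def normalized_rms_diff(a, b):
    """RMS difference scaled by the mean absolute value of `a`.

    ASSUMPTION: this scaling is illustrative only; the real tool may
    normalize differently.
    """
    rms = rms_diff(a, b)
    scale = sum(abs(x) for x in a) / len(a)
    if scale == 0.0:
        return 0.0 if rms == 0.0 else math.inf
    return rms / scale
```

A bit-for-bit pass corresponds to `rms_diff` returning exactly 0.0 for every field; any nonzero value, however small, is a failure.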
Hi @douglasjacobsen , I was responding to the first comment by @singhbalwinder . Sorry if there was any confusion. I have no information on reasons to use, or not, -fp-model source.
@worleyph No problem!
Thanks @worleyph. You are right, we always used "fp-model source" in FFLAGS. I got confused between FFLAGS and CFLAGS where we use "-fp-model precise" for CAM.
Thanks @worleyph for the idea. I just completed an ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel test that passed. I had modified the Macros file to be more consistent with what CESM uses by changing -fp-model to precise and adding -no-opt-dynamic-align. Let me see if we can get away with not changing -fp-model -- that test is building right now. But if necessary, the two changes appear to fix the problem with reproducibility.
My latest ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel passed as well. It just added the -no-opt-dynamic-align option to FFLAGS in the Macros file, but did not change the fp-model setting. So I believe that is all we need to do to make cori a functional platform for us. @worleyph - thanks for the idea, and could you please test it as well? Thanks
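The Macros edit described above amounts to appending one option to the FFLAGS definition. A minimal Python sketch of that edit follows, assuming the Macros file contains a single line beginning with `FFLAGS:=` as in the settings quoted earlier in this thread. `add_fflag` is a hypothetical helper, and a real Macros file may define FFLAGS in several places or inside conditionals, which this sketch does not handle.

```python
def add_fflag(macros_text, flag):
    """Append `flag` to each line starting with 'FFLAGS:=' unless already present."""
    out = []
    for line in macros_text.splitlines():
        if line.startswith("FFLAGS:=") and flag not in line.split():
            line = line.rstrip() + " " + flag
        out.append(line)
    return "\n".join(out)


# Abbreviated example based on the flags quoted in this thread.
macros = (
    "CFLAGS:= -O2 -fp-model precise\n"
    "FFLAGS:= -convert big_endian -assume byterecl -fp-model source\n"
    "FFLAGS += -O2"
)
patched = add_fflag(macros, "-no-opt-dynamic-align")
```

The helper is idempotent, so running it twice leaves the file unchanged after the first application.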
Checking on ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel with the latest master to confirm
Thanks @amametjanov
ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel succeeded (with 1.6 SYPD). Thanks @jonbob.
@amametjanov thanks for testing -- and I'm glad it worked
@amametjanov - and thanks for getting the change committed!
Do you happen to have SYPD on ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel without this flag?
I'll check
I have numbers from a couple of smoke tests on 1024 pes: 1.65, 1.62, 1.66, 1.72. Does that help?
Yes. I also checked on the ERS test without the flag and got 1.51 SYPD.
@amametjanov How about with the fix? Did you see much performance degradation?
It's actually faster with the flag -- 1.6 SYPD: https://github.com/ACME-Climate/ACME/issues/774#issuecomment-215154772
major score then!
closing this then
Report from @jonbob: I ran some further tests, and we appear to be OK on edison -- meaning the exact restart tests pass. But I see very different behavior on cori, which I assume is due to the newer version of the intel compiler. Specifically, our active-ocn-only tests fail, so any test -- or run -- with the ocean model will fail the exact restart check. I'll try to test on cori with the intel15 compiler, and make sure that's the root of the problem. But in the meantime, I'm not sure we should be using cori.
Even with intel 15, the A_WCYCL2000 exact restart test fails on cori. So we should avoid running there until we figure out what's going on.