E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

MPAS output all zero values in a middle of a v3 coupled simulation #6051

Open zhangshixuan1987 opened 1 year ago

zhangshixuan1987 commented 1 year ago

With master (hash 84e50561a854e1888b0eaa52fc3a44287f3a5924), I've been trying to run a fully coupled simulation with atmospheric nudging to test the impact of wind forcing over the subpolar North Atlantic on the AMOC. The simulation was run on pm-cpu with the Intel compiler and is documented on the following Confluence page,

In brief,

One problem appeared when I checked the results from the MPAS diagnostics. A kink appears around years 0034-0035, as shown in the figure below for the ocean heat content:

image

Similar issues are also seen in the AMOC time series.

Further diagnostics traced the issue to the MPAS-Ocean output at 0034-10-01: almost all quantities are zero in the model history file (mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc). Only this file has the issue; the other history files look correct.
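A check along the following lines could flag such a file automatically. This is only a minimal sketch, not the diagnostic actually used in this thread: `all_zero_variables` and `load_variables` are hypothetical helper names, and the loader assumes the third-party `netCDF4` and `numpy` packages are available.

```python
from typing import Dict, Iterable, List


def all_zero_variables(variables: Dict[str, Iterable[float]]) -> List[str]:
    """Return the sorted names of variables whose values are all exactly zero."""
    suspect = []
    for name, values in variables.items():
        values = list(values)
        # Empty variables are skipped; only non-empty, all-zero data is flagged.
        if values and all(v == 0 for v in values):
            suspect.append(name)
    return sorted(suspect)


def load_variables(path: str) -> Dict[str, List[float]]:
    """Read the numeric variables of a NetCDF file into flat lists.

    Hypothetical helper; assumes netCDF4 and numpy are installed.
    It loads whole arrays into memory, so it is only suited to spot checks.
    """
    import netCDF4  # imported lazily so the check above has no dependencies
    import numpy as np

    data = {}
    with netCDF4.Dataset(path) as ds:
        for name, var in ds.variables.items():
            # Keep integer and floating-point variables; skip strings and chars.
            if getattr(var.dtype, "kind", "") in ("i", "u", "f"):
                # compressed() drops masked (missing) entries and flattens.
                data[name] = np.ma.compressed(var[:]).tolist()
    return data
```

For example, `all_zero_variables(load_variables("mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc"))` would list the variables that contain nothing but zeros. Some fields can legitimately be all zero, so a long list is a hint to look closer rather than proof of a bad write.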

We note that the 0034-10-01 file was written in the middle of the simulation, and the model neither crashed nor reported an error during the whole simulation period of 0034-01-01 -- 0043-09-11. It therefore seems that this could be a hiccup or a bug related to the I/O infrastructure (in the model, the file system, or the I/O nodes, if pm-cpu uses them).

Reported here in case it recurs. For this case, we are going to re-run year 0034 to see if simulation data beyond the problematic month are affected.

xylar commented 1 year ago

@zhangshixuan1987, my first guess would be that this was a glitch of some sort in the Perlmutter file system. I haven't seen a problem like this before that I recall. Could you try rerunning just year 0034 from a restart file and see if the output gets corrected?

zhangshixuan1987 commented 1 year ago

Following suggestions from @wlin7 and @xylar, I conducted a "continue run" from the restart files saved at 0034-01-01. The simulation was run for 2 years, from 0034-01-01 to 0036-01-01, and the model output was saved. The newly generated output for 0034-01-01 -- 0036-01-01 was used to replace the old output files for that period. Then I reran the MPAS diagnostics. The kink around years 0034-0035 in the ocean heat content figure has now disappeared:

image

I also checked the regenerated history file "mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc", and all quantities in this file now have reasonable values rather than zeros. Therefore, I think @xylar is correct that the issue is likely due to "a glitch of some sort in the Perlmutter file system". However, why such a glitch showed up in my simulation is still not clear to me.

xylar commented 1 year ago

@zhangshixuan1987, I agree, this is mysterious and frustrating. Certainly if it happens again, we need to figure out a way to reproduce it so we can prevent it from happening again. For now, let's hope it's a one-time event!

rljacob commented 1 year ago

Adding @ndkeen and @jayeshkrishna to note glitch.

zhangshixuan1987 commented 1 year ago

Following suggestions from Wuyin (@wlin7), I also ran "/global/cfs/cdirs/e3sm/tools/cprnc/cprnc" on the file

Overall, the two files are likely bit-for-bit identical, suggesting that the simulation for the other components was not affected.
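To illustrate what "bit-for-bit identical" means here, the sketch below compares two sets of variables by their raw 8-byte floating-point representation instead of by numeric equality. It is not how cprnc is implemented; `float_bits` and `differing_variables` are hypothetical names, and the inputs are assumed to be plain dictionaries mapping variable names to lists of floats.

```python
import struct
from typing import Dict, List, Sequence


def float_bits(value: float) -> bytes:
    """Pack a float into its 8 raw bytes (little-endian IEEE 754 double)."""
    return struct.pack("<d", value)


def differing_variables(
    a: Dict[str, Sequence[float]], b: Dict[str, Sequence[float]]
) -> List[str]:
    """Return the sorted names of variables that are not bit-for-bit identical.

    A variable present in only one input also counts as differing.
    """
    diffs = []
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            diffs.append(name)
            continue
        xs, ys = a[name], b[name]
        # Compare raw bytes so that 0.0 and -0.0 are treated as different.
        if len(xs) != len(ys) or any(
            float_bits(x) != float_bits(y) for x, y in zip(xs, ys)
        ):
            diffs.append(name)
    return diffs
```

An empty result means every shared variable matches bit for bit. Comparing bytes rather than values is stricter than `==`: values that compare equal numerically, such as `0.0` and `-0.0`, are still reported as different.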

ndkeen commented 1 year ago

Just noting that we had a similar-sounding issue a few years ago, but surely it's not the same thing. https://github.com/E3SM-Project/E3SM/issues/4174