E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/

IO overwriting of monthly averages #2890

Open mahf708 opened 5 days ago

mahf708 commented 5 days ago

Another concerning issue in the EAMxx IO. Consider the following atm.log snippet:

Atmosphere step = 342143
  model start-of-step time = 2020-08-31 23:58:20

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
[EAMxx::scorpio_output] Writing variables to file
  file name: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc

The result: the monthly output file was overwritten. This happened in two instances in one run:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>>
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends

See internal outputs https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production from a recent run using commit https://github.com/E3SM-Project/scream/commit/29bdb81 on branch https://github.com/E3SM-Project/scream/tree/mahf708-ff-a73d48a
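The mismatch in the atm.log snippet above (model time 2020-08-31, output file stamped 2020-06-01) can be checked mechanically across a whole log. The following is a minimal sketch, not part of EAMxx: the regular expressions are inferred only from the log lines quoted in this issue, the function name `find_stale_writes` is hypothetical, and it assumes the date in a monthly-average file name marks the start of the averaging window, so a write more than one calendar month after that stamp is suspicious.

```python
import re
from datetime import datetime

# Patterns inferred from the atm.log snippet in this issue (an assumption,
# not a documented EAMxx log format).
TIME_RE = re.compile(
    r"model start-of-step time = (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)
FILE_RE = re.compile(
    r"\[EAMxx::output_manager\]\s+FILE:\s+(\S+\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc)"
)


def find_stale_writes(log_lines, max_months=1):
    """Return (model_time, filename) pairs where the file-name stamp is more
    than max_months calendar months behind the most recent model time."""
    stale = []
    model_time = None
    for line in log_lines:
        m = TIME_RE.search(line)
        if m:
            # Remember the latest model time seen before any FILE line.
            model_time = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            continue
        m = FILE_RE.search(line)
        if m and model_time is not None:
            stamp_year, stamp_month = int(m.group(2)), int(m.group(3))
            months_behind = (model_time.year - stamp_year) * 12 + (
                model_time.month - stamp_month
            )
            if months_behind > max_months:
                stale.append((model_time, m.group(1)))
    return stale


if __name__ == "__main__":
    # Hypothetical usage: the path is a placeholder for a real atm.log.
    with open("atm.log") as fh:
        for when, name in find_stale_writes(fh):
            print(f"{when}: wrote to {name}")
```

Run against the snippet above, this would flag the write to the 2020-06-01 file, since the model time is two calendar months past the stamp. The `max_months` threshold is a guess for `nmonths_x1` output and would need adjusting for other frequencies.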

crterai commented 5 days ago

I think this is the first time we've seen this, but I'm checking with @ndkeen to see whether he has seen something like it. @AaronDonahue @bartgol: any ideas on what might be going on here? If there's a fix, we should make sure to get it into @brhillman's decadal run, and we should keep an eye on the averaged output in the decadal sim until we find the cause and a solution.

AaronDonahue commented 2 days ago

@mahf708, can you share the YAML file for these outputs?

mahf708 commented 2 days ago

Here's the output yaml: https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/3969187877/1ma+ne30pg2.yaml, which is a verbatim copy of the outputs Ben is using (circa May 1) but with small additions.

AaronDonahue commented 1 day ago

thanks, I'll start working on this.

AaronDonahue commented 1 day ago

Does this happen w/ a restarted run?

mahf708 commented 1 day ago

> Does this happen w/ a restarted run?

We are unlikely to find a deterministic reproducer for this any time soon. This happened in two runs, on two separate occasions in each, so four times total. Here's how it played out (roughly):

The wildest thing? It starts behaving normally.

The short answer: yes, this can only happen in restarts. I think it is important to treat all four issues I have filed so far as one large issue (I suspect they are related).

Note in OP:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>> 2 files gone, 1 misnamed
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends; 2 files gone, 1 misnamed
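The gaps in the listing above can be found automatically from the file names alone. Here is a minimal sketch, assuming the `YYYY-MM-DD-SSSSS.nc` suffix seen in this issue; `missing_months` is a hypothetical helper, not an EAMxx utility. It only reports months missing between the first and last stamp, so files lost after the last stamp (as at the end of this simulation) would need the expected end date as an extra input.

```python
import re

# Date stamp at the end of the output file names shown in this issue
# (an assumed pattern based on those names only).
STAMP_RE = re.compile(r"\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc$")


def missing_months(filenames):
    """Return (year, month) pairs absent between the earliest and latest
    monthly stamps found in the given file names."""
    present = set()
    for name in filenames:
        m = STAMP_RE.search(name.strip())
        if m:
            # Encode year and month as a single running month index.
            present.add(int(m.group(1)) * 12 + int(m.group(2)) - 1)
    if not present:
        return []
    return [
        (idx // 12, idx % 12 + 1)
        for idx in range(min(present), max(present) + 1)
        if idx not in present
    ]


files = [
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc",
    "1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc",
]
print(missing_months(files))  # [(2020, 2), (2020, 3)]
```

A check like this says nothing about whether the surviving files hold the right data, so it complements rather than replaces inspecting the time axis inside each file.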