Flush frequency in yaml outputs

ndkeen commented 5 months ago

I recently had a case running (it happened to be a PPE member) that was past its second day, and I needed to cancel it, thinking the data for that second day had already been written. Afterwards, looking at the data, the file is there but empty.

In atm.log, it does indicate we are "done" with the file:

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: output.scream.AutoCal.daily_avg_cosp_ne30pg2.AVERAGE.nhours_x24.2016-08-07-00000.nc
[EAMxx::scorpio_output] Writing variables to file
  file name: output.scream.AutoCal.daily_avg_cosp_ne30pg2.AVERAGE.nhours_x24.2016-08-07-00000.nc
  Done! Elapsed time: 0.004000 seconds
Atmosphere step = 6048
  model start-of-step time = 2016-08-08 00:00:00

Atmosphere step = 6049
  model start-of-step time = 2016-08-08 00:01:40

@bartgol explains that this might be scorpio not flushing, and that we have some control over it by adding flush_frequency: 1 to the yamls.
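For anyone who wants to apply this to several output yamls at once, here is a minimal sketch of a helper that sets the key in the text of a yaml file. Only the key name flush_frequency and the value 1 come from the suggestion above. The function name ensure_flush_frequency, the placeholder key in the usage example, and the placement of flush_frequency as a top-level (unindented) entry are assumptions, so check the placement against the EAMxx output documentation before relying on it.

```python
def ensure_flush_frequency(yaml_text: str, value: int = 1) -> str:
    """Return yaml_text with an unindented `flush_frequency: <value>` line.

    ASSUMPTION: flush_frequency is a top-level key of the output yaml.
    This is plain text handling, not a full YAML parser.
    """
    lines = yaml_text.splitlines()
    for i, line in enumerate(lines):
        # Only an unindented key counts as top-level here.
        if line.startswith("flush_frequency:"):
            lines[i] = f"flush_frequency: {value}"
            break
    else:
        # No existing entry was found, so append one at the end.
        lines.append(f"flush_frequency: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical yaml content; the key below is a placeholder, not a real option.
example = "some_existing_key: some_value\n"
updated = ensure_flush_frequency(example)
```

Running the helper twice on the same text leaves it unchanged, so it should be safe to apply to yamls that already set the key.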

There must be some perf impact from doing this, but unless it's severe, I would think we would generally want it. Could this actually explain why some of the data from the Cess sims are missing?

bartgol commented 5 months ago

@jayeshkrishna do you know how big of an impact we'd have if we flushed the output file after every write? I'm assuming it's non-negligible, but maybe still relatively small?

Edit: I don't mean "after each write_darray call", but rather "after all the write_darray and put_var calls within a timestep"...
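To make the trade-off concrete, here is a toy model of a stream that flushes once every N snapshots, where one snapshot stands for all the writes within a timestep. It is not how scorpio buffers data internally. The class BufferedStream, the function snapshots_surviving_crash, and the "flush every N snapshots" semantics are assumptions made purely for illustration.

```python
class BufferedStream:
    """Toy output stream: snapshots stay in memory until a flush."""

    def __init__(self, flush_frequency: int):
        self.flush_frequency = flush_frequency
        self.pending = []   # written by the model, not yet flushed
        self.on_disk = []   # flushed, so it survives a crash
        self.writes_since_flush = 0

    def write_snapshot(self, snapshot) -> None:
        self.pending.append(snapshot)
        self.writes_since_flush += 1
        # Flush once enough snapshots have accumulated.
        if self.writes_since_flush >= self.flush_frequency:
            self.flush()

    def flush(self) -> None:
        self.on_disk.extend(self.pending)
        self.pending.clear()
        self.writes_since_flush = 0


def snapshots_surviving_crash(flush_frequency: int, writes_before_crash: int) -> int:
    """Count snapshots on disk if the run dies right after the given writes."""
    stream = BufferedStream(flush_frequency)
    for step in range(writes_before_crash):
        stream.write_snapshot(step)
    # Crash or cancel: whatever is still pending is lost.
    return len(stream.on_disk)
```

In this model, flushing after every snapshot never loses completed snapshots, while a larger flush interval can lose up to one interval minus one snapshot. The cost side (extra I/O per flush) is not modeled here and would need to be measured.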

AaronDonahue commented 1 month ago

@bartgol , @jayeshkrishna I want to bring this issue back to life. We had a discussion about this in the eval call today.

AaronDonahue commented 1 month ago

@crterai can you comment briefly on how this impacted the CESS sims?

crterai commented 1 month ago

We have portions of the Cess production run that we're having to re-run because we are missing outputs from certain periods. This happened when the model crashed pretty close to a restart write and one of the output files was still filling up but hadn't flushed. An example is in /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-oct2/cess-control.ne1024pg2_ne1024pg2.F2010-SCREAMv1.cess-oct2/run

On 2020-04-20-03600 the output.scream.Cess.hourly2DVars. output stream started writing a new file.
On 2020-04-22-00000 a restart was written, but the output.scream.Cess.hourly2DVars. output stream wasn't flushed.
On 2020-04-22 17:21:40 the model crashed and output.scream.Cess.hourly2DVars.INSTANT.nhours_x1.2020-04-20-03600.nc remained empty because the output hadn't flushed.
When we went to restart the model, we started on 2020-04-22-00000. At that point, we started writing to a new file output.scream.Cess.hourly2DVars.INSTANT.nhours_x1.2020-04-22-03600.nc. That left us missing the data for 2020-04-20-03600 to 2020-04-22-00000.
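The size of that gap can be worked out from the timestamps alone. The sketch below assumes the trailing five digits of a stamp such as 2020-04-20-03600 are seconds since midnight, that the hourly INSTANT stream writes one snapshot per hour, and that both endpoints of the missing window are snapshots. The helper name parse_stamp is made up for this example.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp: str) -> datetime:
    """Parse 'YYYY-MM-DD-SSSSS', assuming SSSSS is seconds since midnight."""
    date_part, seconds = stamp.rsplit("-", 1)
    return datetime.strptime(date_part, "%Y-%m-%d") + timedelta(seconds=int(seconds))


first_missing = parse_stamp("2020-04-20-03600")  # first snapshot of the empty file
last_missing = parse_stamp("2020-04-22-00000")   # time of the restart write
step = timedelta(hours=1)                        # nhours_x1 output frequency

# Count hourly snapshots in the window, including both endpoints.
missing_snapshots = (last_missing - first_missing) // step + 1
print(f"{first_missing} to {last_missing}: {missing_snapshots} snapshots")
```

Under those assumptions the window runs from 01:00 on 2020-04-20 to 00:00 on 2020-04-22, which is 48 hourly snapshots.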

E3SM-Project / scream

Flush frequency in yaml outputs #2766