E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Flush frequency in yaml outputs #2766

Open ndkeen opened 5 months ago

ndkeen commented 5 months ago

I recently had a case running (it happened to be a PPE member) that was past its second day when I needed to cancel it, thinking the data for that second day had already been written. Afterwards, looking at the data, the file is there but empty.

In atm.log, it does indicate we are "done" with the file:

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: output.scream.AutoCal.daily_avg_cosp_ne30pg2.AVERAGE.nhours_x24.2016-08-07-00000.nc
[EAMxx::scorpio_output] Writing variables to file
  file name: output.scream.AutoCal.daily_avg_cosp_ne30pg2.AVERAGE.nhours_x24.2016-08-07-00000.nc
  Done! Elapsed time: 0.004000 seconds
Atmosphere step = 6048
  model start-of-step time = 2016-08-08 00:00:00

Atmosphere step = 6049
  model start-of-step time = 2016-08-08 00:01:40

@bartgol explains that scorpio might not be flushing, and that we have some control over this by adding flush_frequency: 1 to the yamls.

There must be some perf impact of doing this, but unless it's severe, I would think we would generally want this? Could it actually explain why some of the data from Cess sims are missing?
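
For anyone unfamiliar with the symptom, here is a minimal sketch of how a write can be reported as done while the file on disk is still empty. It uses plain Python buffered file I/O as a stand-in and is not scorpio or EAMxx code; the helper name `size_before_and_after_flush` and the 1 MiB buffer size are made up for illustration. The assumption is only that the I/O layer buffers in memory in a broadly similar way.

```python
import os
import tempfile


def size_before_and_after_flush(payload: bytes):
    """Return the on-disk file size right after write() and right after flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # A 1 MiB buffer means a small payload stays in memory after write().
        with open(path, "wb", buffering=1024 * 1024) as f:
            f.write(payload)                # the "write" has returned successfully
            before = os.path.getsize(path)  # but nothing has reached the file yet
            f.flush()                       # push the buffer to the file
            after = os.path.getsize(path)
    finally:
        os.remove(path)
    return before, after


before, after = size_before_and_after_flush(b"x" * 100)
print(before, after)  # prints "0 100"
```

If the process is killed between the write and the flush, the file exists but has no data in it, which matches what I saw.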

bartgol commented 5 months ago

@jayeshkrishna do you know how big of an impact we'd have if we flushed the output file after every write? I'm assuming it's non-negligible, but maybe still relatively small?

Edit: I don't mean "after each write_darray call", but rather "after all the write_darray and put_var calls within a timestep"...
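
To make the trade-off concrete, here is a toy model of what "flush after all the writes within a timestep" could look like, with a flush every N output steps. This is only a sketch: `StepWriter`, `simulate`, and the in-memory lists are hypothetical, it does not call scorpio, and it assumes flush_frequency counts output steps, which may not match the actual EAMxx semantics.

```python
class StepWriter:
    """Toy writer that buffers records and flushes every `flush_frequency` steps."""

    def __init__(self, flush_frequency: int):
        self.flush_frequency = flush_frequency
        self.pending = []   # written but not yet flushed (lost on a crash)
        self.on_disk = []   # flushed (survives a crash)
        self.steps = 0
        self.flushes = 0

    def write_step(self, records):
        # All per-variable writes for one output step go into the buffer first.
        self.pending.extend(records)
        self.steps += 1
        # Flush once per step boundary, not once per variable.
        if self.steps % self.flush_frequency == 0:
            self.flush()

    def flush(self):
        self.on_disk.extend(self.pending)
        self.pending.clear()
        self.flushes += 1


def simulate(flush_frequency: int, n_steps: int = 5, vars_per_step: int = 3):
    """Run n_steps output steps, then 'crash'; report (saved, lost, flush count)."""
    writer = StepWriter(flush_frequency)
    for step in range(n_steps):
        writer.write_step([(step, var) for var in range(vars_per_step)])
    return len(writer.on_disk), len(writer.pending), writer.flushes


print(simulate(1))  # (15, 0, 5): nothing lost, five flushes
print(simulate(4))  # (12, 3, 1): last step lost, one flush
```

The number of flushes is the cost and the number of lost records is the risk, so the question is really how expensive each flush is at scale.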

AaronDonahue commented 1 month ago

@bartgol , @jayeshkrishna I want to bring this issue back to life. We had a discussion about this in the eval call today.

AaronDonahue commented 1 month ago

@crterai can you comment briefly on how this impacted the CESS sims?

crterai commented 1 month ago

We had portions of the Cess production run that we're having to re-run because we are missing outputs from certain periods. This happened when the model crashed pretty close to a restart write and one of the output files was still filling up but hadn't been flushed. An example is in /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-oct2/cess-control.ne1024pg2_ne1024pg2.F2010-SCREAMv1.cess-oct2/run