CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License
958 stars 191 forks source link

Post-Checkpoint Output Saving Interval Issue #3657

Closed liuchihl closed 1 month ago

liuchihl commented 1 month ago

Hi all, I have an issue with post-checkpoint runs. I've set the output saving interval to 30 minutes, but for the first few hours after picking up the checkpoint, the saving interval is actually equal to Δt (approximately 30 seconds in my case). Additionally, the progress message is generated every iteration (30 seconds) instead of every 30 minutes as intended. This strange behavior continues until iteration 13398, after which the interval and progress messages behave correctly (see the progress message below). It is unclear to me what went wrong. Any ideas would be much appreciated! (I can provide the code about how I set the outputwriter and progress message if needed; I am using Oceananigans v0.91.3, and Julia/1.10.2)

image

glwagner commented 1 month ago

Can you put together a minimal working example that illustrates the issue?

liuchihl commented 1 month ago

I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so. After picking up the checkpoint, the output saving interval looked normal (not saving every iteration). The simple example is demonstrated as follows: here.

Our initial guess is that it might be related to #3056. However, after conducting some tests, such as avoiding setting intervals to transcendental numbers, the output saving interval after picking up the checkpoint is still 1 iteration for a while (which is not the desired behavior).

I noticed that when I use IterationInterval instead of TimeInterval, the problem is resolved.

liuchihl commented 1 month ago

I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so.

Ok, after changing from IterationInterval to TimeInterval in the 1D diffusion example (source), I am able to reproduce the same problem now.

Here is the progress message after picking up the checkpoint:

image
glwagner commented 1 month ago

Can you paste your minimal example here?

liuchihl commented 1 month ago

Yes, of course, here is the MWE.

iuryt commented 1 month ago

If I understood correctly, I think that @glwagner just wanted to have the code directly pasted here.

using Oceananigans
using Oceananigans.Units
using Printf

Ns = 200    # number of time saves
T = 7200 # simulation stop time (s)
Δt = 1  # timestep (s)

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))

model = NonhydrostaticModel(; grid, timestepper=:RungeKutta3)

simulation = Simulation(model; Δt, stop_time = T)

progress_message(sim) = @printf("Iteration: %03d, time: %s, Δt: %s, wall time: %s\n",
    iteration(sim), prettytime(sim), prettytime(sim.Δt), prettytime(sim.run_wall_time))

simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(T/Ns))

dir = "output/test_MWE"

## checkpoint  
simulation.output_writers[:checkpointer] = Checkpointer(
                            model,
                            schedule=TimeInterval(T),
                            dir=dir,
                            prefix=string("checkpoint"),
                            cleanup=false)
file = string(dir,"/checkpoint_iteration3600.jld2")
run!(simulation,pickup=file)
liuchihl commented 1 month ago

Ah, I think you are right! Thanks for pasting that.

glwagner commented 1 month ago

Nice, thanks. Yes it helps if we don't have to click links (especially if the links aren't permanent, we lose documentation of this issue).

glwagner commented 1 month ago

Okay, here's my MWE which includes running the first simulation to generate the checkpoint file:

using Oceananigans
using Printf

""" Set up a simple simulation to test picking up from a checkpoint. """
function test_simulation(stop_time, Δt, δt)
    grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
    model = NonhydrostaticModel(; grid)
    simulation = Simulation(model; Δt, stop_time)

    progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
    simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(δt))

    checkpointer = Checkpointer(model,
                                schedule = TimeInterval(stop_time),
                                prefix = "test",
                                cleanup = false)

    simulation.output_writers[:checkpointer] = checkpointer

    return simulation
end

rm("test_iteration*.jld2", force=true)

Δt = 1    # timestep (s)
T1 = 4    # first simulation stop time (s)
T2 = 2T1  # second simulation stop time (s)
δt = 2    # progress message frequency

# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)

# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
run!(simulation, pickup=checkpoint)

This reproduces the issue because I get

julia> include("test.jl")
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds
[ Info:     ... simulation initialization complete (2.697 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (5.721 seconds).
[ Info: Iter: 2, time: 2 seconds
[ Info: Simulation is stopping after running for 8.786 seconds.
[ Info: Simulation time 4 seconds equals or exceeds stop time 4 seconds.
[ Info: Iter: 4, time: 4 seconds
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (529.973 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (575.447 μs).
[ Info: Iter: 5, time: 5 seconds
[ Info: Iter: 6, time: 6 seconds
[ Info: Iter: 7, time: 7 seconds
[ Info: Simulation is stopping after running for 5.469 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 8, time: 8 seconds

When the second simulation runs, I think we expect to see Iter: 6, and then Iter: 8 after the simulation stops.

glwagner commented 1 month ago

I think the problem is basically that the schedules are not saved to the checkpoint. Actually, the output themselves are also not saved to checkpoint which is also an issue for time averages.

I believe this wasn't previously an issue but recent changes to TimeInterval in #3616 may have created the problem...

Likely we can find some simple way to fix TimeInterval but let's keep in mind that there are some broader challenges to be solved for checkpointing that basically will require a feature that can save callback and output writer states.

glwagner commented 1 month ago

Here's a clue: this fixes the issue

Δt = 1    # timestep (s)
T1 = 4    # first simulation stop time (s)
T2 = 2T1  # second simulation stop time (s)
δt = 2    # progress message frequency

# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)

progress_cb = simulation.callbacks[:progress]
@show actuations = progress_cb.schedule.actuations

# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
progress_cb = simulation.callbacks[:progress]
progress_cb.schedule.actuations = actuations
run!(simulation, pickup=checkpoint)
glwagner commented 1 month ago

Here's an even simpler MWE that illustrates the fundamental issue:

using Oceananigans
using Printf

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)

progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(2))

# Run a simulation that saves data to a checkpoint
model.clock.iteration = 1 # we want to start here for some reason
run!(simulation)

which produces

julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (383.622 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (620.680 μs).
[ Info: Iter: 2, time: 1 second
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Iter: 7, time: 6 seconds
[ Info: Simulation is stopping after running for 25.701 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 9, time: 8 seconds

Basically here there is a "spurious actuation" at the first iteration (here iteration 2, because we started from iteration 1).

This fixes the issue:

using Oceananigans
using Printf

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)

progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
progress_cb = Callback(progress_message, TimeInterval(2))
simulation.callbacks[:progress] = progress_cb

# Run a simulation that saves data to a checkpoint
model.clock.iteration = 1 # we want to start here for some reason
progress_cb.schedule.actuations = 1
run!(simulation)

producing

julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (595.408 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (876.240 μs).
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Simulation is stopping after running for 51.452 ms.
[ Info: Simulation time 6 seconds equals or exceeds stop time 6 seconds.
[ Info: Iter: 7, time: 6 seconds
glwagner commented 1 month ago

Hopefully #3660 works and also doesn't break anything else

liuchihl commented 1 month ago

@glwagner Thanks for solving this issue, I've tested your PR, and it works great!

glwagner commented 1 month ago

Great! FYI we like to keep the issue open until the PR is actually merged into main. Since the issue is linked, it will automatically get closed when the PR is merged so no need for manual intervention. I've opened it up to let that happen.

liuchihl commented 1 month ago

Ah, gotcha!