Closed liuchihl closed 1 month ago
Can you put together a minimal working example that illustrates the issue?
I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so. After picking up the checkpoint, the output saving interval looked normal (not saving every iteration). The simple example is demonstrated as follows: here.
Our initial guess is that it might be related to #3056. However, after conducting some tests, such as avoiding setting intervals to transcendental numbers, the output saving interval after picking up the checkpoint is still 1 iteration for a while (which is not the desired behavior).
I noticed that when I use IterationInterval
instead of TimeInterval
, the problem is resolved.
I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so.
Ok, after changing from IterationInterval
to TimeInterval
in the 1D diffusion example (source), I am able to reproduce the same problem now.
Here is the progress message after picking up the checkpoint:
Can you paste your minimal example here?
If I understood correctly, I think that @glwagner just wanted to have the code directly pasted here.
using Oceananigans
using Oceananigans.Units
using Printf
Ns = 200 # number of time saves
T = 7200 # simulation stop time (s)
Δt = 1 # timestep (s)
grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid, timestepper=:RungeKutta3)
simulation = Simulation(model; Δt, stop_time = T)
progress_message(sim) = @printf("Iteration: %03d, time: %s, Δt: %s, wall time: %s\n",
iteration(sim), prettytime(sim), prettytime(sim.Δt), prettytime(sim.run_wall_time))
simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(T/Ns))
dir = "output/test_MWE"
## checkpoint
simulation.output_writers[:checkpointer] = Checkpointer(
model,
schedule=TimeInterval(T),
dir=dir,
prefix=string("checkpoint"),
cleanup=false)
file = string(dir,"/checkpoint_iteration3600.jld2")
run!(simulation,pickup=file)
Ah, I think you are right! Thanks for pasting that.
Nice, thanks. Yes it helps if we don't have to click links (especially if the links aren't permanent, we lose documentation of this issue).
Okay, here's my MWE which includes running the first simulation to generate the checkpoint file:
using Oceananigans
using Printf
""" Set up a simple simulation to test picking up from a checkpoint. """
function test_simulation(stop_time, Δt, δt)
grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt, stop_time)
progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(δt))
checkpointer = Checkpointer(model,
schedule = TimeInterval(stop_time),
prefix = "test",
cleanup = false)
simulation.output_writers[:checkpointer] = checkpointer
return simulation
end
rm("test_iteration*.jld2", force=true)
Δt = 1 # timestep (s)
T1 = 4 # first simulation stop time (s)
T2 = 2T1 # second simulation stop time (s)
δt = 2 # progress message frequency
# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)
# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
run!(simulation, pickup=checkpoint)
This reproduces the issue because I get
julia> include("test.jl")
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds
[ Info: ... simulation initialization complete (2.697 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (5.721 seconds).
[ Info: Iter: 2, time: 2 seconds
[ Info: Simulation is stopping after running for 8.786 seconds.
[ Info: Simulation time 4 seconds equals or exceeds stop time 4 seconds.
[ Info: Iter: 4, time: 4 seconds
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (529.973 μs)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (575.447 μs).
[ Info: Iter: 5, time: 5 seconds
[ Info: Iter: 6, time: 6 seconds
[ Info: Iter: 7, time: 7 seconds
[ Info: Simulation is stopping after running for 5.469 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 8, time: 8 seconds
When the second simulation runs, I think we expect to see Iter: 6
, and then Iter: 8
after the simulation stops.
I think the problem is basically that the schedules are not saved to the checkpoint. Actually, the output themselves are also not saved to checkpoint which is also an issue for time averages.
I believe this wasn't previously an issue but recent changes to TimeInterval
in #3616 may have created the problem...
Likely we can find some simple way to fix TimeInterval
but let's keep in mind that there are some broader challenges to be solved for checkpointing that basically will require a feature that can save callback and output writer states.
Here's a clue: this fixes the issue
Δt = 1 # timestep (s)
T1 = 4 # first simulation stop time (s)
T2 = 2T1 # second simulation stop time (s)
δt = 2 # progress message frequency
# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)
progress_cb = simulation.callbacks[:progress]
@show actuations = progress_cb.schedule.actuations
# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
progress_cb = simulation.callbacks[:progress]
progress_cb.schedule.actuations = actuations
run!(simulation, pickup=checkpoint)
Here's an even simpler MWE that illustrates the fundamental issue:
using Oceananigans
using Printf
grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)
progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(2))
# Run a simulation that saves data to a checkpoint
model.clock.iteration = 1 # we want to start here for some reason
run!(simulation)
which produces
julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (383.622 μs)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (620.680 μs).
[ Info: Iter: 2, time: 1 second
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Iter: 7, time: 6 seconds
[ Info: Simulation is stopping after running for 25.701 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 9, time: 8 seconds
Basically here there is a "spurious actuation" at the first iteration (here iteration 2, because we started from iteration 1).
This fixes the issue:
using Oceananigans
using Printf
grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)
progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
progress_cb = Callback(progress_message, TimeInterval(2))
simulation.callbacks[:progress] = progress_cb
# Run a simulation that saves data to a checkpoint
model.clock.iteration = 1 # we want to start here for some reason
progress_cb.schedule.actuations = 1
run!(simulation)
producing
julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info: ... simulation initialization complete (595.408 μs)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (876.240 μs).
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Simulation is stopping after running for 51.452 ms.
[ Info: Simulation time 6 seconds equals or exceeds stop time 6 seconds.
[ Info: Iter: 7, time: 6 seconds
Hopefully #3660 works and also doesn't break anything else
@glwagner Thanks for solving this issue, I've tested your PR, and it works great!
Great! FYI we like to keep the issue open until the PR is actually merged into main
. Since the issue is linked, it will automatically get closed when the PR is merged so no need for manual intervention. I've opened it up to let that happen.
Ah, gotcha!
Hi all, I have an issue with post-checkpoint runs. I've set the output saving interval to 30 minutes, but for the first few hours after picking up the checkpoint, the saving interval is actually equal to Δt (approximately 30 seconds in my case). Additionally, the progress message is generated every iteration (30 seconds) instead of every 30 minutes as intended. This strange behavior continues until iteration 13398, after which the interval and progress messages behave correctly (see the progress message below). It is unclear to me what went wrong. Any ideas would be much appreciated! (I can provide the code about how I set the
outputwriter
and progress message if needed; I am using Oceananigans v0.91.3, and Julia/1.10.2)