CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License
988 stars 194 forks source link

Immediate NaNs running a non-hydrostatic model with an immersed boundary and open boundary conditions and immersed pressure solver #3831

Closed ali-ramadhan closed 2 weeks ago

ali-ramadhan commented 3 weeks ago

I'm trying to set up a small test where a zonal velocity forced by open boundary conditions goes around an immersed sea mount, using the new immersed pressure solver.

Unfortunately I get immediate NaNs on iteration 1. Maybe there's a bug somewhere? I know I'm using a bunch of experimental features together so perhaps this is not surprising. My setup could also be bad though.

Curious if anyone has any insights on what's going wrong here.

Does the pressure solver need to be modified to account for non-zero velocities at the boundaries? I guess the FFT pressure solver assumes either periodic or no-penetration at the boundaries, but then shouldn't the conjugate gradient solver converge on the correct pressure with enough iterations? Or maybe not if the pre-conditioner is very wrong?

"MWE" setup:

using Printf
using Oceananigans
using Oceananigans.Solvers: ConjugateGradientPoissonSolver, fft_poisson_solver

L = 100
H = 100

underlying_grid = RectilinearGrid(
    CPU(),
    Float64,
    topology = (Bounded, Bounded, Bounded),
    size = (16, 16, 16),
    x = (0, L),
    y = (0, L),
    z = (-H, 0)
)

h = H/2
w = L/5
mount(x, y) = h * exp(-x^2 / 2w^2) * exp(-y^2 / 2w^2)
bottom(x, y) = -H + mount(x, y)

grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom))

@inline u_inflow(y, z, t) = 0.01

u_bcs = FieldBoundaryConditions(
    west = OpenBoundaryCondition(u_inflow),
    east = OpenBoundaryCondition(u_inflow)
)

boundary_conditions = (; u=u_bcs)

model = NonhydrostaticModel(;
    grid,
    boundary_conditions,
    timestepper = :RungeKutta3,
    pressure_solver = ConjugateGradientPoissonSolver(
        grid;
        preconditioner = fft_poisson_solver(grid.underlying_grid)
    )
)

simulation = Simulation(model; Δt=0.1, stop_time=60)

progress(sim) = @printf(
    "iteration: %d, time: %.4f, U_max=(%.2e, %.2e, %.2e)\n",
    iteration(simulation),
    time(simulation),
    maximum(abs, model.velocities.u),
    maximum(abs, model.velocities.v),
    maximum(abs, model.velocities.w)
)

simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

run!(simulation)

Output:

[ Info: Initializing simulation...
iteration: 0, time: 0.0000, U_max=(0.00e+00, 0.00e+00, 0.00e+00)
[ Info:     ... simulation initialization complete (6.266 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (20.727 seconds).
iteration: 1, time: 0.1000, U_max=(NaN, NaN, NaN)
iteration: 2, time: 0.2000, U_max=(NaN, NaN, NaN)
iteration: 3, time: 0.3000, U_max=(NaN, NaN, NaN)
iteration: 4, time: 0.4000, U_max=(NaN, NaN, NaN)
iteration: 5, time: 0.5000, U_max=(NaN, NaN, NaN)
iteration: 6, time: 0.6000, U_max=(NaN, NaN, NaN)
iteration: 7, time: 0.7000, U_max=(NaN, NaN, NaN)
iteration: 8, time: 0.8000, U_max=(NaN, NaN, NaN)

cc @tomchor

glwagner commented 3 weeks ago

I guess the FFT pressure solver assumes either periodic or no-penetration at the boundaries

This isn't true --- our pressure solver can handle any boundary conditions in principle.

Unfortunately I get immediate NaNs on iteration 1. Maybe there's a bug somewhere? I know I'm using a bunch of experimental features together so perhaps this is not surprising. My setup could also be bad though.

I believe this problem is ill-posed. I would first try periodic boundary conditions with a sponge layer on one side. You probably need a radiation condition on the right if you want to use open boundaries...

shouldn't the conjugate gradient solver converge on the correct pressure with enough iterations? Or maybe not if the pre-conditioner is very wrong?

You can test this by omitting the preconditioner.

It might be appropriate to convert this to a discussion, at least until we can prove there is a bug to be fixed.

tomchor commented 3 weeks ago

@ali-ramadhan interestingly, when I run your exact example things take a little longer to blow up. I wonder what gives:

[ Info:     ... simulation initialization complete (7.057 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (36.179 seconds).
iteration: 1, time: 0.1000, U_max=(1.31e-02, 4.98e-03, 5.51e-03)
iteration: 2, time: 0.2000, U_max=(1.30e-02, 4.96e-03, 5.50e-03)
iteration: 3, time: 0.3000, U_max=(1.39e-02, 5.50e-03, 5.45e-03)
iteration: 4, time: 0.4000, U_max=(NaN, NaN, NaN)
iteration: 5, time: 0.5000, U_max=(NaN, NaN, NaN)
iteration: 6, time: 0.6000, U_max=(NaN, NaN, NaN)

In any case, when I don't specify the pressure solver and let default to the FFT-based one, things seem to run fine. So imo it does seem to point to the conjugate gradient solver struggling to find a valid solution close to the immersed boundary.

PS: There seems to be a small bug when specifying the conjugate solver without specifying the preconditioner:

julia> ConjugateGradientPoissonSolver(grid)
ERROR: UndefVarError: `ImmersedBoundaryGrid` not defined
Stacktrace:
 [1] ConjugateGradientPoissonSolver(grid::ImmersedBoundaryGrid{…}; preconditioner::Oceananigans.Solvers.DefaultPreconditioner, reltol::Float64, abstol::Float64, kw::@Kwargs{})
   @ Oceananigans.Solvers ~/repos/Oceananigans.jl2/src/Solvers/conjugate_gradient_poisson_solver.jl:54
 [2] ConjugateGradientPoissonSolver(grid::ImmersedBoundaryGrid{Float64, Bounded, Bounded, Bounded, RectilinearGrid{…}, GridFittedBottom{…}, Nothing, Nothing, CPU})
   @ Oceananigans.Solvers ~/repos/Oceananigans.jl2/src/Solvers/conjugate_gradient_poisson_solver.jl:47
jagoosw commented 3 weeks ago

I tried running a non hydrostatic case with open boundaries and a coastal bathymetry the other week and it also NaNed very quickly with the immersed pressure solver but ran fine ish with FFT.

I assumed it was my set up but perhaps there is an issue with how the pressure solver deals with boundary conditions?

glwagner commented 2 weeks ago

I tried running a non hydrostatic case with open boundaries and a coastal bathymetry the other week and it also NaNed very quickly with the immersed pressure solver but ran fine ish with FFT.

Can you produce some code or an MWE? I'm interested in working on this, but I have no cases that have an issue so there's little I can do to make progress. How is the problem forced?

I believe @ali-ramadhan's case is ill-posed as I stated. We can try to test this by using a sponge layer (or perhaps proper open boundary conditions) rather than

@inline u_inflow(y, z, t) = 0.01

u_bcs = FieldBoundaryConditions(
    west = OpenBoundaryCondition(u_inflow),
    east = OpenBoundaryCondition(u_inflow)
)

which I don't think will work.

perhaps there is an issue with how the pressure solver deals with boundary conditions?

Both solvers treat boundary conditions the same way:

https://github.com/CliMA/Oceananigans.jl/blob/0210e6646d6bab93e2c2d579b7e389e1ccc0c5db/src/Models/NonhydrostaticModels/pressure_correction.jl#L13-L15

(I can explain more how the math works out if there is interest.)

Can you set maxiter=10 in the solver and see if you still get NaNs?

I also suggest playing with reltol and abstol.

glwagner commented 2 weeks ago

I also suggest trying DiagonallyDominantPreconditioner:

https://github.com/CliMA/Oceananigans.jl/blob/0210e6646d6bab93e2c2d579b7e389e1ccc0c5db/src/Solvers/conjugate_gradient_poisson_solver.jl#L137

tomchor commented 2 weeks ago

I believe @ali-ramadhan's case is ill-posed as I stated. We can try to test this by using a sponge layer (or perhaps proper open boundary conditions) rather than

@inline u_inflow(y, z, t) = 0.01

u_bcs = FieldBoundaryConditions(
    west = OpenBoundaryCondition(u_inflow),
    east = OpenBoundaryCondition(u_inflow)
)

which I don't think will work.

For the record, I did try this yesterday with FlatExtrapolationOpenBCs and it also blew-up.

glwagner commented 2 weeks ago

For the record, I did try this yesterday with FlatExtrapolationOpenBCs and it also blew-up.

Ok great. How about the same problem (including CG pressure solver) but no immersed boundaries?

Something to keep an eye on is the number of iterations the CG solver does. It'd be good to report what is going on with those during / prior to breakup. You can call iteration(model.pressure_solver) to diagnose it (where it will correspond to the previous solve).

interestingly, when I run your exact example things take a little longer to blow up

This is important; the dependence on time step should be diagnosed.

jagoosw commented 2 weeks ago

I'm drafting a reply to the rest but could the problem be that 0.92 doesn't have #3802 released?

jagoosw commented 2 weeks ago

I don't think I understand how this is ill-posed? It is over specified so will not produce physical results but I thought that without a radiating condition this should still not NaN straight away there should just be a lot of reflections from the boundaries?

Here is a simplified version of my code: https://github.com/jagoosw/OpenBoundaries.jl/blob/main/validation/bug_mwe.jl

I am using a matching scheme that advects the perturbation component (defined as the boundary adjacent minus imposed velocity) out of the domain, and relaxes to the imposed velocity at different rates depending on if it is an inflow or outflow (for the v component the timescale is Inf for outflows to allow it to maximally transmit information out). I can run it with no matching scheme but it needs tiny timesteps because the noise at the boundaries becomes massive.

When I use the default pressure solver it kind of works. There are some bugs, for example, there is this weird jet generation on the southern inflowing boundary. I think these would be solved with relaxing regions.

https://github.com/user-attachments/assets/a85e66e2-3da7-402a-b546-57a3860cef9c

If I run with the CG solver it NaNs "immediately" and is doing ~800 iterations. If I restrict the iterations it doesn't NaN as fast, but generates very high velocities in a random place:

https://github.com/user-attachments/assets/d7ea836e-f69c-4f3a-9559-c64216e95cb0

I think there is also an issue with my model that its not respecting when a boundary adjacent cell is immersed so I'll fix that and get back to you.

Perhaps the "immediate" NaNs are actually just from the timestep not being small enough as the reflections and bathymetry interactions make some very high velocities (in my case ~40x higher than the inflows)

ali-ramadhan commented 2 weeks ago

@ali-ramadhan interestingly, when I run your exact example things take a little longer to blow up.

I tried it some more and yes it seems like sometimes it takes a few time steps but it always blows up within the first ~5 iterations.

For the record, I did try this yesterday with FlatExtrapolationOpenBCs and it also blew-up.

I tried this too and it also blew up for me.

I'm drafting a reply to the rest but could the problem be that 0.92 doesn't have #3802 released?

Ah I should have mentioned this in my original post but I did run from an up-to-date main branch which had that PR merged in.

Will try some of @glwagner's suggestions and decrease the time step like @jagoosw did!

ali-ramadhan commented 2 weeks ago

Here is a simplified version of my code: https://github.com/jagoosw/OpenBoundaries.jl/blob/main/validation/bug_mwe.jl

Was curious to have a look but I think the link might be dead?

ali-ramadhan commented 2 weeks ago

I updated the original MWE slightly to use a FlatExtrapolationOpenBoundaryCondition on the eastern boundary and print more info about the solver:

using Printf
using Statistics
using Oceananigans
using Oceananigans.BoundaryConditions: FlatExtrapolationOpenBoundaryCondition
using Oceananigans.Solvers: ConjugateGradientPoissonSolver, fft_poisson_solver
using Oceananigans.Simulations: NaNChecker

L = 100
H = 100

underlying_grid = RectilinearGrid(
    CPU(),
    Float64,
    topology = (Bounded, Bounded, Bounded),
    size = (16, 16, 16),
    x = (0, L),
    y = (0, L),
    z = (-H, 0)
)

h = H/2
w = L/5
mount(x, y) = h * exp(-x^2 / 2w^2) * exp(-y^2 / 2w^2)
bottom(x, y) = -H + mount(x, y)

grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom))

@inline u_inflow(y, z, t) = 0.01

u_bcs = FieldBoundaryConditions(
    west = OpenBoundaryCondition(u_inflow),
    east = FlatExtrapolationOpenBoundaryCondition()
)

boundary_conditions = (; u=u_bcs)

model = NonhydrostaticModel(;
    grid,
    boundary_conditions,
    timestepper = :RungeKutta3,
    pressure_solver = ConjugateGradientPoissonSolver(
        grid;
        preconditioner = fft_poisson_solver(grid.underlying_grid)
    )
)

simulation = Simulation(model; Δt=0.01, stop_time=60)

function progress(sim)
    model = sim.model
    @printf(
        "iteration: %d, time: %.4f, U_max=(%.2e, %.2e, %.2e)\n",
        iteration(sim),
        time(sim),
        maximum(abs, model.velocities.u),
        maximum(abs, model.velocities.v),
        maximum(abs, model.velocities.w)
    )

    @printf(
        "              reltol=%.2e, abstol=%.2e, solver iterations: %d, residual: (mean=%.2e, abs(max)=%.2e)\n",
        model.pressure_solver.conjugate_gradient_solver.reltol,
        model.pressure_solver.conjugate_gradient_solver.abstol,
        iteration(model.pressure_solver),
        mean(model.pressure_solver.conjugate_gradient_solver.residual),
        maximum(abs, model.pressure_solver.conjugate_gradient_solver.residual)
    )
end

simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

nan_checker = NaNChecker(fields=model.velocities, erroring=true)
simulation.callbacks[:nan_checker] = Callback(nan_checker, IterationInterval(1))

run!(simulation)

Now I'm seeing that it's maxing out the solver iterations at 4096 and the residual (mean and max) is still like 5 orders of magnitude higher than tolerance.

[ Info: Initializing simulation...
iteration: 0, time: 0.0000, U_max=(0.00e+00, 0.00e+00, 0.00e+00)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 0, residual: (mean=0.00e+00, abs(max)=0.00e+00)
[ Info:     ... simulation initialization complete (109.482 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (4.840 seconds).
iteration: 1, time: 0.1000, U_max=(1.19e-02, 5.23e-03, 5.19e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 4096, residual: (mean=-1.78e-03, abs(max)=1.06e-02)
iteration: 2, time: 0.2000, U_max=(1.21e-02, 4.93e-03, 5.20e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 4096, residual: (mean=-9.30e-04, abs(max)=5.56e-03)
iteration: 3, time: 0.3000, U_max=(1.25e-02, 5.02e-03, 5.32e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 4096, residual: (mean=-4.85e-04, abs(max)=2.74e-03)
iteration: 4, time: 0.4000, U_max=(6.23e+24, 4.49e+24, 6.19e+24)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 4096, residual: (mean=-2.39e+23, abs(max)=1.51e+24)
iteration: 5, time: 0.5000, U_max=(5.48e+179, 4.19e+179, 9.91e+179)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 0, residual: (mean=-1.70e+174, abs(max)=1.13e+181)
ERROR: LoadError: time = 0.6, iteration = 6: NaN found in field u. Aborting simulation.

I tried reducing the time step from 0.1 to 0.01 but now it always blows up at iteration 1. Increasing the time step kept it blowing up after ~5 iterations.


You can test this by omitting the preconditioner.

Running with preconditioner = nothing causes it to always blow up on iteration 1 no matter the time step (after 4096 iterations).

[ Info: Initializing simulation...
iteration: 0, time: 0.0000, U_max=(0.00e+00, 0.00e+00, 0.00e+00)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 0, residual: (mean=0.00e+00, abs(max)=0.00e+00)
[ Info:     ... simulation initialization complete (99.042 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (1.896 seconds).
ERROR: LoadError: time = 0.1, iteration = 1: NaN found in field u. Aborting simulation.
ali-ramadhan commented 2 weeks ago

Can you set maxiter=10 in the solver and see if you still get NaNs?

This seems to have worked! But I'm not totally sure why.

Like with maxiter = 4096 at first the solver does not converge after 10 solver iterations. But the residual is decreasing with each simulation iteration then after ~25 simulations iterations the solver just needs 1-2 iterations.

iteration: 0, time: 0.0000, U_max=(0.00e+00, 0.00e+00, 0.00e+00)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 0, residual: (mean=0.00e+00, abs(max)=0.00e+00)
[ Info:     ... simulation initialization complete (106.623 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (69.963 ms).
iteration: 1, time: 0.1000, U_max=(1.19e-02, 5.22e-03, 5.18e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.78e-03, abs(max)=1.04e-02)
iteration: 2, time: 0.2000, U_max=(1.21e-02, 4.94e-03, 5.20e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-9.29e-04, abs(max)=5.49e-03)
iteration: 3, time: 0.3000, U_max=(1.25e-02, 5.04e-03, 5.33e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-4.83e-04, abs(max)=3.06e-03)
iteration: 4, time: 0.4000, U_max=(1.26e-02, 4.95e-03, 5.33e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-2.53e-04, abs(max)=1.43e-03)
iteration: 5, time: 0.5000, U_max=(1.27e-02, 4.98e-03, 5.36e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.32e-04, abs(max)=8.10e-04)
iteration: 6, time: 0.6000, U_max=(1.27e-02, 4.96e-03, 5.36e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-6.87e-05, abs(max)=3.99e-04)
iteration: 7, time: 0.7000, U_max=(1.27e-02, 4.97e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-3.58e-05, abs(max)=2.42e-04)
iteration: 8, time: 0.8000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.88e-05, abs(max)=1.01e-04)
iteration: 9, time: 0.9000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-9.84e-06, abs(max)=5.74e-05)
iteration: 10, time: 1.0000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-5.10e-06, abs(max)=2.59e-05)
iteration: 11, time: 1.1000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-2.64e-06, abs(max)=1.63e-05)
iteration: 12, time: 1.2000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.37e-06, abs(max)=7.12e-06)
iteration: 13, time: 1.3000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-7.14e-07, abs(max)=4.80e-06)
iteration: 14, time: 1.4000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-3.74e-07, abs(max)=2.00e-06)
iteration: 15, time: 1.5000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.95e-07, abs(max)=1.35e-06)
iteration: 16, time: 1.6000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.02e-07, abs(max)=5.43e-07)
iteration: 17, time: 1.7000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-5.32e-08, abs(max)=3.75e-07)
iteration: 18, time: 1.8000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-2.78e-08, abs(max)=1.41e-07)
iteration: 19, time: 1.9000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.45e-08, abs(max)=1.01e-07)
iteration: 20, time: 2.0000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-7.53e-09, abs(max)=3.77e-08)
iteration: 21, time: 2.1000, U_max=(1.27e-02, 4.96e-03, 5.38e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-3.88e-09, abs(max)=2.65e-08)
iteration: 22, time: 2.2000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-1.97e-09, abs(max)=1.05e-08)
iteration: 23, time: 2.3000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-9.57e-10, abs(max)=5.45e-09)
iteration: 24, time: 2.4000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 10, residual: (mean=-4.15e-10, abs(max)=2.29e-09)
iteration: 25, time: 2.5000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 2, residual: (mean=-1.25e-10, abs(max)=1.28e-09)
iteration: 26, time: 2.6000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 1, residual: (mean=3.08e-11, abs(max)=8.51e-09)
iteration: 27, time: 2.7000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 1, residual: (mean=1.14e-10, abs(max)=5.97e-09)
iteration: 28, time: 2.8000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 1, residual: (mean=1.59e-10, abs(max)=5.57e-09)
iteration: 29, time: 2.9000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 2, residual: (mean=1.84e-10, abs(max)=1.20e-09)
iteration: 30, time: 3.0000, U_max=(1.27e-02, 4.96e-03, 5.37e-03)
              reltol=1.49e-08, abstol=1.49e-08, solver iterations: 2, residual: (mean=1.96e-10, abs(max)=1.46e-09)

I thought it was due to the solver diverging, but if you look at the residual at simulation iteration 1, the residual norm is still high for all 4096 solver iterations.

ali-ramadhan commented 2 weeks ago

Solution looks as expected at 16^3 :tada:

https://github.com/user-attachments/assets/0c5b0d83-9339-486e-b013-dfd67ec23946

Full script:

using Printf
using Statistics
using Oceananigans
using Oceananigans.BoundaryConditions: FlatExtrapolationOpenBoundaryCondition
using Oceananigans.Solvers: ConjugateGradientPoissonSolver, fft_poisson_solver
using Oceananigans.Simulations: NaNChecker
using Oceananigans.Utils: prettytime

L = 100
H = 100

underlying_grid = RectilinearGrid(
    CPU(),
    Float64,
    topology = (Bounded, Bounded, Bounded),
    size = (32, 32, 32),
    x = (-L/2, L/2),
    y = (-L/2, L/2),
    z = (-H, 0)
)

h = H/2
w = L/5
mount(x, y) = h * exp(-x^2 / 2w^2) * exp(-y^2 / 2w^2)
bottom(x, y) = -H + mount(x, y)

grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom))

@inline u_inflow(y, z, t) = 0.01

u_bcs = FieldBoundaryConditions(
    west = OpenBoundaryCondition(u_inflow),
    east = FlatExtrapolationOpenBoundaryCondition()
)

boundary_conditions = (; u=u_bcs)

model = NonhydrostaticModel(;
    grid,
    boundary_conditions,
    timestepper = :RungeKutta3,
    pressure_solver = ConjugateGradientPoissonSolver(
        grid;
        preconditioner = fft_poisson_solver(grid.underlying_grid),
        maxiter = 100
    )
)

simulation = Simulation(model; Δt=0.1, stop_time=600)

function progress(sim)
    model = sim.model
    @printf(
        "iteration: %d, time: %.4f, U_max=(%.2e, %.2e, %.2e)\n",
        iteration(sim),
        time(sim),
        maximum(abs, model.velocities.u),
        maximum(abs, model.velocities.v),
        maximum(abs, model.velocities.w)
    )

    @printf(
        "              reltol=%.2e, abstol=%.2e, solver iterations: %d, residual: (mean=%.2e, abs(max)=%.2e)\n",
        model.pressure_solver.conjugate_gradient_solver.reltol,
        model.pressure_solver.conjugate_gradient_solver.abstol,
        iteration(model.pressure_solver),
        mean(model.pressure_solver.conjugate_gradient_solver.residual),
        maximum(abs, model.pressure_solver.conjugate_gradient_solver.residual)
    )
end

simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

nan_checker = NaNChecker(fields=model.velocities, erroring=true)
simulation.callbacks[:nan_checker] = Callback(nan_checker, IterationInterval(1))

simulation.output_writers[:fields] =
    JLD2OutputWriter(model, model.velocities; filename="3831.jld2", schedule=TimeInterval(10), overwrite_existing=true)

run!(simulation)

using CairoMakie

ds = FieldDataset("3831.jld2")

times = ds["u"].times
xu, yu, zu = nodes(ds["u"][1])

n = Observable(1)

fig = Figure(size=(1000, 500))

title = @lift @sprintf("time = %s", prettytime(times[$n]))

u_surface = @lift interior(ds["u"][$n], :, :, 16)
u_slice = @lift interior(ds["u"][$n], :, 8, :)

ax1 = Axis(fig[1, 1]; title = "u (surface)", xlabel="x", ylabel="y")
hm1 = heatmap!(ax1, xu, yu, u_surface, colorrange=(-0.01, 0.01), colormap=:balance)
Colorbar(fig[1, 2], hm1, label="m/s")

ax2 = Axis(fig[1, 3]; title = "u (xz-slice)", xlabel="x", ylabel="z")
hm2 = heatmap!(ax2, xu, yu, u_slice, colorrange=(-0.01, 0.01), colormap=:balance)
Colorbar(fig[1, 4], hm2, label="m/s")

fig[0, :] = Label(fig, title)

record(fig, "3831.mp4", 1:length(times), framerate=10) do i
    n[] = i
end
glwagner commented 2 weeks ago

This seems to have worked! But I'm not totally sure why.

If the FFT solver generates a stable solution but the CG solver diverges, then capping the iterations at a lowish number prevents the solution from diverging too far from the "decent" FFT solution and blowing up the simulation.

I thought it was due to the solver diverging, but if you look at the residual at simulation iteration 1, the residual norm is still high for all 4096 solver iterations.

Can you rephrase this, I don't grasp what you're trying to say. Don't we expect the residual to be large or increasing if the solver is diverging?

glwagner commented 2 weeks ago

I don't think I understand how this is ill-posed? It is over specified so will not produce physical results but I thought that without a radiating condition this should still not NaN straight away there should just be a lot of reflections from the boundaries?

By "ill-posed" I was referring to the over-specification. But if you say that we should find a solution / the problem should not blow up then I agree, "ill-posed" is the wrong word...

@ali-ramadhan this could be tested by running the problem without immersed boundaries but still in a scenario in which a non-trivial flow is generated.

Also curious whether the successful simulation you produced above also works with OpenBoundaryCondition on the right instead of FlatExtrapolationOpenBoundaryCondition

jagoosw commented 2 weeks ago

Here is a simplified version of my code: https://github.com/jagoosw/OpenBoundaries.jl/blob/main/validation/bug_mwe.jl

Was curious to have a look but I think the link might be dead?

I forgot to make the repo public, it should work now

glwagner commented 2 weeks ago

I also want to drop another optimization that can be used here --- using a Float32 grid with the preconditioner. This will speed things up a bit. Eg:

using Oceananigans.Grids: with_number_type

reduced_precision_grid = with_number_type(Float32, underlying_grid)

pressure_solver = ConjugateGradientPoissonSolver(grid;
    preconditioner = fft_poisson_solver(reduced_precision_grid),
    maxiter = 10
)

I also suggest plotting the divergence to get a handle on how the solver is working. A well-converged solution seems to have isotropically-distributed divergence errors. With looser tolerances, the divergence may be concentrated around the bathymetry (and of course with the naive FFT solver it has a thin layer adjacent to the bathymetry).

I also think it would be nice if setting maxiter=0 had a similar effect as simply using the FFT-based preconditioner. It's not the case right now, I think because we do not apply the preconditioner to the pressure initially before doing the CG iteration. It might not work (I believe I experimented briefly with that to no avail).

glwagner commented 2 weeks ago

If I restrict the iterations it doesn't NaN as fast, but generates very high velocities in a random place:

What if you restrict to 3-5 iterations?

glwagner commented 2 weeks ago

Ok I did a bit of investigation and found the following (script pasted below):

I'd like to analyze the divergence more, but it seems that it is substantially higher with FlatExtrapolationOpenBoundaryCondition. I would also like to understand the convergence rate, which seems slower than with a prescribed outflow.

From these results we see that maxiter should probably be set right now as a matter of practice. I think that even a handful of CG iterations is probably preferred to simply using the FFT solver naively with no CG iteration. There's also future work to understand the interaction of the solver with open boundary conditions.

Finally I would like to point out that there is probably little to gain, from a computational point of view, in using the FFT-based solver with 100+ CG iterations. If we are using 100+ iterations, we might as well use a faster preconditioner.

My script adapted from @ali-ramadhan's:

using Printf
using Statistics
using Oceananigans
using Oceananigans.Grids: with_number_type
using Oceananigans.BoundaryConditions: FlatExtrapolationOpenBoundaryCondition
using Oceananigans.Solvers: ConjugateGradientPoissonSolver, fft_poisson_solver
using Oceananigans.Utils: prettytime

N = 16
h, w = 50, 20
H, L = 100, 100
x = y = (-L/2, L/2)
z = (-H, 0)

grid = RectilinearGrid(size=(N, N, N); x, y, z, halo=(2, 2, 2), topology=(Bounded, Periodic, Bounded))

mount(x, y=0) = h * exp(-(x^2 + y^2) / 2w^2)
bottom(x, y=0) = -H + mount(x, y)
grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom))

prescribed_flow = OpenBoundaryCondition(0.01)
extrapolation_bc = FlatExtrapolationOpenBoundaryCondition()
u_bcs = FieldBoundaryConditions(west = prescribed_flow,
                                east = extrapolation_bc)
                                #east = prescribed_flow)

boundary_conditions = (; u=u_bcs)
reduced_precision_grid = with_number_type(Float32, grid.underlying_grid)
preconditioner = fft_poisson_solver(reduced_precision_grid)
pressure_solver = ConjugateGradientPoissonSolver(grid; preconditioner, maxiter=20)

model = NonhydrostaticModel(; grid, boundary_conditions, pressure_solver)
simulation = Simulation(model; Δt=0.1, stop_iteration=1000)
conjure_time_step_wizard!(simulation, cfl=0.5)

u, v, w = model.velocities
δ = ∂x(u) + ∂y(v) + ∂z(w)

function progress(sim)
    model = sim.model
    u, v, w = model.velocities
    @printf("Iter: %d, time: %.1f, Δt: %.2e, max|δ|: %.2e",
            iteration(sim), time(sim), sim.Δt, maximum(abs, δ))

    r = model.pressure_solver.conjugate_gradient_solver.residual
    @printf(", solver iterations: %d, max|r|: %.2e\n",
            iteration(model.pressure_solver), maximum(abs, r))
end

add_callback!(simulation, progress)

simulation.output_writers[:fields] =
    JLD2OutputWriter(model, model.velocities; filename="3831.jld2", schedule=IterationInterval(10), overwrite_existing=true)

run!(simulation)

using GLMakie

ds = FieldDataset("3831.jld2")
fig = Figure(size=(1000, 500))

n = Observable(1)
times = ds["u"].times
title = @lift @sprintf("time = %s", prettytime(times[$n]))

Nx, Ny, Nz = size(grid)
j = round(Int, Ny/2)
k = round(Int, Nz/2)
u_surface = @lift view(ds["u"][$n], :, :, k)
u_slice = @lift view(ds["u"][$n], :, j, :)

ax1 = Axis(fig[1, 1]; title = "u (xy)", xlabel="x", ylabel="y")
hm1 = heatmap!(ax1, u_surface, colorrange=(-0.01, 0.01), colormap=:balance)
Colorbar(fig[1, 2], hm1, label="m/s")

ax2 = Axis(fig[1, 3]; title = "u (xz)", xlabel="x", ylabel="z")
hm2 = heatmap!(ax2, u_slice, colorrange=(-0.01, 0.01), colormap=:balance)
Colorbar(fig[1, 4], hm2, label="m/s")

fig[0, :] = Label(fig, title)

record(fig, "3831.mp4", 1:length(times), framerate=10) do i
    n[] = i
end
Yixiao-Zhang commented 2 weeks ago

I encountered similar issues months ago when using the CG Poisson solver. I fixed my code by slightly modifying the algorithm (https://github.com/Yixiao-Zhang/Oceananigans.jl/commit/c7983b8002b91cd5939018a7c999eae77e2105ac/). The reason is that the CG method can be numerically unstably for positive (or negative) semi-definite matrices.

These changes have not been merged into the main repository, and we are stilling discussing the math (#3802).

glwagner commented 2 weeks ago

I encountered similar issues months ago when using the CG Poisson solver. I fixed my code by slightly modifying the algorithm (Yixiao-Zhang@c7983b8/\). The reason is that the CG method can be numerically unstably for positive (or negative) semi-definite matrices.

Did you encounter the same issue whereby the simulation would immediately NaN (rather than intermittently)? I'd be curious to see your setup in order to have more than one working example to test with.

I've made a little progress with regularizing the solver (vs the simpler technique of setting the mean directly). The issue is that while regularization does seem to be able to stabilize the solver, we still experience extremely slow convergence with FlatExtrapolationOpenBoundaryCondition to the point that it's basically unusable practically, I think (at least if the same convergence rate is experienced at higher resolution / with better physics).

It's unclear whether these issues are generic to the solver or specific to this boundary condition, so having another unstable case would be useful.

Yixiao-Zhang commented 2 weeks ago

Did you encounter the same issue whereby the simulation would immediately NaN (rather than intermittently)? I'd be curious to see your setup in order to have more than one working example to test with.

In my original simulations, I encountered NaNs usually after several hours in wall time, so it was bad for debugging. Luckily, I found a reliable way to get NaNs immediately is to set both reltol and abstol to zero and maxiter to more than a thousand. The purpose is to test the numerical stability of the iteration method.

My theory is that the current CG iteration solver is numerically unstable. The residual usually decreases quickly for the first several iterations but may increase after that. That is why using a larger reltol or abstol or a lower maxiter makes it more "stable".

glwagner commented 2 weeks ago

Luckily, I found a reliable way to get NaNs immediately is to set both reltol and abstol to zero and maxiter to more than a thousand. The purpose is to test the numerical stability of the iteration method.

I think this makes sense. When the residual is reduced to near machine precision then I think this is when the present instability is exposed, which occurs when the search direction is essentially a constant.

I wonder if its possible that the instability was observed in the original simulations when, for some random reason of the flow configuration, the CG solution converged especially fast (thereby exposing the instability).

glwagner commented 2 weeks ago

After some of the work on #3848 I have suspicions that FlatExtrapolationOpenBoundaryCondition generates a difficult Poisson problem with strong sources / divergence along the boundary. But I also think that the CG solver needs some work. For the time being I definitely recommend setting maxiter=10 in production simulations.

Note that @jm-c made a comment that the MITgcm CG solver also has occasionally issues for "very smooth" problems. I think that issue is related to what @Yixiao-Zhang and @xkykai observed (possibly, because the FFT preconditioner is so effective, we observe this problem more often than MITgcm users do). On the other hand, the difficulty solving problems with FlatExtrapolationOpenBoundaryCondition seems to be somewhat distinct.

ali-ramadhan commented 2 weeks ago

I have suspicions that FlatExtrapolationOpenBoundaryCondition generates a difficult Poisson problem with strong sources / divergence along the boundary.

Can I understand it like this: with a fluid at rest the shock of enabling open boundaries at iteration 0, which has effectively zero timescale, creates a difficult Poisson problem? Once the simulation is chugging along, it seems like the Poisson problem is not as difficult.

I'll close this issue since the original MWE works now. It's just a matter of setting maxiter. And maybe we can continue discussing in a more general issue. But please feel free to reopen if you feel otherwise.

glwagner commented 2 weeks ago

Once the simulation is chugging along, it seems like the Poisson problem is not as difficult.

It does seem like the solution becomes more accurate in time (but unfortunately, the CG solver does not start to converge more quickly...)

The issue is interestingly related to this comment in MITgcm (which may be wrong by the way, but nevertheless reflects some observation about the CG solver):

https://github.com/MITgcm/MITgcm/blob/c7a06a2c29b2186c216f05f729b330d10e23c8be/model/src/cg3d.F#L86C1-L88C81

For this MWE, the RHS is 0 with OpenBoundaryCondition(0.01) on both sides, but not with FlatExtrapolationOpenBoundaryCondition(). @jm-c remarked that the RHS is also not zero for the Poisson equation beneath a free surface (which, in a z-coordinate, also essentially has an open boundary since $w(z=0) \ne 0$).

I'll close this issue since the original MWE works now. It's just a matter of setting maxiter.

I think this comment is true in a practical sense.

Possibly we do need another issue about the non-convergence of the CG solver. I think there is scope for someone to put in some effort to document this (eg come up with tools to evaluate convergence) and present a few nice MWEs (I think in addition to the present one with FlatExtrapolationOpenBoundaryCondition(), it would be nice to have a demonstration with impenetrable boundary conditions as well).

The way that the Poisson solver and open boundary conditions are intertwined is also a good reason to implement new open boundary condition matching schemes directly in Oceananigans @jagoosw .

Note that we are also thinking about how to develop an interface to Krylov.jl in #3803 which may open up additional avenues for progress on this.