CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License
993 stars 195 forks source link

Reproducibility issue in LES results #2766

Closed fspereira1 closed 2 years ago

fspereira1 commented 2 years ago

Dear Oceananigans team,

We are trying to use Oceananigans to create reference LES solutions for multiple canonical flows. We use a cubical domain and random perturbations to trigger the flow. During our validation tests, we noticed that we could not reproduce the results, i.e., running the same .jl script (same initial flow conditions) leads to different averaged solutions (see attached picture). We ran more than 16 simulations and never obtained the same solution. We tried to set the seed of the random perturbations constant, but this did not solve the problem. Do you observe this problem, and could you help us run reproducible simulations so other users can obtain the same solutions? We attached the .jl file we are using to define the simulations.

Best regards, Filipe Pereira Luke van Roekel Amrapalli Garanaik Brodie Pearson

tec_ww_time_c1

c16_128_128m(1).jl.zip

simone-silvestri commented 2 years ago

I will try doing some digging. A first thing I notice (not sure that this will solve the problem) is that, looking at your script, it seems you are using a pretty outdated version of Oceananigans (something like 0.71 or lower). I would try running with an updated versions and see if the problem persists

fspereira1 commented 2 years ago

Sure, but how can I do that? Is there something on the script to change the Oceananigans version, or did you see that through the options in the script? My apologies if this question is basic, but I am a new user.

simone-silvestri commented 2 years ago

No problem, you have a line in the script that would not work with the current Oceananigans version: the keyword arguments of LinearEquationOfState (α, β) were renamed to thermal_expansion and haline_contraction in Oceananigans 0.71.5.

To update Oceananigans, you can type

import Pkg
Pkg.update("Oceananigans")

this should give you version 0.75.3 If you want to try with the latest version (0.77.5) you can do

Pkg.add(Pkg.PackageSpec(name = "Oceananigans", version = "0.77.5"))

If you are using a GPU, I would remain on 0.75.3 at the moment because in 0.77.5 there are problems with GradientBoundaryConditions on the GPU. Probably it will not fix your issue, but a lot has changed internally so it is worth the try (remember to change those keyword arguments before running, otherwise you will get an error)

fspereira1 commented 2 years ago

[17:11]fspereira@ch-fe2[/lustre/scratch5/fspereira/OCEANANIGANS/test/case1]# julia () | Documentation: https://docs.julialang.org () | () () | | |_ | Type "?" for help, "]?" for Pkg help. | | | | | | |/ ` | | | | || | | | (| | | Version 1.6.7 (2022-07-19) / |_'|||_'_| | Official https://julialang.org/ release |/ |

julia> import Pkg

julia> Pkg.status("Oceananigans") Status ~/.julia/environments/v1.6/Project.toml [9e8cae18] Oceananigans v0.68.7

julia> Pkg.update("Oceananigans") Updating registry at ~/.julia/registries/General No Changes to ~/.julia/environments/v1.6/Project.toml No Changes to ~/.julia/environments/v1.6/Manifest.toml

julia> Pkg.status("Oceananigans") Status ~/.julia/environments/v1.6/Project.toml [9e8cae18] Oceananigans v0.68.7

It does not update. Do I need a different version of Julia? If I remember correctly, the website says that 1.6.7 is ok

simone-silvestri commented 2 years ago

It should be ok with Julia 1.6, try removing it with

Pkg.rm("Oceananigans")

if that doesn't work you can always force a version with

Pkg.add(Pkg.PackageSpec(name = "Oceananigans", version = "0.77.5"))
amrapallig commented 2 years ago

Thank you @simone-silvestri. We are using LESbrary.jl for calculating turbulent statistics as shown in the script (lines 288-300) shared by @fspereira1. Even with the latest version of Oceananigans.jl v0.76.1, while adding the LESbrary package, (Pkg.add(url="https://github.com/CliMA/LESbrary.jl.git")), the Oceananigan.jl version automatically changes to v0.68.7 that is stable with the LESbrary package. To run our script or any example script with v0.68.7, we have to change thermal_expansion and haline_contraction back to $\alpha$ and $\beta$. For a similar reason filename and filepath are also switched in the script.

Is there any other package for the estimation of turbulent statistics (second and third-order statistics along with sub-grid-scale fluxes) with the latest version of Oceananigans.jl instead of calculating within the script (commented lines 260-284)?

simone-silvestri commented 2 years ago

Ah I see. Well from what I see everything in the TurbulentStatistic.jl module is compatible with new versions of Oceananigans (except maybe GPU usage). Since that is what you are using, you can maybe use it locally?

This is just a quick fix to try out the new Oceananigans. Also, to try out if the problem persists it is enough to test some simple second-order moments. The ones you have in your script should do the job

u, v, w = model.velocities
t = model.tracers.T

U = Average(u, dims=(1, 2)) 
V = Average(v, dims=(1, 2))
T = Average(t, dims=(1, 2))
wu = Average(w * u, dims=(1, 2))
wv = Average(w * v, dims=(1, 2))
uu = Average(u * u, dims=(1, 2))
vv = Average(v * v, dims=(1, 2))
ww = Average(w * w, dims=(1, 2))
www = Average(w * w * w, dims=(1, 2))
wT = Average(w * t, dims=(1, 2))
uv = Average(u * v, dims=(1, 2))
uT = Average(u * t, dims=(1, 2))
vT = Average(v * t, dims=(1, 2))
TT = Average(t * t, dims=(1, 2))
fspereira1 commented 2 years ago

Hi All,

I rerun the simulations using the newest version of the code,

julia> [17:03]fspereira@ch-fe1[/lustre/scratch5/fspereira/OCEANANIGANS/test/case5]# julia () | Documentation: https://docs.julialang.org () | () () | | |_ | Type "?" for help, "]?" for Pkg help. | | | | | | |/ ` | | | | || | | | (| | | Version 1.6.7 (2022-07-19) / |_'|||_'_| | Official https://julialang.org/ release |/ |

julia> import Pkg

julia> Pkg.status("Oceananigans") Status ~/.julia/environments/v1.6/Project.toml [9e8cae18] Oceananigans v0.77.5

and a script based on the one available on oceananigans webpage (I only changed the grid size, constant, and set the random seed. I also tried without these changes):

https://github.com/CliMA/Oceananigans.jl/blob/main/examples/ocean_wind_mixing_and_convection.jl

Unfortunately, the new code/script led to the same reproducibility problem. I ran 4 simulations using the same script (attached) and obtained 4 different average ww profiles.

tec_ww_time_c1

c16_128_128m.jl.zip

Any ideas or suggestions?

simone-silvestri commented 2 years ago

Thanks for the hassle of rerunning on the new version. Are the initial conditions exactly the same? (I guess yes since you fix the random seed, but it's good to have the certainty)

amrapallig commented 2 years ago

@simone-silvestri, yes, the initial conditions are exactly the same as in the script.

glwagner commented 2 years ago

I suspect the issue is your use of function-based boundary conditions. If you initialize with a function, we launch a kernel that computes the function at every grid cell. However, I don't think that the ordering of the loop is guaranteed to be deterministic. If so, this non-deterministic loop ordering is a feature of KernelAbstractions. Whether or not this is a CPU vs GPU issue, I'm not sure. But I don't think that we get deterministic thread ordering on the GPU without extra effort.

To obtain determininstic initial conditions, you could try using arrays to set the initial conditions instead. Your script contains these lines:

rng = MersenneTwister(1234)
Random.seed!(rng, 1414)
Ξ(z) = randn(rng) * z / model.grid.Lz * (1 + z / model.grid.Lz) 
Tᵢ(x, y, z) = 20 + dTdz * z + dTdz * model.grid.Lz * 1e-6 * Ξ(z)
uᵢ(x, y, z) = sqrt(abs(Qᵘ)) * 1e-3 * Ξ(z)
set!(model, u=uᵢ, w=uᵢ, T=Tᵢ, S=35)

You can try replacing these lines with something like

Random.seed!(1414)

T = model.tracers.T
u, v, w = model.velocities

x, y, z = nodes(T, reshape=true)
Lz = model.grid.Lz

shape = @. z / Lz * (1 + z / Lz)
ΞT = randn(size(T)...) *. shape
Ξu = randn(size(u)...) *. shape
Ξw = randn(size(w)...) *. shape

Tᵢ = @. 20 + dTdz * z + dTdz * Lz * 1e-6 * ΞT
uᵢ = @. sqrt(abs(Qᵘ)) * 1e-3 * Ξu
wᵢ = @. sqrt(abs(Qᵘ)) * 1e-3 * Ξw

set!(model, u=uᵢ, w=wᵢ, T=Tᵢ, S=35)

I'd be curious to know if this works.

Here are a few more tips and best practices for raising issues here:

fspereira1 commented 2 years ago

Thank you for all your comments. I will try those lines Just one question. Is the sintax of

ΞT = randn(size(T)...) *. shape

correct? I am getting this error message:

Warning: No xauth data; using fake authentication data for X11 forwarding. [NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf' [NVBLAS] Cannot open default config file 'nvblas.conf' [NVBLAS] Config parsed [NVBLAS] CPU Blas library need to be provided ┌ Warning: You appear to be using MPI.jl with the default MPI binary on a cluster. │ We recommend using the system-provided MPI, see the Configuration section of the MPI.jl docs. └ @ MPI ~/.julia/packages/MPI/08SPr/deps/deps.jl:15 [ Info: Oceananigans will use 16 threads ERROR: LoadError: syntax: invalid identifier name "." Stacktrace: [1] top-level scope @ /lustre/scratch5/.mdt0/fspereira/OCEANANIGANS/test/case09/c16_128_128m.jl:197 in expression starting at /lustre/scratch5/.mdt0/fspereira/OCEANANIGANS/test/case09/c16_128_128m.jl:197

Line 197 corresponds to the line above. I removed the *.shape and the simulations are now running. Is that ok?

glwagner commented 2 years ago

Oh sorry that was a typo (incorrect Julia syntax). It should be

ΞT = randn(size(T)...) .* shape
glwagner commented 2 years ago

You can probably also omit the shape component. That just reduces the noise to zero at the top and the bottom.

fspereira1 commented 2 years ago

Thank you. I will try it and let you know if it fixes the reproducibility problem.

fspereira1 commented 2 years ago

There is still a problem. This time with the vector size:

┌ Warning: You appear to be using MPI.jl with the default MPI binary on a cluster. │ We recommend using the system-provided MPI, see the Configuration section of the MPI.jl docs. └ @ MPI ~/.julia/packages/MPI/08SPr/deps/deps.jl:15 [NVBLAS] No Gpu available [NVBLAS] NVBLAS_CONFIG_FILE environment variable is NOT set : relying on default config filename 'nvblas.conf' [NVBLAS] Cannot open default config file 'nvblas.conf' [NVBLAS] Config parsed [NVBLAS] CPU Blas library need to be provided [ Info: Oceananigans will use 16 threads ERROR: LoadError: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 129 and 128") Stacktrace: [1] _bcs1 @ ./broadcast.jl:501 [inlined] [2] _bcs (repeats 3 times) @ ./broadcast.jl:495 [inlined] [3] broadcast_shape @ ./broadcast.jl:489 [inlined] [4] combine_axes @ ./broadcast.jl:484 [inlined] [5] instantiate @ ./broadcast.jl:266 [inlined] [6] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{3}, Nothing, typeof(*), Tuple{Array{Float64, 3}, Array{Float64, 3}}}) @ Base.Broadcast ./broadcast.jl:883 [7] top-level scope @ /lustre/scratch5/.mdt0/fspereira/OCEANANIGANS/test/case13/c16_128_128m.jl:200

Also, shouldn't we set v (last line below, I added the commented v's):

`Random.seed!(1414)

T = model.tracers.T u, v, w = model.velocities

x, y, z = nodes(T, reshape=true) Lz = model.grid.Lz

shape = @. z / Lz (1 + z / Lz) ΞT = randn(size(T)...) . shape Ξu = randn(size(u)...) .* shape

Ξv = randn(size(v)...) .* shape

Ξw = randn(size(w)...) .* shape

Tᵢ = @. 20 + dTdz z + dTdz Lz 1e-6 ΞT uᵢ = @. sqrt(abs(Qᵘ)) 1e-3 Ξu

vᵢ = @. sqrt(abs(Qᵘ)) 1e-3 Ξv

wᵢ = @. sqrt(abs(Qᵘ)) 1e-3 Ξw

set!(model, u=uᵢ, v=vᵢ, w=wᵢ, T=Tᵢ, S=35)

set!(model, u=uᵢ, w=wᵢ, T=Tᵢ, S=35)`

glwagner commented 2 years ago

It's probably with w; I forgot that w has a different size in the vertical direction. You can try changing the w initial condition with code like

xw, yw, zw = nodes(w, reshape=true)
wshape = @. zw / Lz * (1 + zw / Lz)
Ξw = randn(size(w)...) .* wshape

I'm not sure if this is where the error comes from. The stack trace / error message that you posted will tell you the specific line that is generating the error. Identifying the line that generates the error can be useful for debugging.

If you like, you can generate a minimal example, which I can then try to run to reproduce a bug (and also to debug my own code suggestions).

Also, shouldn't we set v (last line below, I added the commented v's):

You can. This is not a question about reproducibility though --- it depends on your application. Give it a shot and see how it changes your results.

Note that there will be non-zero v in the initial condition anyways, because it has to be projected onto an incompressible velocity field (which is not guaranteed by the random data we are using).

Also, please format your code with triple backticks:

https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code

You can add syntax annotation too appropriate for Julia.

fspereira1 commented 2 years ago

Thank you very much for your suggestions. These changes helped a lot, but the problem still remains. I ran 8 simulations until t=4h. 6 lead to similar results, and 2 to slightly different results. The last two led to equal results. Weird, no? Any ideas or suggestions?

Random.seed!(1414)

T = model.tracers.T
u, v, w = model.velocities

x, y, z = nodes(T, reshape=true)
Lz = model.grid.Lz

xw, yw, zw = nodes(w, reshape=true)

wshape = @. zw / Lz * (1 + zw / Lz)
shape = @. z / Lz * (1 + z / Lz)

ΞT = randn(size(T)...) .* shape
Ξu = randn(size(u)...) .* shape
Ξv = randn(size(v)...) .* shape
Ξw = randn(size(w)...) .* wshape

Tᵢ = @. 20 + dTdz * z + dTdz * Lz * 1e-6 * ΞT
uᵢ = @. sqrt(abs(Qᵘ)) * 1e-3 * Ξu
vᵢ = @. sqrt(abs(Qᵘ)) * 1e-3 * Ξv
wᵢ = @. sqrt(abs(Qᵘ)) * 1e-3 * Ξw

set!(model, u=uᵢ, w=wᵢ, T=Tᵢ, S=35)

tec_dww_time_c1 tec_wu_time_c1 tec_ww_time_c1 tec_www_time_c1

glwagner commented 2 years ago

Can you verify that the initial conditions are identical, before you run the simulations?

fspereira1 commented 2 years ago

My apologies for the late reply. Yes, the initial conditions seem identical:

u=0 v=0 w=0 T diff 0 (attached - 6 cases)

julia_IC.pdf

glwagner commented 2 years ago

Can you verify that the initial state of the model, after calling set!(model, ...), but before calling run!(simulation), are identical between the realizations? (Not just by reading the code.)

fspereira1 commented 2 years ago

Thank you for your answer. I am not sure if I understood your comment. The tests I did were obtained after set!(model,..), and before run. I wrote the T, u, v, and w fields. They were all identical Is this what you asked me to do?

glwagner commented 2 years ago

Okay, apologies. I just didn't quite understand what you meant when you said they seem identical. Typically we would just write something like T1 == T2, which will return true or false, or equivalently something like all(T1 .== T2). Another test is to use isapprox (also written ) as in all(T1 .≈ T2).

Here's a bit more background on the reproducibility tests we currently have:

We have "regression tests" that test to ensure that output from a certain simulation remains identical across PRs, including tests that involve LES closures. These tests involve ~10 time steps. We conclude that results are "identical" when every grid point is within sqrt(eps(T)), where T is the floating point type (eg Float64 or Float32), for example:

https://github.com/CliMA/Oceananigans.jl/blob/fc84215f76661e9f1cfb103dc18f86442cec9d89/test/regression_tests/hydrostatic_free_turbulence_regression_test.jl#L112

Many of our other tests also implicitly rely on reproducibility.

I think, therefore, that we do have reproduciblity in many cases. However, it is quite possible that your case exposes some particular feature that leads to a loss of reprodicibility. I think perhaps the next step in order to make progress is to code up a "minimal working example" (often called an MWE), which involves relentlessly simplifying the examle until we isolate the essential complication that leads to a failure of the test. With that knowledge in hand, we can dig deeper to find the underlying cause (and hopefully fix it). Often, the process of simplying a script in order to isolate the MWE also produces some insight about the issue (and potentially about the test).

fspereira1 commented 2 years ago

My bad. I was not clear. Yes, T1 - T2 = 0

glwagner commented 2 years ago

As pointed out by @kburns, if you're interested in bit reproducibility you may also need to set the FFTW plan (by default, FFTW is not reproducible even on identical architectures). I'm not sure this is your issue (I have doubts...) but if you want to be thorough you may want to check this. To set the plan you have to build the pressure solver manually with something like:

using Oceananigans.Solvers: FFTBasedPoissonSolver
using FFTW

pressure_solver = FFTBasedPoissonSolver(grid, planner_flag=FFTW.ESTIMATE)
model = NonhydrostaticModel(; grid, pressure_solver, other_kwargs...)

I'd be surprised if "round-off errors" accumulate enough to cause the differences you're seeing (but it does seem at least possible, especially for very long runs at relatively coarser resolutions where slight differences in the eddy diffusivity might lead to slightly different turbulent trajectories). Hope that helps.

fspereira1 commented 2 years ago

Thank you for the suggestion @glwagner

I installed the FFT, and I can

using Oceananigans.Solvers: FFTBasedPoissonSolver using FFTW

but when I try

pressure_solver = FFTBasedPoissonSolver(grid, planner_flag=FFTW.ESTIMATE)

I get

julia> pressure_solver = FFTBasedPoissonSolver(grid, planner_flag=FFTW.ESTIMATE) ERROR: MethodError: no method matching FFTBasedPoissonSolver(::RectilinearGrid{Float64, Periodic, Periodic, Bounded, Float64, Float64, OffsetArrays.OffsetVector{Float64, Vector{Float64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}}}, OffsetArrays.OffsetVector{Float64, Vector{Float64}}, CPU}; planner_flag=0x00000040) Closest candidates are: FFTBasedPoissonSolver(::Any) at /users/fspereira/.julia/packages/Oceananigans/7G5bN/src/Solvers/fft_based_poisson_solver.jl:50 got unsupported keyword argument "planner_flag" FFTBasedPoissonSolver(::Any, ::Any) at /users/fspereira/.julia/packages/Oceananigans/7G5bN/src/Solvers/fft_based_poisson_solver.jl:50 got unsupported keyword argument "planner_flag" FFTBasedPoissonSolver(::G, ::Λ, ::S, ::B, ::T) where {G, Λ, S, B, T} at /users/fspereira/.julia/packages/Oceananigans/7G5bN/src/Solvers/fft_based_poisson_solver.jl:6 got unsupported keyword argument "planner_flag" Stacktrace: [1] top-level scope @ REPL[42]:1 [2] top-level scope @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

What should I use?

glwagner commented 2 years ago

Ah sorry. I think you should use

pressure_solver = FFTBasedPoissonSolver(grid, FFTW.ESTIMATE)

PS try triple backticks (``) rather than single backticks () for formatting blocks of code / error messages.

fspereira1 commented 2 years ago

Thanks, that worked. However, now I get this


Iteration: 0010, time: 1.833 minutes, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 1.577 minutes
Iteration: 0020, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 1.674 minutes
Iteration: 0030, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 1.766 minutes
Iteration: 0040, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 1.841 minutes
Iteration: 0050, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 1.942 minutes
Iteration: 0060, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 2.018 minutes
Iteration: 0070, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 2.111 minutes
Iteration: 0080, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 2.186 minutes
Iteration: 0090, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 2.261 minutes
Iteration: 0100, time: NaN years, Δt: NaN years, max(|w|) = NaN ms⁻¹, wall time: 2.349 minutes```
glwagner commented 2 years ago

Oh, is your grid vertically stretched? In that case you need to use

pressure_solver = FourierTridiagonalPoissonSolver(grid, FFTW.ESTIMATE)
fspereira1 commented 2 years ago

Yes, it is. I ran ten simulations (same .jl script) setting Tmax=4h, and I got the same profiles. It seems the reproducibility problem is solved. I'll run the simulations for four days (our goal) and let you know what happens. Thank you for your help. If this works, I can share the final script.

tec_dww_time_c1 tec_ww_time_c1 tec_www_time_c1

glwagner commented 2 years ago

Great! (Thanks @kburns)

glwagner commented 2 years ago

Feel free to close this issue @fspereira1 if you believe it is resolved.

fspereira1 commented 2 years ago

@glwagner and @kburns, thank you. I will confirm if the simulations running for four days work and then close it.

Since it is a different topic, I created a new issue https://github.com/CliMA/Oceananigans.jl/issues/2787 but I wonder how I could run this case on parallel (512x512x512 grid). I noticed you just replied.

glwagner commented 2 years ago

It might make sense to convert this to a discussion and change the title to "Building reproducible LES setups".

The info here could be useful for future Oceananigans users that would like to build reproducible setups (thanks for your efforts in this department @fspereira1).

Of note, the lessons learned here are mostly about achieving reproducibility with Julia and FFTW (the lessons are not Oceananigans specific, and are applicable to other Julia applications).

And to summarize the important points:

fspereira1 commented 2 years ago

@glwagner This Makes sense to me. And thank you for your help. Also, I want to give you an update. I ran ten simulations with the same .jl file with T=4days. I got the same profiles (ww, www, T, etc). I can share the .jl file, it might useful to other users. What would be the best way?

glwagner commented 2 years ago

Perhaps start a git repo and post a link to it? It's best to include the julia environment you're using with the file (otherwise it will go stale as Oceananigans evolves).

If you want to just post the file then you can use a gist, or copy/paste the code here if its short.