WaterLily-jl / WaterLily.jl

Fast and simple fluid simulator in Julia

Fix solver! when T=Float32 #134

Closed weymouth closed 2 months ago

weymouth commented 3 months ago

I have found a few dynamic body cases (including the new Optim example) where solver! diverges (or nearly diverges) when using T=Float32 (or Dual{Float32}). I think this is because the inner product within PCG is not super accurate when using Float32, especially on a deep MG level with small residuals, so you can end up adding noise instead of making progress.
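
To make the precision issue concrete, here is a minimal standalone illustration (not WaterLily code; `naive_dot` is just a throwaway helper): accumulating about a million Float32 products in Float32 already loses several digits compared with a Float64 accumulation of the same data.

```julia
# Illustration only: dot product accumulated in a chosen type T.
function naive_dot(a, b, ::Type{T}) where T
    s = zero(T)
    @inbounds for i in eachindex(a, b)
        s += T(a[i]) * T(b[i])
    end
    return s
end

r = rand(Float32, 10^6)            # stand-in for a residual field
s32 = naive_dot(r, r, Float32)     # accumulate in Float32
s64 = naive_dot(r, r, Float64)     # accumulate in Float64 (reference)
println(abs(s32 - s64) / s64)      # relative difference far above eps(Float64)
```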

A good solution would be to improve the inner product function. That's not trivial, but it would be a good thing to look into.

What I've done instead is stabilize the solver a little by checking whether the preconditioned residual r'*D^{-1}*r is increasing instead of decreasing (as it should). That is a good indication that something's going wrong and that you should stop working at this level. I've set a tolerance of a 10% increase for GMG smoothing, and 100% for pure PCG solving.
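
Schematically, the check amounts to something like the sketch below, with placeholder names (`guarded_iterate!`, `step!`, `residual`) rather than the actual solver! code:

```julia
# Sketch of the safeguard (placeholder names, not the actual solver! code):
# stop at this level once the preconditioned residual r'*D^{-1}*r grows by
# more than `tol` relative to the previous iteration.
function guarded_iterate!(step!, residual, x; itmx=32, tol=0.1)
    rho = residual(x)                       # r'*D^{-1}*r before the first step
    for _ in 1:itmx
        step!(x)                            # one smoothing/PCG iteration
        rho_new = residual(x)
        rho_new > (1 + tol) * rho && break  # residual increasing: bail out
        rho = rho_new
    end
    return x
end
# tol = 0.1 corresponds to the 10% bound used during GMG smoothing,
# tol = 1.0 to the 100% bound used for the pure PCG solve.
```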

I also created an extra parameter ml.maxlevels to limit the number of levels in a Vcycle, avoiding recreating levels when incrementing down.
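
For the level cap, the idea is just to bound how deep the V-cycle recurses, along the lines of this illustration (stub operators and a generic `vcycle!`, not the actual WaterLily code):

```julia
# Illustration only (stub operators, not the actual WaterLily code): a V-cycle
# whose descent is capped at `maxlevels`, so deeper levels are never visited
# (or recreated) even if more could be built.
smooth!(level)            = nothing   # placeholder: relax on this level
restrict!(coarse, fine)   = nothing   # placeholder: residual to coarser grid
prolongate!(fine, coarse) = nothing   # placeholder: correction to finer grid

function vcycle!(levels, l=1; maxlevels=length(levels))
    smooth!(levels[l])                        # pre-smoothing on level l
    if l < min(length(levels), maxlevels)     # stop descending at maxlevels
        restrict!(levels[l+1], levels[l])
        vcycle!(levels, l+1; maxlevels)
        prolongate!(levels[l], levels[l+1])
    end
    smooth!(levels[l])                        # post-smoothing
end
```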

weymouth commented 3 months ago

I would also like to update the optimization example in this PR, but I think the updated solver is ready for review.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.29%. Comparing base (36dc237) to head (d451d2b).

:exclamation: Current head d451d2b differs from pull request most recent head 924ed2c

Please upload reports for the commit 924ed2c to get more accurate results.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master     #134   +/-   ##
=======================================
  Coverage   94.29%   94.29%
=======================================
  Files          12       12
  Lines         526      526
=======================================
  Hits          496      496
  Misses         30       30
```


b-fg commented 3 months ago

I had a look at it and all seems fine to me, but maybe @marinlauber can also double-check it. About improving the inner product, what are you thinking of? Isn't the problem just that the Float32 noise is at the level of the divergence-free criterion? Wouldn't casting to Float64 help here?

weymouth commented 3 months ago

We can't cast, because we don't have a spare working array for this (like df in the pressure force); PCG already uses every scalar field in the code. We could add another one for this purpose and see if it helps...

b-fg commented 3 months ago

I meant to cast within a custom kernel. I will play with it and see if that works.

weymouth commented 3 months ago

Give it a try.


b-fg commented 3 months ago

I made a test comparing a custom dot product based on mapreduce against LinearAlgebra.⋅. It performs well on the GPU, but on the CPU the allocations scale with the array size N, and this is not coming from the type promotion.

```julia
using CUDA
using LinearAlgebra: ⋅

# Convert each product to Float64 so the reduction accumulates in Float64,
# even though the inputs stay Float32.
function dot_mapreduce64(a, b)
    mapreduce(+, a, b) do x, y
        promote_type(Float64, eltype(a))(x*y)
    end
end

N = (100, 100, 100)
v = rand(Float32, N)
v_gpu = v |> CuArray
println(v ⋅ v)
println(v_gpu ⋅ v_gpu)
println(dot_mapreduce64(v, v))
println(dot_mapreduce64(v_gpu, v_gpu))

@time v ⋅ v
CUDA.@time v_gpu ⋅ v_gpu
@time dot_mapreduce64(v, v)
CUDA.@time dot_mapreduce64(v_gpu, v_gpu)
```

Output:

```
333715.56
333717.56
333717.5831831272
333717.5831831272

0.000246 seconds (1 allocation: 16 bytes)
0.000065 seconds (15 CPU allocations: 240 bytes)
0.001476 seconds (3 allocations: 7.629 MiB)
0.000368 seconds (163 CPU allocations: 5.016 KiB) (2 GPU allocations: 200 bytes, 10.66% memmgmt time)
```

weymouth commented 3 months ago

This can't be a problem unique to us. What is the difference between the LinearAlgebra routine and ours?

b-fg commented 3 months ago

For CPU arrays, LinearAlgebra has a simple dot method: https://github.com/JuliaLang/julia/blob/4e1fd72042133f0dfdcc3f4c3ce5fa74f76bb9c5/stdlib/LinearAlgebra/src/generic.jl#L957. But for GPU arrays, it just provides an interface to CUBLAS.

As a last resort, we could implement a @simd version of this on the CPU and extend it in the CUDA extension file with the proposed method. I have also asked on Slack to see if someone can explain the mapreduce allocations on the CPU.

b-fg commented 3 months ago

This does not work on the GPU for me; it rightfully complains about indexing.

b-fg commented 3 months ago

I suggest including the following in utils.jl

```julia
function ⋅(a, b)
    @assert size(a) == size(b) "`size(a)` and `size(b)` are not matching."
    s = zero(promote_type(Float64,eltype(a)))  # Float64 accumulator, even for Float32 inputs
    @simd for i ∈ eachindex(a)
        @inbounds s += a[i]*b[i]
    end
    s
end
```

which is as fast as LinearAlgebra.⋅ and non-allocating (it only carries a Float64 accumulator). Then we can reimplement it in the CUDA and AMD extensions with

```julia
function ⋅(a, b)
    @assert size(a) == size(b) "`size(a)` and `size(b)` are not matching."
    mapreduce(+, a, b) do x, y
        promote_type(Float64,eltype(a))(x*y)  # convert each product, reduce in Float64
    end
end
```

On the GPU this one is about three times slower than LinearAlgebra.⋅ (which uses CUBLAS for CuArrays), but it allows us to reduce in Float64. Thoughts @weymouth?