I would also like to update the optimization example in this PR, but I think the updated solver is ready for review.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 94.29%. Comparing base (36dc237) to head (d451d2b).
:exclamation: Current head d451d2b differs from pull request most recent head 924ed2c. Please upload reports for the commit 924ed2c to get more accurate results.
I had a look at it and it all seems fine to me, but maybe @marinlauber can also double-check it.
About improving the inner product, what are you thinking of? Isn't the problem just that the Float32 noise is at the level of the divergence-free criterion? Wouldn't casting to Float64 help here?
We can't cast because we don't have a working array to do this with, like df in pressure force. PCG is already using every single scalar field in the code. We could add another one for this purpose and see if it helps...
I meant to cast within a custom kernel. I will play with it and see if that works.
Give it a try.
I made a test to compare a custom dot product using mapreduce vs LinearAlgebra.⋅. It performs well on GPU, but on CPU the allocation size scales with the array size N. This is not coming from the type promotion.
using CUDA
using LinearAlgebra: ⋅
function dot_mapreduce64(a, b)
mapreduce(+, a, b) do x, y
promote_type(Float64,eltype(a))(x*y)
end
end
N = (100,100,100)  # 1e6 elements, consistent with the results printed below
v = rand(Float32,N)
v_gpu = v |> CuArray
println(v⋅v)
println(v_gpu⋅v_gpu)
println(dot_mapreduce64(v,v))
println(dot_mapreduce64(v_gpu,v_gpu))
@time v⋅v
CUDA.@time v_gpu⋅v_gpu
@time dot_mapreduce64(v,v)
CUDA.@time dot_mapreduce64(v_gpu,v_gpu)
333715.56
333717.56
333717.5831831272
333717.5831831272
0.000246 seconds (1 allocation: 16 bytes)
0.000065 seconds (15 CPU allocations: 240 bytes)
0.001476 seconds (3 allocations: 7.629 MiB)
0.000368 seconds (163 CPU allocations: 5.016 KiB) (2 GPU allocations: 200 bytes, 10.66% memmgmt time)
This can't be a problem unique to us. What is the difference between the LinearAlgebra routine and ours?
For CPU arrays, LinearAlgebra has a simple generic method of dot:
https://github.com/JuliaLang/julia/blob/4e1fd72042133f0dfdcc3f4c3ce5fa74f76bb9c5/stdlib/LinearAlgebra/src/generic.jl#L957
But for GPU arrays, CUDA.jl just offers an interface to CUBLAS.
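Roughly, that generic method just accumulates in the element type of the inputs, so the accumulator stays Float32 for Float32 arrays. A simplified sketch of the idea (not the exact stdlib source):
function dot_generic_sketch(x, y)
    s = zero(eltype(x)) * zero(eltype(y))   # accumulator in the promoted *input* type (Float32 here)
    @simd for i in eachindex(x, y)
        @inbounds s += x[i] * y[i]
    end
    return s
end
So the only real difference from dot_mapreduce64 on CPU is the accumulator type. The large CPU allocation most likely comes from Base's two-array mapreduce fallback, which (as far as I know) materializes map(f, a, b) before reducing, whereas the GPU packages fuse the multi-argument mapreduce.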
As a last resort, we could implement a @simd version of this on CPU and extend it in the CUDA extension file with the proposed method. Also, I have asked on Slack to see if someone comes up with an explanation for the mapreduce allocations on CPU.
This does not work on GPU for me; it rightfully complains about (scalar) indexing.
I suggest including the following in utils.jl
function ⋅(a, b)
@assert size(a) == size(b) "`size(a)` and `size(b)` do not match."
s = zero(promote_type(Float64,eltype(a)))
@simd for i ∈ eachindex(a)
@inbounds s += a[i]*b[i]
end
s
end
which is as fast as LinearAlgebra.⋅ and non-allocating (only the Float64 return value). Then we can reimplement it in the CUDA and AMD extensions with
function ⋅(a, b)
@assert size(a) == size(b) "`size(a)` and `size(b)` do not match."
mapreduce(+, a, b) do x, y
promote_type(Float64,eltype(a))(x*y)
end
end
On GPU, this one is 3 times slower than LinearAlgebra.⋅ (which uses CUBLAS for CuArrays), but it allows us to reduce in Float64. Thoughts @weymouth?
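As a quick, illustrative sanity check (reusing v, v_gpu and dot_mapreduce64 from the benchmark above; not part of the PR), the promoted reduction should reproduce a Float64 reference to round-off, while the Float32 result is visibly off:
ref = sum(abs2, Float64.(v))                  # Float64 reference
println(ref - dot_mapreduce64(v_gpu, v_gpu))  # ≈ 0 (round-off only)
println(ref - (v ⋅ v))                        # Float32 dot: error of order 1 (≈2 in the run above)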
I have found a few dynamic body cases (including the new Optim example) where solver! diverges (or nearly diverges) when using T=Float32 (or Dual{Float32}). I think this is because the inner product within PCG is not super accurate when using Float32, especially on a deep MG level with small residuals, so you can end up adding noise instead of making progress.

A good solution would be to improve the inner product function. That's not trivial, but it would be a good thing to look into.
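A minimal, self-contained illustration of the effect (made-up values, not WaterLily code): accumulating the inner product of a small residual-like field in Float32 loses several digits compared to a Float64 accumulator:
function accumulate_dot(T, r)      # naive sequential accumulation in type T
    s = zero(T)
    for x in r
        s += T(x) * T(x)
    end
    return s
end
r = rand(Float32, 2^20) .* 1f-3    # small residual-like field (illustrative values)
err = abs(accumulate_dot(Float32, r) - accumulate_dot(Float64, r))
println(err / accumulate_dot(Float64, r))   # relative error far above Float64 round-off
Once that relative error is comparable to the convergence tolerance, the PCG update directions are essentially noise.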
What I've done instead is stabilize the solver a little by checking if the preconditioned residual r'*D^{-1}*r is increasing instead of decreasing (as it should). This is a good indication that something's going wrong, and that you should stop working at this level. I've set a tolerance of a 10% increase for GMG smoothing, and 100% for pure PCG solving.

I also created an extra parameter ml.maxlevels to limit the number of levels in a V-cycle, avoiding recreating levels when incrementing down.
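For reference, a minimal sketch of that safeguard logic (illustrative only: guarded_smooth!, smoothstep! and resid are hypothetical names, not the actual WaterLily implementation). Here tol=1.1 corresponds to the 10% allowance used for GMG smoothing, and tol=2.0 to the 100% allowance for pure PCG:
# smoothstep! performs one smoothing/PCG sweep at a level; resid returns the
# preconditioned residual r'*D^{-1}*r for that level (both hypothetical callables).
function guarded_smooth!(smoothstep!, resid, level; tol=1.1)
    rho_old = resid(level)
    smoothstep!(level)
    rho_new = resid(level)
    return rho_new ≤ tol * rho_old   # false → residual grew too much, stop working at this level
end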