**Closed** · charleskawczynski closed this 4 months ago
I'm verifying (here) that this branch is solely responsible, but it seems to reduce GPU allocations for us by a factor of about 15: from build https://buildkite.com/clima/climaatmos-ci/builds/17118#018e020d-d9a0-4f26-b8c7-04013be0b169
[ Info: `allocs (flame_gpu_implicit_barowave_moist)`: 3637320
to build https://buildkite.com/clima/climaatmos-ci/builds/17105#018e004b-661b-404c-96e9-fafa28d875ed:
[ Info: `allocs (flame_gpu_implicit_barowave_moist)`: 241464
This basically fixes the runtime-dispatch issues that we're seeing (mentioned here, but in a different context).
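For reference, the factor-of-15 figure follows directly from the two allocation counts quoted above:

```julia
# Ratio of the allocation counts reported by the two builds above.
before = 3637320   # allocs on the base build
after  = 241464    # allocs on this branch
println(before / after)  # ≈ 15.06
```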
Here is a reproducer:
```
git clone https://github.com/CliMA/ClimaCore.jl
cd ClimaCore.jl/
julia --project=examples
```

```julia
import ClimaCore
import ClimaComms
using Test
import ClimaCore.Fields
import ClimaCore.Operators
import ClimaCore.Spaces
import ClimaCore.Geometry
using BenchmarkTools
@isdefined(TU) || include(joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"))
import .TestUtilities as TU
FT = Float64
zelem = 25
cspace = TU.CenterExtrudedFiniteDifferenceSpace(FT; zelem, helem = 10);
fspace = Spaces.FaceExtrudedFiniteDifferenceSpace(cspace);
const CT3 = Geometry.Contravariant3Vector
const C12 = ClimaCore.Geometry.Covariant12Vector
@show ClimaComms.device(cspace)
ᶜx = fill((; uₕ = zero(C12{FT}), ρ = FT(0)), cspace);
ᶠx = fill((; ᶠuₕ³ = zero(CT3{FT})), fspace);
const ᶠwinterp = Operators.WeightedInterpolateC2F(
    bottom = Operators.Extrapolate(),
    top = Operators.Extrapolate(),
)
function set_ᶠuₕ³!(ᶜx, ᶠx)
    ᶜJ = Fields.local_geometry_field(ᶜx).J
    @. ᶠx.ᶠuₕ³ = ᶠwinterp(ᶜx.ρ * ᶜJ, CT3(ᶜx.uₕ))
    return nothing
end
set_ᶠuₕ³!(ᶜx, ᶠx) # compile
p_allocated = @allocated set_ᶠuₕ³!(ᶜx, ᶠx)
@show p_allocated
using CUDA
using JET
using Adapt
# @test_opt ignored_modules = (CUDA,) set_ᶠuₕ³!(ᶜx, ᶠx)
import Base.Broadcast: broadcasted
ᶜJ = Fields.local_geometry_field(ᶜx).J;
bc = broadcasted(ᶠwinterp, broadcasted(*, ᶜx.ρ, ᶜJ), broadcasted(CT3, ᶜx.uₕ));
Adapt.adapt(CUDA.KernelAdaptor(), bc.args);
# Call chain:
@test_opt ignored_modules = (CUDA,) Adapt.adapt(CUDA.KernelAdaptor(), bc.args)
@test_opt ignored_modules = (CUDA,) Adapt.adapt_structure(CUDA.KernelAdaptor(), bc.args)
@test_opt map(Adapt.adapt(CUDA.KernelAdaptor()), bc.args)
```
You may need to have JET in your local environment for this reproducer to work. The JET tests fail on `main`, and pass on this branch.
Any chance we can get this merged? cc @maleadt
Ok, an update on this: this PR is responsible for about half of the allocations mentioned above. So, I'd really appreciate it if we can merge this.
Bluntly applying `@inline` everywhere is not particularly satisfying; most of the time Julia's inliner is smart enough to do the right thing. Do you have an idea which methods specifically are problematic?

> Bluntly applying `@inline` everywhere is not particularly satisfying; most of the time Julia's inliner is smart enough to do the right thing. Do you have an idea which methods specifically are problematic?
It’s the call to `map` that is problematic. Should I try to only apply it to that method?
> Should I try to only apply it to that method?

If that resolves the problem, I think that would be a cleaner solution. The other methods should normally all be inlined already.
Ok, I've updated the PR. It turns out that `@inline` is not even necessary, so I've only replaced `map` with recursion, which still fixes the issue. I've also updated the reproducer, which I think should "work" better (I tried the old one just now and ran into version incompatibilities).
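For context, "replacing `map` with recursion" over a tuple looks roughly like the following sketch. The names here (`adapt_rec`, `f`) are illustrative stand-ins, not Adapt.jl's actual internals:

```julia
# Illustrative sketch: unrolling a tuple with explicit recursion instead of
# `map`, which is the shape of change this PR makes. `adapt_rec` and `f`
# are hypothetical stand-ins for Adapt.jl internals.
adapt_rec(f, xs::Tuple{}) = ()
adapt_rec(f, xs::Tuple) = (f(first(xs)), adapt_rec(f, Base.tail(xs))...)

# Behaves like `map(f, xs)` on a heterogeneous tuple:
result = adapt_rec(x -> x isa Number ? Float64(x) : x, (1, 2f0, "s"))
println(result)  # (1.0, 2.0, "s")
```

The base case on `Tuple{}` plus `Base.tail` gives the compiler one concrete method to specialize per tuple length, which can sidestep the inference limits that `map` with a closure sometimes hits.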
I don't understand why, but it seems that the compiler suddenly has issues with

```julia
adapt_structure(to, xs::NamedTuple) = map(x -> adapt(to, x), xs)
```

once

```julia
adapt_structure(to, xs::Tuple) = map(x -> adapt(to, x), xs)
```

is replaced with recursion. Changing the `NamedTuple` method to

```julia
adapt_structure(to, xs::NamedTuple{names}) where {names} =
    NamedTuple{names}(adapt_structure(to, Tuple(xs)))
```

fixes things (and it still infers in my original example). I've pushed this fix. Is this alright, @maleadt?
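A self-contained sketch of that `NamedTuple` trick, using a hypothetical `myadapt` rather than Adapt.jl's real `adapt_structure`:

```julia
# Hypothetical stand-in illustrating the fix: adapt a NamedTuple by
# recursing through its Tuple of values, then rebuild with the same names.
myadapt(to, x) = to(x)                       # leaf case
myadapt(to, xs::Tuple{}) = ()
myadapt(to, xs::Tuple) = (myadapt(to, first(xs)), myadapt(to, Base.tail(xs))...)
myadapt(to, xs::NamedTuple{names}) where {names} =
    NamedTuple{names}(myadapt(to, Tuple(xs)))

nt = myadapt(x -> Float64(x), (a = 1, b = 2f0))
println(nt)  # (a = 1.0, b = 2.0)
```

Because `names` is a type parameter, `NamedTuple{names}(...)` reattaches the field names at compile time, so routing through the `Tuple` method loses no type information.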
All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.02%. Comparing base (`83fc6c6`) to head (`9b91675`).
You used `map(x -> adapt(to, x), xs)` while there used to be `adapt(to)`, aka `Base.Fix1(adapt, to)`. Not sure why that makes a difference, but it fixes tests here.
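For readers unfamiliar with that shorthand: a one-argument `Base.Fix1` form is equivalent to the closure form. A minimal sketch with a stand-in function (`myadapt` is hypothetical, not Adapt.jl itself):

```julia
# Sketch: a "curried" one-argument form built with Base.Fix1, compared to
# an anonymous closure. `myadapt` is a hypothetical stand-in for Adapt.adapt.
myadapt(to, x) = to(x)
curried = Base.Fix1(myadapt, Float64)   # behaves like x -> myadapt(Float64, x)

println(map(curried, (1, 2, 3)))                    # (1.0, 2.0, 3.0)
println(map(x -> myadapt(Float64, x), (1, 2, 3)))   # same result
```

The two forms compute the same thing, but `Base.Fix1` produces a concrete callable type rather than a fresh anonymous closure at each definition site, which can matter for specialization.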
~~Interesting! This begs the question: does `adapt_structure(to, xs::Union{Tuple,NamedTuple}) = map(x -> adapt(to, x), xs)` work more generally, and fix my original issue? I'll try that now.~~ I misread that (I had it reversed). Ok, the changes look good.
This PR adds inlining, and uses recursion, to fix cases where `Adapt.adapt` performs runtime dispatch due to either recursion limits or an inability to unroll `map` over `Tuple`s. I have a reproducer and will post it shortly.

Closes #79.