JuliaGPU / Adapt.jl

Other
86 stars 24 forks source link

Use recursion to fix inference failure #78

Closed charleskawczynski closed 4 months ago

charleskawczynski commented 4 months ago

This PR inlines, and uses recursion to fix cases where Adapt.adapt performs runtime dispatch due to either recursion limits, or inability to unroll map on Tuples. I have and will post a reproducer shortly.

Closes #79.

charleskawczynski commented 4 months ago

I'm verifying (here) that this branch is solely responsible, but this seems to reduce gpu allocations for us by a factor of 15: From build https://buildkite.com/clima/climaatmos-ci/builds/17118#018e020d-d9a0-4f26-b8c7-04013be0b169

[ Info: `allocs (flame_gpu_implicit_barowave_moist)`: 3637320

to build https://buildkite.com/clima/climaatmos-ci/builds/17105#018e004b-661b-404c-96e9-fafa28d875ed:

[ Info: `allocs (flame_gpu_implicit_barowave_moist)`: 241464

This basically fixes runtime dispatch issues that we're seeing (mentioned here, but in a different context).

charleskawczynski commented 4 months ago

Here is a reproducer:

git clone https://github.com/CliMA/ClimaCore.jl
cd ClimaCore.jl/
julia --project=examples
import ClimaCore;
import ClimaComms;
using Test
import ClimaCore.Fields
import ClimaCore.Operators
import ClimaCore.Spaces
import ClimaCore.Geometry
using BenchmarkTools;
@isdefined(TU) || include(joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"));
import .TestUtilities as TU;
FT = Float64;
zelem=25
cspace = TU.CenterExtrudedFiniteDifferenceSpace(FT;zelem, helem=10);
fspace = Spaces.FaceExtrudedFiniteDifferenceSpace(cspace);
const CT3 = Geometry.Contravariant3Vector
const C12 = ClimaCore.Geometry.Covariant12Vector
@show ClimaComms.device(cspace)
ᶜx = fill((;uₕ=zero(C12{FT}) ,ρ=FT(0)), cspace);
ᶠx = fill((;ᶠuₕ³=zero(CT3{FT})), fspace);

const ᶠwinterp = Operators.WeightedInterpolateC2F(
    bottom = Operators.Extrapolate(),
    top = Operators.Extrapolate(),
)

function set_ᶠuₕ³!(ᶜx, ᶠx)
    ᶜJ = Fields.local_geometry_field(ᶜx).J
    @. ᶠx.ᶠuₕ³ = ᶠwinterp(ᶜx.ρ * ᶜJ, CT3(ᶜx.uₕ))
    return nothing
end

set_ᶠuₕ³!(ᶜx, ᶠx) # compile
p_allocated = @allocated set_ᶠuₕ³!(ᶜx, ᶠx)
@show p_allocated

using CUDA
using JET
using Adapt
# @test_opt ignored_modules = (CUDA,) set_ᶠuₕ³!(ᶜx, ᶠx)
import Base.Broadcast: broadcasted
ᶜJ = Fields.local_geometry_field(ᶜx).J
bc = broadcasted(ᶠwinterp, broadcasted(*, ᶜx.ρ, ᶜJ), broadcasted(CT3, ᶜx.uₕ));
Adapt.adapt(CUDA.KernelAdaptor(), bc.args);
# Call chain:
@test_opt ignored_modules = (CUDA,) Adapt.adapt(CUDA.KernelAdaptor(), bc.args);
@test_opt ignored_modules = (CUDA,) Adapt.adapt_structure(CUDA.KernelAdaptor(), bc.args);
@test_opt map(Adapt.adapt(CUDA.KernelAdaptor()), bc.args);

. You may need to have JET in your local environment for this reproducer to work. The JET tests fail on main, and pass in this branch.

Any chance we can get this merged? cc @maleadt

charleskawczynski commented 4 months ago

Ok, an update on this: this PR is about half responsible for the allocations mentioned above. So, I'd really appreciate if we can merge this.

maleadt commented 4 months ago

Bluntly applying @inline everywhere is not particularly satisfying; most of the time the Julia's inliner is smart enough to do the right thing. Do you have an idea which methods specifically are problematic?

charleskawczynski commented 4 months ago

Bluntly applying @inline everywhere is not particularly satisfying; most of the time the Julia's inliner is smart enough to do the right thing. Do you have an idea which methods specifically are problematic?

It’s the call to map that is problematic. Should I try to only apply it to that method?

maleadt commented 4 months ago

Should I try to only apply it to that method?

If that resolves the problem, I think that would be a cleaner solution. The other methods should normally all be inlined already.

charleskawczynski commented 4 months ago

Ok, I've updated the PR. It turns out that @inline is not even necessary. So, I've only replaced map with recursion, which still fixes the issue. I've also updated the reproducer, which I think should "work" better (I tried it just now and ran into version incompatibilities).

charleskawczynski commented 4 months ago

I don't understand why, but it seems that the compiler is suddenly having issues with adapt_structure(to, xs::NamedTuple) = map(x->adapt(to,x), xs) once adapt_structure(to, xs::Tuple) = map(x->adapt(to,x), xs) is replaced with recursion. Changing adapt_structure(to, xs::NamedTuple) = map(x->adapt(to,x), xs) to

adapt_structure(to, xs::NamedTuple{names}) where {names} =
    NamedTuple{names}(adapt_structure(to, Tuple(xs)))

fixes things (and it still infers in my original example). I've pushed this fix. Is this alright @maleadt?

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.02%. Comparing base (83fc6c6) to head (9b91675).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #78 +/- ## ========================================== + Coverage 93.65% 94.02% +0.37% ========================================== Files 6 6 Lines 63 67 +4 ========================================== + Hits 59 63 +4 Misses 4 4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

maleadt commented 4 months ago

You used map(x->adapt(to,x), xs) while there used to be adapt(to) aka. Base.Fix1(adapt, to). Not sure why that makes a difference, but it fixes tests here.

charleskawczynski commented 4 months ago

~Interesting! This begs the question: does adapt_structure(to, xs::Union{Tuple,NamedTuple}) = map(x->adapt(to, x), xs) work more generally, and fix my original issue? I'll try that now.~ I misread that (I had it reversed). Ok, changes look good.