CliMA / ClimaCore.jl

CliMA model dycore
https://clima.github.io/ClimaCore.jl/dev
Apache License 2.0
87 stars 8 forks source link

`eltype` occurs at runtime in `getidx` for at least some MatrixField operations #1871

Open charleskawczynski opened 4 months ago

charleskawczynski commented 4 months ago

I was inspired by some performance helpdesk discussion over the weekend (xref https://github.com/JuliaLang/julia/issues/55009), and made a very simple reproducer for our matrix-field getidx. I recently had a hunch that we should hoist our eltype calls, but @dennisYatunin pointed out that this should simply return a compile-time constant. The benchmark shows, however, that we're spending most of the time in eltype:

#=
julia --project
using Revise; include(joinpath("test", "Operators", "finitedifference", "getidx.jl"))
=#
using ClimaComms
ClimaComms.@import_required_backends
import BenchmarkTools
import ClimaCore
@isdefined(TU) || include(
    joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"),
);
include(joinpath(pkgdir(ClimaCore), "test", "MatrixFields", "matrix_fields_broadcasting", "test_scalar_utils.jl"))
import .TestUtilities as TU;
using Test
using JET

import ClimaCore.MatrixFields: ⋅
import ClimaCore: Utilities, Spaces, Fields, Operators
import LazyBroadcast: @lazy

function call_getidx(space, bc, loc, idx, hidx)
    @inbounds Operators.getidx(space, bc, loc, idx, hidx)
    return nothing
end

device = ClimaComms.device()
space = TU.CenterExtrudedFiniteDifferenceSpace(
    FT;
    zelem = 30,
    helem = 4,
    context = ClimaComms.context(device),
)
bc = @lazy @. ᶠᶜmat ⋅ ᶜᶠmat ⋅
    (2 * ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat + ᶠᶠmat ⋅ ᶠᶠmat / 3 - (4I,)) ⋅ ᶠᶠmat ⋅
    (ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat * 2 - (ᶠᶠmat / 3) ⋅ ᶠᶠmat + (4I,)) ⋅ ᶠᶠmat;
loc = Operators.Interior()
idx = 10
idx = Utilities.PlusHalf(10)
hidx = (1, 1, 1)
call_getidx(space, bc, loc, idx, hidx)
BenchmarkTools.@benchmark call_getidx(space, bc, loc, idx, hidx)

import Profile, ProfileCanvas

function do_work(space, bc, loc, idx, hidx, n)
    for i in 1:n
        call_getidx(space, bc, loc, idx, hidx)
    end
    return nothing
end

do_work(space, bc, loc, idx, hidx, 1)
Profile.clear()
prof = Profile.@profile do_work(space, bc, loc, idx, hidx, 10^6)
results = Profile.fetch()
Profile.clear()
ProfileCanvas.html_file("flame.html", results)
Screenshot 2024-07-08 at 9 41 00 AM

Does this mean that the compiler is not caching the inference result? @vchuravy

The good news is that: 1) there is clearly room for improving our emitted code, and 2) this is a really nice way to make reproducers that captures the fully complexity of getidx while only targeting a single point in space, and 3) this impacts both the CPU and the GPU, since these instructions could easily increase register usage on the gpu.

charleskawczynski commented 4 months ago

1880 helped a lot, so I'll rename the title.

charleskawczynski commented 4 months ago

Here's the flame graph for latest version:

Screenshot 2024-07-11 at 2 31 05 PM