astro-group-bristol / Gradus.jl

Extensible spacetime agnostic general relativistic ray-tracing (GRRT).
https://astro-group-bristol.github.io/Gradus.jl/dev/
GNU General Public License v3.0
18 stars 2 forks source link

Reduce allocations in rendering #10

Closed fjebaker closed 2 years ago

fjebaker commented 2 years ago

Tracking allocations and analyzing with Coverage.jl shows that we allocate enormously during problem setup:

 CoverageTools.MallocInfo(0, "dev/Gradus/src/metrics/boyer-lindquist-fo.jl.101246.mem", 335)
 CoverageTools.MallocInfo(736, "dev/Gradus/src/Rendering/render.jl.101246.mem", 80)
 CoverageTools.MallocInfo(736, "dev/Gradus/src/Rendering/render.jl.101246.mem", 81)
 CoverageTools.MallocInfo(6528, "dev/Gradus/src/Rendering/render.jl.101246.mem", 77)
 CoverageTools.MallocInfo(7360, "dev/Gradus/src/GeodesicTracer/constraints.jl.101246.mem", 29)

The matrix dispatch already has inplace modifications, but for e.g. vectors of static vectors:

https://github.com/astro-group-bristol/Gradus.jl/blob/4356a1abcdca6acdf05b58f4c96ad95cf6cc8b61/src/GeodesicTracer/constraints.jl#L26-L30

There is no need to use map here, we should be able to re-use the same container.

fjebaker commented 2 years ago

A small update:

using Gradus, StaticArrays
m = BoyerLindquistAD(M=1.0, a=0.998)

u = @SVector [0.0, 1000.0, deg2rad(90), 0.0]

img = @time rendergeodesics(
    m, u, 2000.0; 
    abstol=1e-9, reltol=1e-9, 
    image_width=200 * 4, 
    image_height=200 * 4, 
    fov_factor=13.0 * 4,
    #verbose = false,
)

# 12 threads:
# 55.465356 seconds (1.47 G allocations: 51.276 GiB, 33.66% gc time, 4.16% compilation time)
# without gc 37s

# after some alloc hunting
# 51.923469 seconds (1.47 G allocations: 48.570 GiB, 35.03% gc time, 0.01% compilation time)

# 24 threads:
# 41.287283 seconds (1.47 G allocations: 48.598 GiB, 52.45% gc time)

We're losing as much as 50+% of our render time just to GC. This needs to be optimized.

fjebaker commented 2 years ago

I rewrote the renderer (PR pending) and the GC is just insane:

using Gradus, StaticArrays
m = BoyerLindquistAD(M=1.0, a=0.998)

u = @SVector [0.0, 1000.0, deg2rad(90), 0.0]

img = @time rendergeodesics(
    m, u, 2000.0; 
    abstol=1e-5, reltol=1e-5,
    image_width=200 * 4 + 1, 
    image_height=200 * 4 + 1, 
    fov_factor=13.0 * 4,
    verbose = true
)

# + Starting trace...
#   8.556839 seconds (454.59 M allocations: 20.855 GiB)
# + Trace complete.
#   8.562259 seconds (454.59 M allocations: 20.859 GiB)

# + Starting trace...
#  12.083763 seconds (454.59 M allocations: 20.855 GiB, 31.28% gc time)
# + Trace complete.
#  12.090315 seconds (454.59 M allocations: 20.859 GiB, 31.26% gc time)

# + Starting trace...
#  25.039618 seconds (454.60 M allocations: 20.855 GiB, 53.35% gc time)
# + Trace complete.
#  25.043922 seconds (454.60 M allocations: 20.859 GiB, 53.34% gc time)

# julia> GC.gc()

# + Starting trace...
#   7.484149 seconds (455.62 M allocations: 20.903 GiB)
# + Trace complete.
#   7.492415 seconds (455.62 M allocations: 20.908 GiB)

If the GC isn't hit, we outperform the hand-written elliptic curve integrator YNOGK which benchmarked at 12 seconds for the identical problem on the same machine (Typhon, running 12 / 128 threads). If the GC is hit, we can be nearly twice as slow.

fjebaker commented 2 years ago

I've been able to get this right down:

using Gradus, StaticArrays
m = BoyerLindquistAD(M=1.0, a=0.998)

u = @SVector [0.0, 1000.0, deg2rad(90), 0.0]

img = @time rendergeodesics(
    m, u, 2000.0; 
    abstol=1e-9, reltol=1e-9,
    image_width=200 * 4, 
    image_height=200 * 4, 
    fov_factor=13.0 * 4,
    verbose = true
)

# + Starting trace...
# Rendering: 100%[========================================] Time: 0:00:14
# + Trace complete.
#  14.672502 seconds (15.57 M allocations: 5.439 GiB)

Note that is at 1e-9 accuracy! For 1e-5:

# + Starting trace...
# Rendering: 100%[========================================] Time: 0:00:04
# + Trace complete.
#   4.442164 seconds (15.60 M allocations: 5.440 GiB)

I noticed, when using SecondOrderODEProblem instead of ODEProblem (either because of the array tooling used under the hood or otherwise) the allocations are far greater. I've converted Gradus.jl to use ODEProblems, which also have the added bonus that position is u[1:4] for both first and second order now.

This also puts us in a better position for #8 should we ever want to revist that.