MasonProtter / Bumper.jl

Bring Your Own Stack
MIT License

Slowdown when using `alloc!` #33

Open sumiya11 opened 7 months ago

sumiya11 commented 7 months ago

Hi,

I have the following example where I observe a 2x slowdown with Bumper.alloc!.

Could you please confirm that I am using the package correctly? Do you have any ideas on how to fix this?

Thank you!

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
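        # NOTE: this @no_escape block wraps the entire call to work1, so every
        # temporary allocated with alloc! stays live in the buffer until work1 returns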
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19

@btime work0(polys, use_custom_allocator=true)
#   14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19

Running on

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC) 
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
tjjarvinen commented 6 months ago

Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer. If you correct this, you will get a small performance increase over the "normal" version:

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        work1(polys, custom_allocator)
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator)
    res = 0
    for poly in polys
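        # @no_escape now wraps a single iteration: the buffer offset is rolled back
        # when each block ends, so every poly reuses the same memory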
        @no_escape custom_allocator begin
            new_poly = work2(poly, custom_allocator)
            res += sum(new_poly)
        end
    end
    res
end

function work1(polys)
    res = 0
    for poly in polys
        new_poly = work2(poly)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   3.430 ms (10001 allocations: 20.20 MiB)
# 0x0000e20ca5f991fb

@btime work0(polys, use_custom_allocator=true)
#  3.032 ms (4 allocations: 192 bytes)
# 0x0000e20ca5f991fb

The code is not that well optimized to begin with, so Bumper will give you only a very small performance improvement here.

sumiya11 commented 6 months ago

> Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer.

But that is precisely the point!

tjjarvinen commented 6 months ago

The problem is that when you don't release arrays that are out of scope, you end up using much more memory than you actually need. In this example your data should fit into the L1 cache (a little over 8 kB), but if you don't release the temporary data, you end up with a buffer that does not fit into cache at all (about 20 MB).

Edit: the default SlabBuffer is about 8 MB, so it should fit in the L3 cache.
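
For a rough sense of scale, here is a back-of-envelope sketch of those two numbers (derived from the benchmark parameters above, not from inspecting the buffer):

# Working-set arithmetic for the benchmark above: m = 1_000, n = 10_000,
# element type UInt32 (4 bytes).
m, n = 1_000, 10_000

# With the buffer reset every iteration, the live data is one input poly plus
# one temporary new_poly, each at most m elements:
per_iteration_bytes = 2 * m * sizeof(UInt32)       # 8_000 bytes, roughly 8 kB

# With nothing released, every temporary stays in the buffer; the polys have
# an average length of about m/2 elements:
total_buffer_bytes = n * (m ÷ 2) * sizeof(UInt32)  # 20_000_000 bytes, roughly 20 MB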

sumiya11 commented 6 months ago

Thanks for your answer; sorry for not being clear, let me explain.

My example is purposefully inefficient. In my original problem, releasing intermediate arrays cannot be done easily, and the allocation pattern in the example here roughly resembles it.
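
For illustration, here is a hypothetical sketch of such a pattern (the function and its names are made up, not sumiya11's actual code; it only uses the SlabBuffer/alloc!/@no_escape calls already shown above): every intermediate array is still needed after the loop, so no single iteration can release its allocation, and the buffer can only be reset once the whole computation is done.

using Bumper

# Hypothetical sketch: each iteration bump-allocates an intermediate array that
# a later pass still reads, so per-iteration @no_escape blocks are not applicable.
function combine_polys(polys::Vector{Vector{UInt32}})
    buf = Bumper.SlabBuffer()
    @no_escape buf begin
        intermediates = Vector{AbstractVector{UInt32}}(undef, length(polys))
        for (i, poly) in enumerate(polys)
            new_poly = Bumper.alloc!(buf, UInt32, length(poly))
            new_poly .= poly              # stand-in for the real per-poly work
            intermediates[i] = new_poly   # kept alive past this iteration
        end
        # a final pass that reads every intermediate; only after this point
        # could the bump-allocated memory safely be released
        sum(sum, intermediates)
    end
end

In a pattern like this the buffer necessarily grows to hold all intermediates at once, so the cache argument above applies and some slowdown relative to per-iteration release is to be expected.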