Open sumiya11 opened 7 months ago
Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer. If you correct this, you will get a small performance increase over the "normal" version:
```julia
using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        work1(polys, custom_allocator)
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator)
    res = 0
    for poly in polys
        @no_escape custom_allocator begin
            new_poly = work2(poly, custom_allocator)
            res += sum(new_poly)
        end
    end
    res
end

function work1(polys)
    res = 0
    for poly in polys
        new_poly = work2(poly)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
# 3.430 ms (10001 allocations: 20.20 MiB)
# 0x0000e20ca5f991fb

@btime work0(polys, use_custom_allocator=true)
# 3.032 ms (4 allocations: 192 bytes)
# 0x0000e20ca5f991fb
```
The code is not that well optimized, so it will give you only a very small performance improvement.
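To illustrate the mechanism, here is a toy bump allocator in Python (a sketch for intuition only, not Bumper's actual implementation): allocation just advances an offset into one preallocated buffer, and a scoped region analogous to `@no_escape` restores the offset on exit, so each iteration's temporaries are overwritten by the next instead of piling up.

```python
from contextlib import contextmanager

class BumpBuffer:
    """Toy bump allocator: alloc() advances an offset into one big buffer."""
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.offset = 0
        self.high_water = 0  # peak buffer usage, tracked for illustration

    def alloc(self, nbytes):
        start = self.offset
        self.offset += nbytes
        assert self.offset <= len(self.buf), "buffer exhausted"
        self.high_water = max(self.high_water, self.offset)
        return memoryview(self.buf)[start:self.offset]

    @contextmanager
    def no_escape(self):
        # Scoped release: restore the offset on exit, freeing everything
        # that was allocated inside the block in one step.
        saved = self.offset
        try:
            yield
        finally:
            self.offset = saved

# Without scoped release: every temporary stays live in the buffer.
leaky = BumpBuffer(1 << 20)
for _ in range(100):
    leaky.alloc(4_000)

# With scoped release: each iteration reuses the same 4 kB region.
scoped = BumpBuffer(1 << 20)
for _ in range(100):
    with scoped.no_escape():
        scoped.alloc(4_000)

print(leaky.high_water, scoped.high_water)  # 400000 4000
```

The scoped version's working set stays a single allocation wide, which is exactly why the released variant above can stay cache-resident.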
> Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer.

But that is precisely the point!
The problem is that, when you don't release arrays that are out of scope, you end up using much more memory than you actually need. In this example your working data should fit into the L1 cache (a little over 8 kB). But if you don't release the temporary data, you end up with a buffer that does not fit into cache at all (about 20 MB).
Edit: the default SlabBuffer is about 8 MB, so it should fit in the L3 cache.
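Those figures can be checked with quick back-of-the-envelope arithmetic (a sketch; it assumes the benchmark's `m, n = 1_000, 10_000` and 4-byte `UInt32` elements):

```python
ELEM = 4              # sizeof(UInt32) in bytes
m, n = 1_000, 10_000  # benchmark parameters from the example

# Per-iteration working set: the input poly plus the temporary new_poly,
# at worst m elements each -- comfortably inside a typical 32 KiB L1 cache.
per_iteration = 2 * m * ELEM
print(per_iteration)  # 8000 bytes: the "little over 8 kB"

# If temporaries are never released, the buffer accumulates all n of them;
# lengths are uniform on 1:m, so the average length is about (1 + m) / 2.
avg_len = (1 + m) / 2
accumulated = n * avg_len * ELEM
print(round(accumulated / 2**20, 2))  # ~19.09 MiB, matching the ~20 MiB figure
```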
Thanks for your answer; sorry for not being clear, let me explain.
My example is purposefully inefficient. In my original problem, releasing intermediate arrays cannot be done easily, and the allocation pattern in my example here roughly resembles it.
Hi,
I have the following example where I observe a 2x slowdown with `Bumper.alloc!`. Could you please confirm that I am using the package correctly? Do you have ideas on how to fix this?
Thank you!
Running on