Open amilsted opened 11 months ago
As promised, I've started experimenting with some strategies, it is still very much WIP but you can follow it here: AllocationKit
So far I've tried just plain malloc/free, and the finalizer version which is less prone to memory leaks, and initial benchmarks on my laptop report the underlying times. The testcase is the application of a single-site update encountered for example in DMRG or VUMPS for an Ising-like hamiltonian, repeated 30 times. (benchmark here )
In summary, it seems as if just doing this already leads to various improvements, depending on the tensor sizes. I might continue playing around this week with something like Bumper.jl, allthough I ran into some issues trying to get the StrideArray
/PtrArray
to work nicely with Strided.jl, so I will need to fix that first.
D = 64:
julia> result["default"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 400.062 μs … 715.187 μs ┊ GC (min … max): 13.62% … 35.15%
Time (median): 509.756 μs ┊ GC (median): 11.73%
Time (mean ± σ): 503.847 μs ± 51.078 μs ┊ GC (mean ± σ): 13.61% ± 3.73%
▁ ▂▂█▅▁▁ ▁
▃▂▄▄█▄▇▄▇▆▅▅▅▄▃▃▃▄▄▄▄▄▅▄▄▄▄▆██████▆▅█▇█▇▇█▅▆▅▅▅▄▄▁▃▃▄▂▃▃▁▃▃▃▃ ▄
400 μs Histogram: frequency by time 617 μs <
Memory estimate: 1.63 MiB, allocs estimate: 38.
julia> result["malloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 271.277 μs … 727.405 μs ┊ GC (min … max): 0.00% … 37.25%
Time (median): 345.764 μs ┊ GC (median): 13.97%
Time (mean ± σ): 360.340 μs ± 61.898 μs ┊ GC (mean ± σ): 8.75% ± 6.93%
▁▄▅ ▃▄▆▃█▅ ▁ ▂▁ ▃
▃████▄▆▆█▆▅▆▄▄▇████████▇████▆▅▅▅▆▃▅▄▃▃▄██▇▅▄▄▁▃▁▁▁▁▁▁▁▁▄█▇█▇▃ ▅
271 μs Histogram: frequency by time 493 μs <
Memory estimate: 899.11 KiB, allocs estimate: 39.
julia> result["safemalloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 276.067 μs … 568.403 μs ┊ GC (min … max): 0.00% … 37.13%
Time (median): 362.419 μs ┊ GC (median): 13.79%
Time (mean ± σ): 365.961 μs ± 39.841 μs ┊ GC (mean ± σ): 8.66% ± 6.82%
▃▃▃▇▅▁ ▁ ▁▇█▇▃▃
▃▅▄▄▆▄▄▂▃▃▂▁▅▃▃▂▃▃▅▆██████▇█▅▇▄▆▅▄▃▄▃▄▂▆████████▅▄▄▃▃▄▂▁▁▁▁▁▂ ▄
276 μs Histogram: frequency by time 448 μs <
Memory estimate: 899.11 KiB, allocs estimate: 39.
D = 128
julia> result["default"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 1.929 ms … 3.871 ms ┊ GC (min … max): 11.19% … 6.77%
Time (median): 2.403 ms ┊ GC (median): 10.51%
Time (mean ± σ): 2.416 ms ± 271.936 μs ┊ GC (mean ± σ): 10.48% ± 1.48%
▁ ▃▄▇█▅
▄▅▅▅▄▄▄▃▄▃▃▃▃▃▄▆███████▇▆▅▅▄▄▅▅▃▃▂▃▂▃▃▂▁▃▃▃▂▁▃▃▃▃▃▂▃▂▂▃▁▂▁▃ ▃
1.93 ms Histogram: frequency by time 3.29 ms <
Memory estimate: 6.50 MiB, allocs estimate: 38.
julia> result["malloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 1.625 ms … 3.481 ms ┊ GC (min … max): 5.90% … 3.47%
Time (median): 1.734 ms ┊ GC (median): 6.02%
Time (mean ± σ): 1.886 ms ± 299.257 μs ┊ GC (mean ± σ): 6.39% ± 1.41%
▂█▃▇▁
▅█████▆▄▃▄▅▄▄▃▃▄▃▃▃▃▃▂▃▂▂▃▃▃▃▃▃▃▃▂▃▃▃▃▃▃▄▂▂▂▂▂▂▁▃▂▃▂▂▂▂▁▁▁▂ ▃
1.62 ms Histogram: frequency by time 2.85 ms <
Memory estimate: 3.50 MiB, allocs estimate: 39.
julia> result["safemalloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
Range (min … max): 1.625 ms … 2.556 ms ┊ GC (min … max): 6.02% … 7.28%
Time (median): 1.716 ms ┊ GC (median): 6.02%
Time (mean ± σ): 1.761 ms ± 134.372 μs ┊ GC (mean ± σ): 6.81% ± 1.37%
▂▄█▄▄▃▆▆
▃▄█████████▇▇▅▄▄▃▃▃▃▃▃▄▃▁▂▃▃▃▃▃▁▂▃▃▃▃▃▃▂▂▃▂▂▃▂▁▂▃▂▁▂▂▁▂▂▁▂▃ ▃
1.63 ms Histogram: frequency by time 2.26 ms <
Memory estimate: 3.50 MiB, allocs estimate: 39.
Nice! And cool that this is very simple to implement with the tensoralloc
scheme.
Regarding the discussion in #140, do you think the
tensoralloc
,tensorfree
interface is compatible with approaches like this?Could one alternatively write backend code that uses
malloc
andfree
to avoid GC? Would this require introducing a special array wrapper type?Is it possible to guarantee that manually allocated intermediates get freed in case of an exception, within the current interface? Would this involve defining a finalizer for the array wrapper type?