Jutho / TensorOperations.jl

Julia package for tensor contractions and related operations
https://jutho.github.io/TensorOperations.jl/stable/
Other
438 stars 55 forks source link

Manual allocation strategy #143

Open amilsted opened 11 months ago

amilsted commented 11 months ago

Regarding the discussion in #140, do you think the tensoralloc, tensorfree interface is compatible with approaches like this?

Could one alternatively write backend code that uses malloc and free to avoid GC? Would this require introducing a special array wrapper type?

Is it possible to guarantee that manually allocated intermediates get freed in case of an exception, within the current interface? Would this involve defining a finalizer for the array wrapper type?

lkdvos commented 11 months ago

As promised, I've started experimenting with some strategies, it is still very much WIP but you can follow it here: AllocationKit

So far I've tried just plain malloc/free, and the finalizer version which is less prone to memory leaks, and initial benchmarks on my laptop report the underlying times. The testcase is the application of a single-site update encountered for example in DMRG or VUMPS for an Ising-like hamiltonian, repeated 30 times. (benchmark here )

In summary, it seems as if just doing this already leads to various improvements, depending on the tensor sizes. I might continue playing around this week with something like Bumper.jl, allthough I ran into some issues trying to get the StrideArray/PtrArray to work nicely with Strided.jl, so I will need to fix that first.

D = 64:

julia> result["default"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  400.062 μs … 715.187 μs  ┊ GC (min … max): 13.62% … 35.15%
 Time  (median):     509.756 μs               ┊ GC (median):    11.73%
 Time  (mean ± σ):   503.847 μs ±  51.078 μs  ┊ GC (mean ± σ):  13.61% ±  3.73%

      ▁                       ▂▂█▅▁▁       ▁                     
  ▃▂▄▄█▄▇▄▇▆▅▅▅▄▃▃▃▄▄▄▄▄▅▄▄▄▄▆██████▆▅█▇█▇▇█▅▆▅▅▅▄▄▁▃▃▄▂▃▃▁▃▃▃▃ ▄
  400 μs           Histogram: frequency by time          617 μs <

 Memory estimate: 1.63 MiB, allocs estimate: 38.

julia> result["malloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  271.277 μs … 727.405 μs  ┊ GC (min … max):  0.00% … 37.25%
 Time  (median):     345.764 μs               ┊ GC (median):    13.97%
 Time  (mean ± σ):   360.340 μs ±  61.898 μs  ┊ GC (mean ± σ):   8.75% ±  6.93%

    ▁▄▅          ▃▄▆▃█▅ ▁ ▂▁                                ▃    
  ▃████▄▆▆█▆▅▆▄▄▇████████▇████▆▅▅▅▆▃▅▄▃▃▄██▇▅▄▄▁▃▁▁▁▁▁▁▁▁▄█▇█▇▃ ▅
  271 μs           Histogram: frequency by time          493 μs <

 Memory estimate: 899.11 KiB, allocs estimate: 39.

julia> result["safemalloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  276.067 μs … 568.403 μs  ┊ GC (min … max):  0.00% … 37.13%
 Time  (median):     362.419 μs               ┊ GC (median):    13.79%
 Time  (mean ± σ):   365.961 μs ±  39.841 μs  ┊ GC (mean ± σ):   8.66% ±  6.82%

                      ▃▃▃▇▅▁ ▁             ▁▇█▇▃▃                
  ▃▅▄▄▆▄▄▂▃▃▂▁▅▃▃▂▃▃▅▆██████▇█▅▇▄▆▅▄▃▄▃▄▂▆████████▅▄▄▃▃▄▂▁▁▁▁▁▂ ▄
  276 μs           Histogram: frequency by time          448 μs <

 Memory estimate: 899.11 KiB, allocs estimate: 39.

D = 128

julia> result["default"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  1.929 ms …   3.871 ms  ┊ GC (min … max): 11.19% … 6.77%
 Time  (median):     2.403 ms               ┊ GC (median):    10.51%
 Time  (mean ± σ):   2.416 ms ± 271.936 μs  ┊ GC (mean ± σ):  10.48% ± 1.48%

                  ▁ ▃▄▇█▅                                      
  ▄▅▅▅▄▄▄▃▄▃▃▃▃▃▄▆███████▇▆▅▅▄▄▅▅▃▃▂▃▂▃▃▂▁▃▃▃▂▁▃▃▃▃▃▂▃▂▂▃▁▂▁▃ ▃
  1.93 ms         Histogram: frequency by time        3.29 ms <

 Memory estimate: 6.50 MiB, allocs estimate: 38.

julia> result["malloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  1.625 ms …   3.481 ms  ┊ GC (min … max): 5.90% … 3.47%
 Time  (median):     1.734 ms               ┊ GC (median):    6.02%
 Time  (mean ± σ):   1.886 ms ± 299.257 μs  ┊ GC (mean ± σ):  6.39% ± 1.41%

   ▂█▃▇▁                                                       
  ▅█████▆▄▃▄▅▄▄▃▃▄▃▃▃▃▃▂▃▂▂▃▃▃▃▃▃▃▃▂▃▃▃▃▃▃▄▂▂▂▂▂▂▁▃▂▃▂▂▂▂▁▁▁▂ ▃
  1.62 ms         Histogram: frequency by time        2.85 ms <

 Memory estimate: 3.50 MiB, allocs estimate: 39.

julia> result["safemalloc"][D]
BenchmarkTools.Trial: 500 samples with 30 evaluations.
 Range (min … max):  1.625 ms …   2.556 ms  ┊ GC (min … max): 6.02% … 7.28%
 Time  (median):     1.716 ms               ┊ GC (median):    6.02%
 Time  (mean ± σ):   1.761 ms ± 134.372 μs  ┊ GC (mean ± σ):  6.81% ± 1.37%

    ▂▄█▄▄▃▆▆                                                   
  ▃▄█████████▇▇▅▄▄▃▃▃▃▃▃▄▃▁▂▃▃▃▃▃▁▂▃▃▃▃▃▃▂▂▃▂▂▃▂▁▂▃▂▁▂▂▁▂▂▁▂▃ ▃
  1.63 ms         Histogram: frequency by time        2.26 ms <

 Memory estimate: 3.50 MiB, allocs estimate: 39.
amilsted commented 10 months ago

Nice! And cool that this is very simple to implement with the tensoralloc scheme.