JuliaLang/julia

Tensor contractions and overenthusiastic garbage collection #24545

Open amilsted opened 6 years ago

amilsted commented 6 years ago

Hello! I've been using Julia quite a bit for theoretical physics simulations. In particular, I do a lot of work with tensor networks, which are networks of tensors that must be contracted to compute a desired physical result. For tensor contractions, I use TensorOperations.jl by @Jutho.

I've noticed that some of my code triggers the garbage collector a lot. There seem to be a lot of allocations, a few of which are quite large, and for some reason this makes the GC run very often. Turning off the GC temporarily or preallocating memory where possible dramatically improves performance, but the former seems hacky and the latter is ugly. The clean code, which allocates arrays for intermediate results, is:

```julia
using TensorOperations

# Contract one step of the MPO transfer matrix, writing the result into TMout.
function applyTM_MPO_r!(TMout::Array{T,6}, A::Array{T,3}, B::Array{T,3},
                        o::Array{T,4}, TM2::Array{T,6})::Array{T,6} where {T}
    @tensor TMout[k1,m1,b1, k3,m3,b3] = conj(B[b1, t1, b2]) * (o[m1,s1,m2,t1] * (A[k1, s1, k2] * TM2[k2,m2,b2, k3,m3,b3]))
    TMout
end

function TMMs!(TMout, A, O, TM, itr)
    for j in 1:itr
        TMout = applyTM_MPO_r!(TMout, A, A, O, TM)
        # swap buffers so the next iteration consumes the result just computed
        TM, TMout = TMout, TM
    end
    TM
end
```
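The "turn off the GC temporarily" workaround mentioned above might look roughly like this minimal sketch; the helper name `TMMs_gcoff!` is hypothetical, not necessarily the exact code from the attached notebook (on Julia 0.6 the call was `gc_enable` rather than `GC.enable`):

```julia
# Sketch: the same loop as TMMs!, but with collection suspended around the
# contraction so the temporaries it allocates cannot trigger a GC pause mid-step.
function TMMs_gcoff!(TMout, A, O, TM, itr)
    for j in 1:itr
        GC.enable(false)    # suspend the collector
        TMout = applyTM_MPO_r!(TMout, A, A, O, TM)
        GC.enable(true)     # re-enable so accumulated garbage can be reclaimed
        TM, TMout = TMout, TM
    end
    TM
end
```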

I have compared the performance of `TMMs!()` with versions that (i) pre-allocate a "work" vector, which is then cut up to hold the intermediate results, (ii) allocate intermediate arrays using `malloc()` to avoid triggering the GC, and (iii) turn the GC off for the tensor contraction and turn it back on afterwards (thanks @vtjnash!). Here are some BenchmarkTools results on Julia 0.6.0:

| Time per itr (s) | GC time per itr (s) | Variant |
| --- | --- | --- |
| 0.05889150 | 0.00000000 | Malloc |
| 0.06375425 | 0.00065295 | Pre-allocated intermediate arrays (work vector) |
| 0.09281390 | 0.00544290 | GC off during contraction (~6% GC) |
| 0.12625435 | 0.05836750 | Unmodified (~40% GC) |
| ~0.06 | unknown | Equivalent Python code (SciPy, no preallocation, MKL BLAS) |

Note: the number of BLAS threads was set to 1 in all cases; Julia is using OpenBLAS.
Note 2: these results depend on the dimensions of the arrays involved. The dimensions I have chosen are not unrealistic.
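For anyone wanting to reproduce a benchmark of this shape: the issue doesn't state the actual array dimensions, so the sizes below are purely illustrative, chosen only to be consistent with the index pattern in `applyTM_MPO_r!`:

```julia
using BenchmarkTools, LinearAlgebra, TensorOperations

BLAS.set_num_threads(1)         # one BLAS thread, as in the table above

# Illustrative dimensions (NOT the ones behind the numbers above):
D, d, M = 20, 2, 6              # MPS bond, physical, and MPO bond dimensions
T = ComplexF64
A  = randn(T, D, d, D)          # MPS tensor, indexed A[k1, s1, k2]
O  = randn(T, M, d, M, d)       # MPO tensor, indexed o[m1, s1, m2, t1]
TM = randn(T, D, M, D, D, M, D) # transfer matrix, indexed TM2[k2, m2, b2, k3, m3, b3]
TMout = similar(TM)

@btime TMMs!($TMout, $A, $O, $TM, 10)
```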

I can't help feeling Julia could do better on the pretty ("unmodified") code. Heavy optimization is often not worth the extra debugging effort for scientific code (unless it will be used heavily), and the current results make me hesitate to recommend Julia for research on tensor-network algorithms, despite the lovely syntax provided by TensorOperations and all the other reasons I think Julia is awesome.

I was hoping somebody might have an idea for how to modify the GC heuristics to better cope with cases like this. The number of allocations hardly differs between the approaches above (around 150 per iteration), so it really seems to be the combination of many smaller allocations and some large ones that makes the GC hyperactive.
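A quick way to see the GC share directly, reusing the illustrative setup above (on recent Julia versions `@timed` returns wall time, GC time, and bytes allocated):

```julia
stats = @timed TMMs!(TMout, A, O, TM, 100)
println("total: ", round(stats.time; digits = 3), " s, ",
        "GC: ", round(100 * stats.gctime / stats.time; digits = 1), "%, ",
        "allocated: ", stats.bytes, " bytes")
```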

I have attached a Jupyter notebook with the benchmark: julia_tensor_GC_example.zip

NazerkeT commented 2 years ago

Hi Ashley and the Julia community! Sorry to revive this issue after so many years! Have there been any thoughts or updates on it?

I am considering a quarter-long project on runtime systems with Julia, and this issue looks like an interesting challenge to take on :) Is there anyone I can approach about this topic? Or does the community have other ideas about runtime systems that I could contribute to within a couple of months (one quarter)?

Thanks, best wishes,

oscardssmith commented 2 years ago

There are known issues with Julia's GC, but unless you already have some experience writing GC code (preferably in C), it is probably not within reach for this type of project (unless you're ready to put in a ton of time). If you're looking for projects, I would recommend asking on Slack or Discourse, both of which are more likely to get a community response.

NazerkeT commented 2 years ago

Got it, thanks @oscardssmith!