ITensor / NDTensors.jl

A Julia package for n-dimensional sparse tensors.
Apache License 2.0

Threaded blocksparse #59

Closed mtfishman closed 3 years ago

mtfishman commented 3 years ago

This adds multithreading to block sparse contractions using Julia's native multithreading (Threads.@spawn). The basic interface is as follows. Start Julia with more than one thread with:

julia -t 4

or

JULIA_NUM_THREADS=4 julia

Then, enable block sparse multithreading with NDTensors.enable_threaded_blocksparse() (you should probably also turn off BLAS and Strided multithreading, which we could automate within NDTensors):

using ITensors
using LinearAlgebra

BLAS.set_num_threads(1)
ITensors.Strided.disable_threads()

@show Threads.nthreads()

N = 4
d = 1000
i = Index([QN(0, N) => d for n in 1:N], "i")

# Matrix multiplication
A = randomITensor(i'', dag(i'))
B = randomITensor(i', dag(i))

ITensors.disable_threaded_blocksparse()
@show ITensors.using_threaded_blocksparse()
C1 = @time A * B

ITensors.enable_threaded_blocksparse()
@show ITensors.using_threaded_blocksparse()
C2 = @time A * B
ITensors.disable_threaded_blocksparse()

@show C1 ≈ C2

This prints:

Threads.nthreads() = 4
ITensors.using_threaded_blocksparse() = false
  2.204602 seconds (508 allocations: 122.121 MiB)
ITensors.using_threaded_blocksparse() = true
  0.688744 seconds (1.76 k allocations: 122.205 MiB)
C1 ≈ C2 = true
mtfishman commented 3 years ago

Results for the 6x3 2D Hubbard with momentum conservation:

With 1 thread:

use_splitblocks = false
nnz(H[end ÷ 2]) = 120
nnzblocks(H[end ÷ 2]) = 56
After sweep 1 energy=-5.861157015734 maxlinkdim=78 time=0.651
After sweep 2 energy=-12.397725588037 maxlinkdim=200 time=6.533
After sweep 3 energy=-13.472412477129 maxlinkdim=400 time=12.536
After sweep 4 energy=-13.627475223589 maxlinkdim=800 time=8.016
After sweep 5 energy=-13.693628527488 maxlinkdim=2000 time=14.626
After sweep 6 energy=-13.711928225390 maxlinkdim=3000 time=24.676
After sweep 7 energy=-13.715575192226 maxlinkdim=3000 time=28.102
After sweep 8 energy=-13.716394378223 maxlinkdim=3000 time=28.623
After sweep 9 energy=-13.716535094341 maxlinkdim=3000 time=27.619
After sweep 10 energy=-13.716556326664 maxlinkdim=3000 time=26.235
177.620945 seconds (659.95 M allocations: 281.461 GiB, 7.58% gc time)

With 2 threads:

use_splitblocks = false
nnz(H[end ÷ 2]) = 120
nnzblocks(H[end ÷ 2]) = 56
After sweep 1 energy=-5.861157015734 maxlinkdim=78 time=0.944
After sweep 2 energy=-12.397725588024 maxlinkdim=200 time=7.474
After sweep 3 energy=-13.472412477127 maxlinkdim=400 time=13.684
After sweep 4 energy=-13.627475223589 maxlinkdim=800 time=10.164
After sweep 5 energy=-13.693628527488 maxlinkdim=2000 time=14.541
After sweep 6 energy=-13.711928225391 maxlinkdim=3000 time=20.496
After sweep 7 energy=-13.715575192226 maxlinkdim=3000 time=23.069
After sweep 8 energy=-13.716394378223 maxlinkdim=3000 time=23.485
After sweep 9 energy=-13.716535094341 maxlinkdim=3000 time=23.540
After sweep 10 energy=-13.716556326664 maxlinkdim=3000 time=22.450
159.854112 seconds (1.58 G allocations: 356.487 GiB, 13.33% gc time)

With 5 threads:

use_splitblocks = false
nnz(H[end ÷ 2]) = 120
nnzblocks(H[end ÷ 2]) = 56
After sweep 1 energy=-5.861157015736 maxlinkdim=78 time=0.873
After sweep 2 energy=-12.397725587989 maxlinkdim=200 time=5.909
After sweep 3 energy=-13.472412477117 maxlinkdim=400 time=10.250
After sweep 4 energy=-13.627475223586 maxlinkdim=800 time=8.148
After sweep 5 energy=-13.693628527487 maxlinkdim=2000 time=12.126
After sweep 6 energy=-13.711928225391 maxlinkdim=3000 time=17.470
After sweep 7 energy=-13.715575192226 maxlinkdim=3000 time=18.624
After sweep 8 energy=-13.716394378223 maxlinkdim=3000 time=18.773
After sweep 9 energy=-13.716535094341 maxlinkdim=3000 time=18.649
After sweep 10 energy=-13.716556326664 maxlinkdim=3000 time=17.888
128.715973 seconds (1.58 G allocations: 357.182 GiB, 17.87% gc time)

The same calculation with C++ ITensor with 1 thread:

    vN Entropy at center bond b=9 = 3.561531540785
    Eigs at center bond b=9: 0.1118 0.1117 0.0949 0.0949 0.0544 0.0543 0.0543 0.0543 0.0246 0.0246
    Largest link dim during sweep 10/10 was 3000
    Largest truncation error: 2.27609e-07
    Energy after sweep 10/10 is -13.716559596412
    Sweep 10/10 CPU time = 39.08s (Wall time = 43.97s)

and C++ ITensor with 5 threads:

    vN Entropy at center bond b=9 = 3.561531540785
    Eigs at center bond b=9: 0.1118 0.1117 0.0949 0.0949 0.0544 0.0543 0.0543 0.0543 0.0246 0.0246
    Largest link dim during sweep 10/10 was 3000
    Largest truncation error: 2.27609e-07
    Energy after sweep 10/10 is -13.716559596412
    Sweep 10/10 CPU time = 1m, 42.4s (Wall time = 22.13s)
codecov-io commented 3 years ago

Codecov Report

Merging #59 (536631f) into master (4a6ab7d) will decrease coverage by 1.60%. The diff coverage is 18.82%.


@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
- Coverage   34.19%   32.58%   -1.61%     
==========================================
  Files          23       23              
  Lines        2372     2765     +393     
==========================================
+ Hits          811      901      +90     
- Misses       1561     1864     +303     
Impacted Files                        Coverage Δ
src/blocksparse/block.jl              39.21% <0.00%> (+3.04%) ↑
src/blocksparse/combiner.jl            0.00% <ø> (ø)
src/blocksparse/linearalgebra.jl      43.63% <ø> (-3.59%) ↓
src/diag.jl                           11.55% <ø> (+0.38%) ↑
src/linearalgebra.jl                  32.43% <ø> (+0.85%) ↑
src/NDTensors.jl                      25.00% <16.66%> (-17.86%) ↓
src/blocksparse/blocksparsetensor.jl  41.52% <18.30%> (-5.86%) ↓
src/dense.jl                          42.92% <100.00%> (-2.48%) ↓
... and 18 more

mtfishman commented 3 years ago

The timings are updated with a new version of NDTensors. You can see that the single-threaded code is quite fast at this point, and the speedup from multithreading is not as pronounced as it is for the C++ version. Getting a better speedup may require more advanced multithreading techniques (for example, better handling of the case where repeated contractions contribute to the same output block). Speedups may be better for larger system sizes with more blocks.
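
To give a concrete picture of that kind of strategy, here is a hedged sketch, not the NDTensors implementation: contract_grouped! and the toy Dict layout are made up for illustration. Pairwise block multiplications are grouped by output block, one task is spawned per output block, and contributions are accumulated serially inside each task so no two tasks write to the same block:

# Hypothetical sketch: group block-block multiplications by their output
# block, spawn one task per output block, and accumulate within the task.
# No two tasks write to the same output block, so no locking is needed.
function contract_grouped!(C, contributions)
  @sync for (blockC, pairs) in contributions
    Threads.@spawn for (Ablock, Bblock) in pairs
      C[blockC] .+= Ablock * Bblock
    end
  end
  return C
end

# Toy usage: output block 1 receives three contributions, block 2 one.
contributions = Dict(1 => [(randn(10, 10), randn(10, 10)) for _ in 1:3],
                     2 => [(randn(10, 10), randn(10, 10))])
C = Dict(b => zeros(10, 10) for b in keys(contributions))
contract_grouped!(C, contributions)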

Note that this multithreading doesn't seem to compose well with BLAS multithreading or Strided multithreading. In principle it should compose with Strided multithreading, since both use Julia's native @spawn framework, but that needs to be investigated. A question that arises is whether we should automatically turn off BLAS and Strided multithreading when NDTensors multithreading is in use. In principle that would be nice, but in practice it becomes a question of where to turn it off (i.e. globally, or locally only where our threading is being used).
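
For reference, the "local" option could look something like the following sketch. with_blas_threads is a hypothetical helper (not an existing NDTensors or ITensors function), and it assumes a Julia version where BLAS.get_num_threads is available:

using LinearAlgebra

# Hypothetical helper: run f with a given number of BLAS threads, then
# restore whatever the user had set before.
function with_blas_threads(f, n::Int)
  old = BLAS.get_num_threads()
  BLAS.set_num_threads(n)
  try
    return f()
  finally
    BLAS.set_num_threads(old)
  end
end

# Hypothetical usage: single-threaded BLAS just for this contraction.
# C = with_blas_threads(() -> A * B, 1)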

Anyway, I think this is good to go for now, and we can iterate on it as needed.

emstoudenmire commented 3 years ago

It is already giving quite a worthwhile speedup, so then I think it's good to go and we could always iterate on it more. The main thing I wanted to see is that the defaults make sense (e.g. you still have to opt in to this even if running Julia in a multithreaded environment, which is what we want for now). So then in the docs we can guide people through the best practices like turning off BLAS multithreading etc.

One minor question is: I see timings for 5 threads above. What's the speedup over single-threaded (including single-threaded BLAS and Strided) when just using two threads, say?

mtfishman commented 3 years ago

> It is already giving quite a worthwhile speedup, so then I think it's good to go and we could always iterate on it more. The main thing I wanted to see is that the defaults make sense (e.g. you still have to opt in to this even if running Julia in a multithreaded environment, which is what we want for now). So then in the docs we can guide people through the best practices like turning off BLAS multithreading etc.

Sounds good. Indeed, block sparse threading is off by default even if Julia is started with multiple threads.

> One minor question is: I see timings for 5 threads above. What's the speedup over single-threaded (including single-threaded BLAS and Strided) when just using two threads, say?

Good question, I'll check.

mtfishman commented 3 years ago

I've updated the second comment with results for 2 threads. I also rewrote the loop with partitioning, which improved the results a bit for 5 threads.

For future reference, here is a comparison between different ways to multithread a for loop with native Julia multithreading:

using BenchmarkTools

# f1!: serial baseline, no threading.
function f1!(y, x)
  for n in eachindex(x)
    @inbounds y[n] = sin(x[n])
  end
  return y
end

# f2!: Threads.@threads automatically chunks the loop across threads.
function f2!(y, x)
  Threads.@threads for n in eachindex(x)
    @inbounds y[n] = sin(x[n])
  end
  return y
end

# f3!: Threads.@spawn one task per iteration (large scheduling overhead).
function f3!(y, x)
  @sync for n in eachindex(x)
    Threads.@spawn begin
      @inbounds y[n] = sin(x[n])
    end
  end
  return y
end

# f4!: manually partition the loop and spawn one task per partition.
function f4!(y, x)
  @sync for n_partition in Iterators.partition(eachindex(x), max(1, length(eachindex(x)) ÷ Threads.nthreads()))
    Threads.@spawn for n in n_partition
      @inbounds y[n] = sin(x[n])
    end
  end
  return y
end

N = 20_000
x = randn(N)
y = similar(x)
@btime f1!($y, $x)
@btime f2!($y, $x)
@btime f3!($y, $x)
@btime f4!($y, $x)

where with 4 threads I get:

  187.558 μs (0 allocations: 0 bytes)
  47.332 μs (21 allocations: 2.98 KiB)
  8.680 ms (140025 allocations: 15.45 MiB)
  51.871 μs (36 allocations: 3.56 KiB)

You can see that f2! and f4! get nearly perfect speedup, while f3! gets no speedup (and is in fact worse than not threading at all). f2! uses the higher-level Threads.@threads macro, which automatically splits the for loop into chunks and sends those chunks to different threads. In f3!, every iteration of the loop spawns a new piece of work for Julia to allocate to a thread, and right now each spawn has quite a bit of overhead:

julia> x = 1;

julia> @btime sin($x)
  11.959 ns (0 allocations: 0 bytes)
0.8414709848078965

julia> @btime fetch(Threads.@spawn sin($x))
  1.703 μs (7 allocations: 785 bytes)
0.8414709848078965

so spawning a task per iteration is much slower than the work done in the iteration. To fix this, f4! first splits the outer for loop into partitions of work, and each partition is spawned as a single chunk for Julia to distribute to a thread, so Threads.@spawn is only called about once per thread and the overhead is minimized. Note that @sync is necessary around the for loop to tell Julia to wait for all of the spawned work to finish before moving on from the loop: https://stackoverflow.com/questions/61905127/what-is-the-difference-between-threads-spawn-and-threads-threads

While Threads.@threads is easier to use for simple threading of a for loop, my understanding is that it doesn't compose well with Threads.@spawn, so if within the loop we want to make use of other multithreading (like for example Strided, BLAS, etc.) then it is best to use Threads.@spawn.
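
As a toy illustration of that composition (nothing NDTensors-specific; inner_sum and outer_sums are made up for this example), the outer loop can spawn a task per item while the work inside each task spawns further tasks, and Julia schedules all of them over the same pool of threads:

# Inner function spawns its own task; safe to call from inside another task.
function inner_sum(x)
  mid = length(x) ÷ 2
  t = Threads.@spawn sum(view(x, 1:mid))
  return sum(view(x, (mid + 1):length(x))) + fetch(t)
end

# Outer loop spawns one task per array; the nested tasks compose because
# they all run on the same thread pool managed by Julia's task scheduler.
function outer_sums(xs)
  tasks = [Threads.@spawn inner_sum(x) for x in xs]
  return fetch.(tasks)
end

xs = [randn(10^6) for _ in 1:8]
outer_sums(xs)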

Essentially, the latest change rewrote the loop from the style of f3! to the style of f4! (though some partitioning was already being done, so the difference is not as drastic as in this toy example).

emstoudenmire commented 3 years ago

Thanks Matt. These comments and findings will be useful to refer back to as we do multithreading in other places. It will be interesting to see how composable the multithreading is if, say, we multithread over multiple MPS in a sum, while simultaneously multithreading over QNITensor blocks.