JuliaSIMD / ThreadingUtilities.jl

Utilities for low overhead threading in Julia.
MIT License

overhead of each task still seems large (much better than @threads tho) #24

Closed Roger-luo closed 3 years ago

Roger-luo commented 3 years ago

so I managed to implement the multithreaded version of subspace_mm: https://gist.github.com/Roger-luo/0748a53b58c55e4187b545632917e54a

It is indeed faster than the single-threaded version now (and much faster than Threads.@threads). However, when I looked into the benchmark of each task, it seems there is still some overhead (roughly 50% of each task's running time), e.g. I can manually run just one or a few of the tasks created (my thread pool is of size 12):

initialization:

```julia
T = Float64
n = 20
S = rand(ComplexF64, 100, 1<<n)
```

One can comment out some of the tasks inside GC.@preserve to see how the overhead scales; I'll post the 6-thread scaling here. When I spawn 6 tasks, the total time is 2x that of a single task. I guess this is due to the overhead of spawning the tasks to the thread pool, but it looks strange to me that it scales so significantly. I'm wondering if @chriselrod has any clue about this overhead: is it due to some sort of cache clash in the implementation itself?

  • 1 launch_subspace_mm:

details:

```julia
julia> @benchmark GC.@preserve S U_re U_im subspace_ref begin
           launch_subspace_mm(1, S, indices, U_re, U_im, subspace_ref, f1, l1)
           # launch_subspace_mm(2, S, indices, U_re, U_im, subspace_ref, f2, l2)
           # launch_subspace_mm(3, S, indices, U_re, U_im, subspace_ref, f3, l3)
           # launch_subspace_mm(4, S, indices, U_re, U_im, subspace_ref, f4, l4)
           # launch_subspace_mm(5, S, indices, U_re, U_im, subspace_ref, f5, l5)
           # launch_subspace_mm(6, S, indices, U_re, U_im, subspace_ref, f6, l6)
           ThreadingUtilities.wait(1)
           # ThreadingUtilities.wait(2)
           # ThreadingUtilities.wait(3)
           # ThreadingUtilities.wait(4)
           # ThreadingUtilities.wait(5)
           # ThreadingUtilities.wait(6)
       end
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     56.152 ms (0.00% GC)
  median time:      56.464 ms (0.00% GC)
  mean time:        56.552 ms (0.00% GC)
  maximum time:     60.565 ms (0.00% GC)
  --------------
  samples:          89
  evals/sample:     1
```
  • 2 launch_subspace_mm:
details:

```julia
julia> @benchmark GC.@preserve S U_re U_im subspace_ref begin
           launch_subspace_mm(1, S, indices, U_re, U_im, subspace_ref, f1, l1)
           launch_subspace_mm(2, S, indices, U_re, U_im, subspace_ref, f2, l2)
           # launch_subspace_mm(3, S, indices, U_re, U_im, subspace_ref, f3, l3)
           # launch_subspace_mm(4, S, indices, U_re, U_im, subspace_ref, f4, l4)
           # launch_subspace_mm(5, S, indices, U_re, U_im, subspace_ref, f5, l5)
           # launch_subspace_mm(6, S, indices, U_re, U_im, subspace_ref, f6, l6)
           ThreadingUtilities.wait(1)
           ThreadingUtilities.wait(2)
           # ThreadingUtilities.wait(3)
           # ThreadingUtilities.wait(4)
           # ThreadingUtilities.wait(5)
           # ThreadingUtilities.wait(6)
       end
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     60.429 ms (0.00% GC)
  median time:      61.072 ms (0.00% GC)
  mean time:        61.196 ms (0.00% GC)
  maximum time:     65.528 ms (0.00% GC)
  --------------
  samples:          82
  evals/sample:     1
```
  • 4 launch_subspace_mm:
details:

```julia
julia> @benchmark GC.@preserve S U_re U_im subspace_ref begin
           launch_subspace_mm(1, S, indices, U_re, U_im, subspace_ref, f1, l1)
           launch_subspace_mm(2, S, indices, U_re, U_im, subspace_ref, f2, l2)
           launch_subspace_mm(3, S, indices, U_re, U_im, subspace_ref, f3, l3)
           launch_subspace_mm(4, S, indices, U_re, U_im, subspace_ref, f4, l4)
           # launch_subspace_mm(5, S, indices, U_re, U_im, subspace_ref, f5, l5)
           # launch_subspace_mm(6, S, indices, U_re, U_im, subspace_ref, f6, l6)
           ThreadingUtilities.wait(1)
           ThreadingUtilities.wait(2)
           ThreadingUtilities.wait(3)
           ThreadingUtilities.wait(4)
           # ThreadingUtilities.wait(5)
           # ThreadingUtilities.wait(6)
       end
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  4
  --------------
  minimum time:     85.298 ms (0.00% GC)
  median time:      86.338 ms (0.00% GC)
  mean time:        86.908 ms (0.00% GC)
  maximum time:     95.323 ms (0.00% GC)
  --------------
  samples:          58
  evals/sample:     1
```
  • 6 launch_subspace_mm:
details:

```julia
julia> @benchmark GC.@preserve S U_re U_im subspace_ref begin
           launch_subspace_mm(1, S, indices, U_re, U_im, subspace_ref, f1, l1)
           launch_subspace_mm(2, S, indices, U_re, U_im, subspace_ref, f2, l2)
           launch_subspace_mm(3, S, indices, U_re, U_im, subspace_ref, f3, l3)
           launch_subspace_mm(4, S, indices, U_re, U_im, subspace_ref, f4, l4)
           launch_subspace_mm(5, S, indices, U_re, U_im, subspace_ref, f5, l5)
           launch_subspace_mm(6, S, indices, U_re, U_im, subspace_ref, f6, l6)
           ThreadingUtilities.wait(1)
           ThreadingUtilities.wait(2)
           ThreadingUtilities.wait(3)
           ThreadingUtilities.wait(4)
           ThreadingUtilities.wait(5)
           ThreadingUtilities.wait(6)
       end
BenchmarkTools.Trial:
  memory estimate:  288 bytes
  allocs estimate:  6
  --------------
  minimum time:     122.614 ms (0.00% GC)
  median time:      123.734 ms (0.00% GC)
  mean time:        123.999 ms (0.00% GC)
  maximum time:     126.589 ms (0.00% GC)
  --------------
  samples:          41
  evals/sample:     1
```
chriselrod commented 3 years ago

Note that I recently added CheapThreads.@batch, which can hopefully mostly act as a replacement for Threads.@threads.
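
A minimal sketch of what that replacement looks like (the function and loop body here are illustrative, not from the gist):

```julia
using CheapThreads  # CheapThreads (later renamed Polyester) exports @batch

function scale_add!(y, a, x)
    # Low-overhead threaded loop; a drop-in replacement for Threads.@threads here.
    @batch for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

y = rand(10_000); x = rand(10_000);
scale_add!(y, 2.0, x)
```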

Also, it is best to always run something on the main thread. Octavian/LoopVectorization/CheapThreads do this. Aside from reducing communication between threads, it also reduces the amount of time spent waiting.

However, this is much more overhead than I'd expect.

Are you on a laptop with high single core boost? Some laptops may run single core workloads at twice the clock speed as multicore loads.

Also, what versions are you on? Running the script that defines the threaded code:

julia> S1 ≈ S2
false

julia> S1[1:4,1:10]
4×10 Matrix{ComplexF64}:
  0.621745+4.88554im  -0.637415+4.86602im    0.966151+4.41336im   1.40568+4.63927im  -0.687298+3.65228im  …  0.455211+2.39196im  -0.355057+3.92583im  0.164448+3.8021im   -0.485497+3.66684im
   2.61722+4.09381im    1.23201+4.36167im  -0.0416561+3.66658im  0.840804+3.6682im    0.615881+3.37482im     0.214487+2.81938im   -1.07522+4.55756im  0.564601+4.8245im    0.992446+5.34931im
  0.571976+3.74502im  -0.288892+3.80724im     2.63142+4.83839im   1.95263+5.42144im  -0.323291+2.60659im      1.06235+3.65216im  -0.114007+5.34375im    1.0716+4.68024im   0.833657+4.09841im
 -0.465078+4.02098im  -0.627064+3.16224im     1.62749+4.02861im   1.40008+3.22084im   -0.70015+2.37163im     0.123393+2.22905im  -0.573632+3.79263im  0.334877+4.70496im  0.0900574+5.5306im

julia> S2[1:4,1:10]
4×10 Matrix{ComplexF64}:
  0.621745+4.88554im  -0.637415+4.86602im   1.59779+4.59373im   1.38767+4.27823im  -0.687298+3.65228im  …  0.245179+2.86638im  -0.664461+4.20201im  0.164448+3.8021im   -0.485497+3.66684im
   2.61722+4.09381im    1.23201+4.36167im   1.55172+2.97997im   1.47209+3.18286im   0.615881+3.37482im     0.750522+2.02171im  -0.115648+3.1445im   0.564601+4.8245im    0.992446+5.34931im
  0.571976+3.74502im  -0.288892+3.80724im   1.35182+3.99964im  0.250933+4.18534im  -0.323291+2.60659im     0.310087+2.83659im  -0.911839+3.84227im    1.0716+4.68024im   0.833657+4.09841im
 -0.465078+4.02098im  -0.627064+3.16224im  0.738949+4.73946im   0.68401+4.92848im   -0.70015+2.37163im     0.329456+3.28575im   -1.14074+5.03729im  0.334877+4.70496im  0.0900574+5.5306im

(bcqe) pkg> st
      Status `~/Documents/progwork/julia/env/bcqe/Project.toml`
  [4fba245c] ArrayInterface v3.1.6
  [29e2bfda] BQCESubroutine v0.1.0 `https://github.com/QuantumBFS/BQCESubroutine.jl.git#master`
  [b630d9fa] CheapThreads v0.1.6 `~/.julia/dev/CheapThreads`
  [bdcacae8] LoopVectorization v0.12.12 `~/.julia/dev/LoopVectorization`
  [476501e8] SLEEFPirates v0.6.15 `~/.julia/dev/SLEEFPirates`
  [d1fa6d79] StrideArrays v0.1.4 `~/.julia/dev/StrideArrays`
  [7792a7ef] StrideArraysCore v0.1.4 `~/.julia/dev/StrideArraysCore`
  [8290d209] ThreadingUtilities v0.4.1 `~/.julia/dev/ThreadingUtilities`
  [3d5dd08c] VectorizationBase v0.19.25 `~/.julia/dev/VectorizationBase`
  [66df03fb] YaoLocations v0.1.3

My results do not match.

That aside,

@inline function run_subspace_mul!(S::AbstractArray{T,3}, indices, U_re, U_im, subspace, offset, thread_range, ::StaticInt{D}) where {T<:Base.HWReal,D}
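    # C_re and C_im are statically sized D × 8 scratch buffers; with static sizes, StrideArray can keep them on the stack.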
    C_re = StrideArray{T}(undef, (StaticInt{D}(), StaticInt{8}()))
    C_im = StrideArray{T}(undef, (StaticInt{D}(), StaticInt{8}()))
    S_size = ArrayInterface.size(S)
    for s in thread_range
        subspace_mul_kernel!(S, C_re, C_im, indices, U_re, U_im, s, subspace, Int(S_size[2]), offset)
    end
    return
end
function (k::SubspaceMatrixMul{T, D})(p::Ptr{UInt}) where {T, D}
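    # Task entry point: reload the argument tuple that was stored at this task's pointer when it was launched.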
    P = Tuple{Ptr{T}, Ptr{T}, Ptr{T}, Ptr{Int}, Ptr{BitSubspace}, Tuple{Int, Int}, UnitRange{Int}, Int}
    _, (S_ptr, U_re_ptr, U_im_ptr, indices_ptr, subspace_ptr, S_size, thread_range, offset) =
        ThreadingUtilities.load(p, P, 5*sizeof(UInt))

    S = StrideArray(PtrArray(S_ptr, (StaticInt{2}(), S_size...)))
    U_re = StrideArray(PtrArray(U_re_ptr, (D, D)))
    U_im = StrideArray(PtrArray(U_im_ptr, (D, D)))
    indices = StrideArray(PtrArray(indices_ptr, (D, )))
    subspace = Base.unsafe_load(subspace_ptr)

    run_subspace_mul!(S, indices, U_re, U_im, subspace, offset, thread_range, D)
    return
end
function variable_number(S::Matrix{Complex{T}}, indices, U_re, U_im, subspace_ref, ls) where {T}
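    # ls holds cumulative chunk end-points: chunks 1:length(ls)-1 are launched on worker tasks, and the final chunk runs on the calling (main) thread before waiting on the workers.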
    GC.@preserve S indices U_re U_im subspace_ref begin
        offset = 0
        prev = 0
        for i ∈ 1:length(ls)-1
            launch_subspace_mm(i, S, indices, U_re, U_im, subspace_ref, prev+1, ls[i], offset)
            prev = ls[i]
        end
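        # Run the last chunk on the calling thread, viewing the complex matrix as a real array with a leading dimension of 2 (re/im) to match the layout the worker tasks use.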
        Sptr = Base.unsafe_convert(Ptr{T}, S)
        S3 = StrideArray(PtrArray(Sptr, (StaticInt(2), size(S)...)), S)
        run_subspace_mul!(
            S3, indices, U_re, U_im, subspace_ref[], offset, prev+1:ls[end], ArrayInterface.static_length(indices)
        )
        for i ∈ 1:length(ls)-1
            ThreadingUtilities.wait(i)
        end
    end
end

Now, benchmarking on my 4 core laptop after disabling turbo:

#! /bin/bash
echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

(re-enable via `echo "0" | ...`; how to do this will depend on the OS, drivers, etc.) I get:

julia> ls
(32768, 65536, 98304, 131072)

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     71.584 ms (0.00% GC)
  median time:      72.211 ms (0.00% GC)
  mean time:        72.212 ms (0.00% GC)
  maximum time:     72.749 ms (0.00% GC)
  --------------
  samples:          70
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:2]))
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     71.199 ms (0.00% GC)
  median time:      71.312 ms (0.00% GC)
  mean time:        71.331 ms (0.00% GC)
  maximum time:     71.584 ms (0.00% GC)
  --------------
  samples:          71
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:3]))
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     72.358 ms (0.00% GC)
  median time:      72.482 ms (0.00% GC)
  mean time:        72.518 ms (0.00% GC)
  maximum time:     73.623 ms (0.00% GC)
  --------------
  samples:          69
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:4]))
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     74.850 ms (0.00% GC)
  median time:      74.969 ms (0.00% GC)
  mean time:        75.084 ms (0.00% GC)
  maximum time:     78.253 ms (0.00% GC)
  --------------
  samples:          67
  evals/sample:     1
chriselrod commented 3 years ago

The allocation is coming from:

julia> p = Base.unsafe_convert(Ptr{BitSubspace}, subspace_ref)
Ptr{BitSubspace} @0x00007efa7be1ceb0

julia> unsafe_load(p)
131072-element BitSubspace:
 00000000000000000000
 00000000000000000010
 ⋮
 11111111111111101000
 11111111111111101010

julia> @benchmark unsafe_load($p)
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     12.102 ns (0.00% GC)
  median time:      12.722 ns (0.00% GC)
  mean time:        17.232 ns (21.91% GC)
  maximum time:     4.324 μs (99.19% GC)
  --------------
  samples:          10000
  evals/sample:     999

If you'd rather avoid the allocation, you could

julia> subspace_ref = Ref(subspace)
Base.RefValue{BitSubspace}(BitSubspace(20, 131072, 3, [0, 3, 15], [1, 1, 1]))

julia> p = Base.unsafe_convert(Ptr{BitSubspace}, subspace_ref)
Ptr{BitSubspace} @0x00007f1d686d4ee0

julia> ccall(:jl_value_ptr, Ref{Base.RefValue{BitSubspace}}, (Ptr{Cvoid},), p)
Base.RefValue{BitSubspace}(BitSubspace(20, 131072, 3, [0, 3, 15], [1, 1, 1]))

julia> ccall(:jl_value_ptr, Ref{Base.RefValue{BitSubspace}}, (Ptr{Cvoid},), p)[]
131072-element BitSubspace:
 00000000000000000000
 00000000000000000010
 ⋮
 11111111111111101000
 11111111111111101010

julia> @benchmark ccall(:jl_value_ptr, Ref{Base.RefValue{BitSubspace}}, (Ptr{Cvoid},), $p)[]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.818 ns (0.00% GC)
  median time:      1.830 ns (0.00% GC)
  mean time:        1.868 ns (0.00% GC)
  maximum time:     17.598 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
Roger-luo commented 3 years ago

This is my current configuration:

Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

Are you on a laptop with high single core boost? Some laptops may run single core workloads at twice the clock speed as multicore loads.

I tried to search for how to disable turbo boost on the 3900X, but didn't find anything. Is it an Intel-only thing? Or should I go into the BIOS to turn off SMT? (But I was only using 6 threads...)

Roger-luo commented 3 years ago

A bit of further investigation shows that this only seems to happen when the matrix S is large enough, e.g. if I change S to

n = 10;
S = rand(ComplexF64, 100, 1<<n);

the benchmark scaling is almost perfect, all around 45.270 μs for the minimum time. Now this actually confuses me, since we only allocate some temporary matrices on the stack of size D (which is the size of the square matrix U); nothing scales with the size of S except the two outermost loops.

chriselrod commented 3 years ago

I tried to search how to disable turbo boost for 3900X, but didn't find anything, is it an Intel-only thing?

AMD (and many Intel systems) use the cpufreq driver instead of Intel pstate: https://www.kernel.org/doc/html/v4.12/admin-guide/pm/cpufreq.html#the-boost-file-in-sysfs There should be a file in /sys/devices/system/cpu/cpufreq/ which, when set to 0, will disable boosting. Setting to 1 will re-enable.
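
For a quick check from the REPL (a sketch; it assumes the cpufreq driver exposes this file, and writing it requires root, so the actual toggle is easier done from a shell with the same `tee` pattern as above):

```julia
# "1\n" means boosting is enabled, "0\n" means it is disabled.
boost_file = "/sys/devices/system/cpu/cpufreq/boost"
isfile(boost_file) && print(read(boost_file, String))
```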

But you could also change settings in the bios. On a couple of my systems, I set single and multi-core speeds to be identical, e.g. 4.1/4.3/4.6 GHz for 512-bit/256-bit/other workloads, regardless of the number of active cores. This has the disadvantage of decreasing your peak single core speed, but the advantage of making benchmarking more reliable.

Aside from the fact that you found the problem goes away by changing the size of S (implying it isn't a thermal/boosting issue), I doubt it'll change very much. Maybe a few %, definitely not a 100% regression.

A bit further investigation shows that this seems only happens when the matrix S is large enough

Ah, to be fair, S is huge.

julia> ((1 << 20) * 100 * 2 * 8) / (1 << 30)
1.5625

That is 1.5625 GiB. Definitely doesn't fit in cache. So perhaps your CPU was limited on memory bandwidth for S?
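
A back-of-the-envelope check (not from the thread; it assumes the kernel streams S roughly once per call) of the footprints and of the bandwidth the machine can actually sustain:

```julia
using BenchmarkTools

# Footprint of S at the two sizes tested: n = 10 fits comfortably in cache,
# n = 20 does not.
for n in (10, 20)
    println("n = $n: ", 100 * (1 << n) * sizeof(ComplexF64) / 2^20, " MiB")
end

# Attainable streaming bandwidth, estimated from a plain copy of an S-sized array.
A = rand(ComplexF64, 100, 1 << 20); B = similar(A);
t = @belapsed copyto!($B, $A)
println("copy bandwidth ≈ ", round(2 * sizeof(A) / 2^30 / t; digits = 1), " GiB/s")
```

If the kernel has to pull most of S from DRAM on every call, a number in that range would cap the benefit of adding more threads.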

Roger-luo commented 3 years ago

There should be a file in /sys/devices/system/cpu/cpufreq/ which, when set to 0, will disable boosting. Setting to 1 will re-enable.

Ah thanks! I found it, it's /sys/devices/system/cpu/cpufreq/boost

Aside from the fact that you found the problem goes away by changing the size of S (implying it isn't a thermal/boosting issue), I doubt it'll change very much. Maybe a few %, definitely not a 100% regression.

Yeah, it almost stays the same with or without boost...

So perhaps your CPU was limited on memory bandwidth for S?

Ah, I guess it is because the memory bandwidth is saturated by S, which then limits the performance of additional threads?

chriselrod commented 3 years ago

Ah, I guess it is because the memory bandwidth is saturated by S, which then limits the performance of additional threads?

Yeah, that's what I'm guessing. I see a lot more performance degradation on the 10980XE:

julia> ls = (l1,l2,l3,l4,l5,l6)
(21846, 43692, 65537, 87382, 109227, 131072)

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:1]))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.818 ms (0.00% GC)
  median time:      36.155 ms (0.00% GC)
  mean time:        36.220 ms (0.00% GC)
  maximum time:     37.016 ms (0.00% GC)
  --------------
  samples:          139
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:2]))
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     36.694 ms (0.00% GC)
  median time:      37.106 ms (0.00% GC)
  mean time:        37.166 ms (0.00% GC)
  maximum time:     38.114 ms (0.00% GC)
  --------------
  samples:          135
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:3]))
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     39.275 ms (0.00% GC)
  median time:      39.608 ms (0.00% GC)
  mean time:        39.645 ms (0.00% GC)
  maximum time:     40.538 ms (0.00% GC)
  --------------
  samples:          127
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:4]))
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     41.150 ms (0.00% GC)
  median time:      41.533 ms (0.00% GC)
  mean time:        41.574 ms (0.00% GC)
  maximum time:     42.273 ms (0.00% GC)
  --------------
  samples:          121
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:5]))
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  4
  --------------
  minimum time:     44.290 ms (0.00% GC)
  median time:      45.160 ms (0.00% GC)
  mean time:        45.135 ms (0.00% GC)
  maximum time:     46.032 ms (0.00% GC)
  --------------
  samples:          111
  evals/sample:     1

julia> @benchmark variable_number($S, $indices, $U_re, $U_im, $subspace_ref, $(ls[1:6]))
BenchmarkTools.Trial:
  memory estimate:  240 bytes
  allocs estimate:  5
  --------------
  minimum time:     49.509 ms (0.00% GC)
  median time:      49.843 ms (0.00% GC)
  mean time:        49.839 ms (0.00% GC)
  maximum time:     50.488 ms (0.00% GC)
  --------------
  samples:          101
  evals/sample:     1

julia> versioninfo()
Julia Version 1.7.0-DEV.802
Commit 8d998dc8ec* (2021-04-02 13:34 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

But no sudden 2x.

Something else to note, in case I forgot to mention it earlier: If you start Julia with 6 threads, you can only launch for 1:5. Trying 6 will probably segfault; the assumption is that if you want 6 threads, you will launch 1:5 and then run the last chunk on that thread. Because this yields drastically better performance than launching 6 and waiting, I don't intend to support that.

Roger-luo commented 3 years ago

If you start Julia with 6 threads, you can only launch for 1:5. Trying 6 will probably segfault; the assumption is that if you want 6 threads, you will launch 1:5 and then run the last chunk on that thread. Because this yields drastically better performance than launching 6 and waiting, I don't intend to support that.

Thanks for the note; I'm currently using this strategy as well, after you suggested using the main thread. Aside from the bandwidth problem, it works quite well now. Thanks! It seems to be something strange on my desktop (I don't get good acceleration with other ways of multithreading in this case either). I'll close this and leave an issue in BQCESubroutine in case someone hits it in the future.