DilumAluthge closed this pull request 3 years ago.
Merging #17 (1a38653) into master (5975d5c) will not change coverage. The diff coverage is 100.00%.
```diff
@@            Coverage Diff            @@
##            master       #17   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            9        10    +1
  Lines          109       122   +13
=========================================
+ Hits           109       122   +13
```
| Impacted Files | Coverage Δ |
|---|---|
| src/Octavian.jl | `100.00% <ø> (ø)` |
| src/block_sizes.jl | `100.00% <ø> (ø)` |
| src/init.jl | `100.00% <100.00%> (ø)` |
| src/matmul.jl | `100.00% <100.00%> (ø)` |
Continue to review the full report at Codecov.

Legend: `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
Actually, looks like some of the CI jobs fail even if we only multithread loop 3.
@chriselrod @MasonProtter What do you think? Is this the right approach?
There are three issues I'm currently trying to address with multithreading matmul in PaddedMatrices:

1. Multiple threads packing into, and reading from, the same block of `B`.
2. Deciding when to thread loop 3 versus loop 5.
3. Avoiding scheduling overhead by spawning only one task per thread.

A little more detail on each:
At the time I'm writing this, there's a bug in the PR where multiple threads will try to read from and write to the same block of memory when packing `B`.
One solution is to just give each thread a separate, offset pointer, so that the regions they will be reading and writing do not overlap.
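To make that concrete, here is a minimal sketch of that workaround (my illustration, not the PR's code; packing is simplified to a straight copy of contiguous columns): each task writes only through an offset pointer into its own window of the shared buffer.

```julia
# Sketch of the "separate, offset pointer" workaround: each task writes only
# to its own window of `buf`, so no two tasks touch the same memory.
# (Real packing rearranges B into panels; here it's simplified to a copy.)
function pack_B_slices!(buf::Vector{Float64}, B::Matrix{Float64})
    K, N = size(B)
    nt = Threads.nthreads()
    cols_per_task = cld(N, nt)
    GC.@preserve buf B begin
        @sync for tid in 1:nt
            lo = (tid - 1) * cols_per_task + 1
            hi = min(tid * cols_per_task, N)
            lo > hi && continue
            # offset pointers into this task's private region (byte offsets)
            dest = pointer(buf) + (lo - 1) * K * sizeof(Float64)
            src  = pointer(B)   + (lo - 1) * K * sizeof(Float64)
            Threads.@spawn unsafe_copyto!(dest, src, (hi - lo + 1) * K)
        end
    end
    return buf
end
```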
However, this is less than ideal. The packed `B` is supposed to fit in the L3 cache, and the L3 is normally shared among multiple cores. The most common situation is a single L3 shared by all the cores, although Zen2 and Zen3 CPUs have an L3 shared by clusters of 4 and 8 cores, respectively, and a dual-socket system obviously has a separate L3 per socket. But in each of these cases, the L3 is still shared by some number of cores.
Meaning, if we were to give each of these cores a separate region of memory, it wouldn't mean giving each of them their own `B` block; it would mean dividing the `B` block into pieces, and giving each of them a much smaller `B` block.
If the L3 is shared by 4 cores, `N_c` would be divided by 4; if shared by 18, it'd be divided by 18. The number of times we execute the `M` and `K` loops, which includes packing `A`, would then increase by the same factor.
A better solution would be to share the packed `B` block among the threads. It would make sense for every thread on a different block of the `M` partition to use the same full-sized `B` block.
Of course, that's what you already get when only threading loop 3, which brings us to the second issue.
This means we should favor loop 3 over loop 5. But of course, if loop 3 is too short relative to how many cores the CPU has, it'd probably be faster to thread loop 5 too. As an extreme example, say we have
```julia
using Octavian
A = rand(64, 100);
B = rand(100, 96);
C = Matrix{Float64}(undef, size(A, 1), size(B, 2));
Octavian.matmul!(C, A, B)
```
on a CPU with AVX2 and 8 cores. These matrices are small enough that we probably do not want to pack at all. We probably have `M_c = 72` and `N_c > 1000`. So does that mean we should evaluate it with a single thread?
And that if `A` had 144 rows instead, we'd want to use just 2 threads?
No, in this example we may want to partition `A` into 16×100 blocks and `B` into 100×24 blocks, and use all 8 threads: that gives 4 × 4 = 16 independent tiles of `C` to spread across them (unless threading overhead becomes problematic); see the sketch below.
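Here's a hedged sketch of that kind of blocking (an illustration using `LinearAlgebra.mul!` as the per-tile kernel, not Octavian's actual microkernel): one task per independent tile of `C`, so no two tasks ever write the same memory.

```julia
using LinearAlgebra

# Sketch: partition C into disjoint tiles and spawn one task per tile.
# For A 64×100 and B 100×96 with 16×24 tiles, that's 4 × 4 = 16 tasks
# to spread over 8 threads. Every task writes a distinct block of C.
function tiled_mul!(C, A, B; mblock = 16, nblock = 24)
    M, N = size(C)
    @sync for j in 1:nblock:N, i in 1:mblock:M
        is = i:min(i + mblock - 1, M)
        js = j:min(j + nblock - 1, N)
        Threads.@spawn mul!(view(C, is, js), view(A, is, :), view(B, :, js))
    end
    return C
end
```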
Alternatively, what if `A` and `B` were each 10_000×10_000? Would we want to spawn over 1000 tasks (10_000 / 72 * 10_000 / 1_000 ≈ 1_389, or many more if we consider the reduced `N_c` we get from splitting up the packed `B`)? No, that'd incur a lot of scheduling overhead.
Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.
If they're small, we want to cut them into little pieces. If they're big, we probably only want to parallelize loop 3, and to have just 1 task per thread that itself iterates over the `M_c`-sized chunks.
Ideally, we also wouldn't have to spawn tasks for every `N_c` iteration either, and would instead synchronize using spinlocks, for example.
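For illustration, a spinning barrier is one way to realize that spinlock-style synchronization (a sketch under the assumption of exactly one task per thread; with more tasks than threads a spin-wait like this can deadlock):

```julia
using Base.Threads: Atomic, atomic_add!

# Sketch of a spinning barrier: long-lived tasks call `spin_barrier!` once
# per N_c iteration instead of being torn down and re-spawned each time.
struct SpinBarrier
    count::Atomic{Int}       # tasks that have arrived this generation
    generation::Atomic{Int}  # incremented when everyone has arrived
    ntasks::Int
end
SpinBarrier(n::Int) = SpinBarrier(Atomic{Int}(0), Atomic{Int}(0), n)

function spin_barrier!(b::SpinBarrier)
    gen = b.generation[]
    if atomic_add!(b.count, 1) == b.ntasks - 1  # last task to arrive
        b.count[] = 0
        atomic_add!(b.generation, 1)            # release the spinners
    else
        while b.generation[] == gen end         # spin until released
    end
    return nothing
end
```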
> Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.

That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?
> Of course, that's what you already get when only threading loop 3, which brings us to the second issue.

What bugs me about this PR is that tests are failing even when we only thread loop 3.
Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.
On Julia master, if I set the `JULIA_NUM_THREADS` variable to `99`, it will give me 99 threads even if the CPU only has 8 CPU threads. Of course, the converse case is that if I set `JULIA_NUM_THREADS` to `1`, it will give me one thread even if the CPU has 8 threads.
So, maybe, we want to define `N = min(Threads.nthreads(), NUM_OF_CPU_THREADS)`, and then we spawn `N` tasks. Would that make sense?
> That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?

Unless we made a mistake implementing the algorithm. ;)
> Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case. On Julia master, if I set the `JULIA_NUM_THREADS` variable to `99`, it will give me 99 threads even if the CPU only has 8 CPU threads. So, maybe, we want to define `N = min(Threads.nthreads(), NUM_OF_CPU_THREADS)`, and then we spawn `N` tasks. Would that make sense?
Yes, but it'd be better to use `VectorizationBase.NUM_CORES` than `Sys.CPU_THREADS`.
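As a sketch under that suggestion (`VectorizationBase.NUM_CORES` is the constant the thread names; its availability depends on the VectorizationBase version, and `run_tasks`/`work!` are hypothetical names):

```julia
using VectorizationBase

# One task per physical core, and never more tasks than Julia threads:
# `min` avoids oversubscription with JULIA_NUM_THREADS=99 and respects
# JULIA_NUM_THREADS=1.
const NTASKS = min(Threads.nthreads(), VectorizationBase.NUM_CORES)

# Hypothetical driver: hand each task its id so it can pick its own chunk.
function run_tasks(work!)
    @sync for tid in 1:NTASKS
        Threads.@spawn work!(tid, NTASKS)
    end
end
```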
Note that we only have one L2 cache per physical core. So if we wanted to use more threads than we have L2 caches, we would have to reduce the size of our packed `A` matrix (meaning we'd have to reduce `M_c * K_c`).
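Back-of-the-envelope version of that constraint (the 512 KiB L2 and 2 threads per core are assumed figures, for illustration only):

```julia
# Illustrative arithmetic only: if two threads share one 512 KiB L2, each
# thread's packed A must fit in roughly half of it.
l2_bytes       = 512 * 1024
threads_per_l2 = 2
max_elements   = l2_bytes ÷ (threads_per_l2 * sizeof(Float64))
# M_c * K_c must stay below ≈ 32_768 Float64s instead of ≈ 65_536.
```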
> What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed `B`: we define `Amem` at the top, so we have only one for all the threads, making them try to read/write it at the same time, just like the bug with the packed `B`.
I'd just move that definition (and the `GC.@preserve`) inside the `@spawn` of the 3rd loop.
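A minimal sketch of that fix (the blocking parameters are assumed example values, and `pack_and_multiply!` is a hypothetical stand-in for the real loop body): allocate `Amem` inside each task so every task owns its buffer, and keep the `GC.@preserve` inside too.

```julia
# Sketch of the fix: Amem is created inside the task, so each task gets
# its own packing buffer and no two tasks share memory.
Mc, Kc = 72, 256            # example blocking parameters (assumed)
ntasks = Threads.nthreads()
@sync for i in 1:ntasks
    Threads.@spawn begin
        Amem = Vector{Float64}(undef, Mc * Kc)   # per-task buffer
        GC.@preserve Amem begin
            Aptr = pointer(Amem)
            pack_and_multiply!(Aptr, i)          # hypothetical loop body
        end
    end
end
```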
Hmmm, I've done that, but I am still getting test failures.
The code is currently still threading loop 5, so you still have the `Bptr` problem.
Should hopefully be working now.
Maybe we should make some of the test matrices a little smaller again 😂
This PR currently only threads loop 3.
https://github.com/flame/blis/blob/master/docs/Multithreading.md