JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

First attempt at multithreading: thread loop 3 only #17

Closed: DilumAluthge closed this 3 years ago

DilumAluthge commented 3 years ago

This PR currently only threads loop 3.

https://github.com/flame/blis/blob/master/docs/Multithreading.md
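
For orientation, here is a minimal sketch of the BLIS-style five-loop nest described in the linked docs, using their loop numbering. The blocking parameters (M_c, K_c, N_c) and the naive inner update are illustrative stand-ins, not Octavian's actual packing kernels:

```julia
# Hedged sketch of the BLIS five-loop blocked matmul. "Loop 3" is the `ic` loop
# over M_c-sized row panels of A; this PR spawns tasks over that loop.
function blocked_matmul!(C, A, B; M_c = 72, K_c = 256, N_c = 1024)
    M, N = size(C); K = size(A, 2)
    fill!(C, zero(eltype(C)))
    for jc in 1:N_c:N                      # loop 5: N_c-wide column panels of B and C
        jr = jc:min(jc + N_c - 1, N)
        for pc in 1:K_c:K                  # loop 4: K_c-deep panels; B is packed here
            pr = pc:min(pc + K_c - 1, K)
            for ic in 1:M_c:M              # loop 3: M_c-tall row panels; A is packed here
                ir = ic:min(ic + M_c - 1, M)
                # loops 2 and 1 plus the microkernel, collapsed into a naive update:
                @views C[ir, jr] .+= A[ir, pr] * B[pr, jr]
            end
        end
    end
    return C
end
```

Threading loop 3 means the iterations of the `ic` loop run as separate tasks, all reading the same packed B panel.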

codecov[bot] commented 3 years ago

Codecov Report

Merging #17 (1a38653) into master (5975d5c) will not change coverage. The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master       #17   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9        10    +1     
  Lines          109       122   +13     
=========================================
+ Hits           109       122   +13     
| Impacted Files | Coverage Δ |
|---|---|
| src/Octavian.jl | 100.00% <ø> (ø) |
| src/block_sizes.jl | 100.00% <ø> (ø) |
| src/init.jl | 100.00% <100.00%> (ø) |
| src/matmul.jl | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

DilumAluthge commented 3 years ago

Actually, looks like some of the CI jobs fail even if we only multithread loop 3.

DilumAluthge commented 3 years ago

@chriselrod @MasonProtter What do you think? Is this the right approach?

chriselrod commented 3 years ago

There are three issues I'm currently trying to address with multithreading matmul in PaddedMatrices:

  1. Packing B.
  2. Good utilization.
  3. Overhead.

A little more detail on each:

Packing B

At the time I'm writing this, there's a bug in the PR where multiple threads try to read from and write to the same block of memory when packing B. One solution is to give each thread a separate, offset pointer, so that the regions they read and write do not overlap.
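
In sketch form (views standing in for the raw offset pointers; `per_thread_buffers` and `block_len` are illustrative names, not code from the PR):

```julia
# Carve one backing allocation into disjoint per-thread regions, so no two
# threads pack into overlapping memory.
function per_thread_buffers(::Type{T}, block_len, nthreads = Threads.nthreads()) where {T}
    buf = Vector{T}(undef, block_len * nthreads)
    return [view(buf, (t - 1) * block_len + 1 : t * block_len) for t in 1:nthreads]
end
```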

However, this is less than ideal.

The packed B is supposed to fit in the L3 cache. The L3 cache is normally shared among multiple cores. The most common situation is that there's a single L3 cache shared by all the cores, although with Zen2 and Zen3 CPUs there is an L3 cache shared by clusters of 4 and 8 cores, respectively. And obviously a dual socket system will have separate L3s for each socket.

But in each of these cases, the L3 is still shared by some number of cores. So if we were to give each of those cores a separate region of memory, we wouldn't be giving each its own Bblock; we'd be dividing the Bblock into pieces and giving each core a much smaller one. If the L3 is shared by 4 cores, N_c would be divided by 4; if shared by 18, it'd be divided by 18.

The number of times we execute the M and K loops, which includes packing A, would then increase by the same factor: if a 4-way split cuts N_c from 1000 to 250, loop 5 runs four times as many iterations, and A gets packed four times as often.

A better solution would be to share the packed Bblock among the threads: every thread working on a different block of the M partition would use the same full-sized Bblock.

Of course, that's what you already get when only threading loop 3, which brings us to

Good utilization

This means we should favor loop 3 over loop 5. But of course, if loop 3 is too short relative to how many cores the CPU has, it'd probably be faster to thread loop 5 too. As an extreme example, say we have

```julia
using Octavian
A = rand(64, 100);
B = rand(100, 96);
C = Matrix{Float64}(undef, size(A, 1), size(B, 2));
Octavian.matmul!(C, A, B)  # C = A * B, in place
```

on a CPU with AVX2 and 8 cores. These matrices are small enough that we probably do not want to pack at all.

We probably have M_c = 72 and N_c > 1000. So does that mean we should evaluate it with a single thread? That if A had 144 rows instead, we'd want to use just 2 threads?

No, in this example we may want to partition A into 16x100 blocks, and B into 100x24 blocks, and use all 8 threads. (Unless threading overhead becomes problematic.)
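
A hedged sketch of that partitioning, with `LinearAlgebra.mul!` standing in for Octavian's serial kernel and the 16/24 tile sizes taken from the example above:

```julia
using LinearAlgebra

# Split C into m_b×n_b tiles and give each tile its own task.
function tiled_matmul!(C, A, B; m_b = 16, n_b = 24)
    M, N = size(C)
    @sync for i in 1:m_b:M, j in 1:n_b:N
        ir, jr = i:min(i + m_b - 1, M), j:min(j + n_b - 1, N)
        Threads.@spawn @views mul!(C[ir, jr], A[ir, :], B[:, jr])
    end
    return C
end
```

On the 64×100 by 100×96 example this spawns 4 × 4 = 16 tasks for the 8 cores.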

Alternatively, what if A and B were each 10_000x10_000? Would we want to spawn >1000 tasks (10_000 / 72 * 10_000 / 1000 ≈ 1_389 -- or many more, if we consider the reduced N_c we get by splitting up Bpacked)? No, that'd incur a lot of scheduling overhead.

Overhead

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs. If the inputs are small, we want to cut them into little pieces. If they're big, we probably only want to parallelize loop 3, with just 1 task per thread that itself iterates over the M_c-sized chunks. Ideally, we also wouldn't have to spawn tasks for every N_c iteration either, and could instead synchronize using spinlocks, for example.
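
A hedged sketch of that shape for loop 3, with a static split of the M range and `mul!` as a stand-in kernel (names are illustrative):

```julia
using LinearAlgebra

# Exactly `ntasks` tasks; each owns a contiguous row range and iterates over
# its own chunks internally (collapsed here into one serial mul! call).
function loop3_per_thread!(C, A, B; ntasks = Threads.nthreads())
    M = size(C, 1)
    @sync for t in 1:ntasks
        lo, hi = div((t - 1) * M, ntasks) + 1, div(t * M, ntasks)
        lo > hi && continue
        Threads.@spawn @views mul!(C[lo:hi, :], A[lo:hi, :], B)
    end
    return C
end
```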

DilumAluthge commented 3 years ago

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.

That sounds like it would probably be the simplest approach. And in that case, two different tasks would never be trying to write to the same memory, right?

DilumAluthge commented 3 years ago

Of course, that's what you already get when only threading loop 3, which brings us to

What bugs me about this PR is that tests are failing even when we only thread loop 3.

DilumAluthge commented 3 years ago

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

Of course, the converse case is that if I set the JULIA_NUM_THREADS variable to 1, it will give me one thread even if the CPU has 8 threads.

So, maybe, we want to define N = min(Threads.nthreads(), NUM_OF_CPU_THREADS), and then we spawn N tasks. Would that make sense?

chriselrod commented 3 years ago

That sounds like it would probably be the simplest approach. And in that case, two different tasks would never be trying to write to the same memory, right?

Unless we made a mistake implementing the algorithm. ;)

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

So, maybe, we want to define N = min(Threads.nthreads(), NUM_OF_CPU_THREADS), and then we spawn N tasks. Would that make sense?

Yes, but it'd be better to use VectorizationBase.NUM_CORES than Sys.CPU_THREADS.

Note that we only have one L2 cache per physical core. So if we wanted to use more threads than we have L2 caches, we would have to reduce the size of our packed A matrix (meaning we'd have to reduce M_c * K_c).
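
Putting those two points together, the task count might look something like this (a hedged one-liner; `num_tasks` is an illustrative name, while `VectorizationBase.NUM_CORES` is the constant mentioned above):

```julia
using VectorizationBase

# Never more tasks than Julia threads, and never more than physical cores
# (one L2 cache, and hence one packed-A buffer, per core).
num_tasks() = min(Threads.nthreads(), VectorizationBase.NUM_CORES)
```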

chriselrod commented 3 years ago

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed B. We define Amem at the top, so there's only one buffer for all the threads, and they all try to read/write it at the same time, just like the bug with the packed B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.
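
In sketch form (Amem is the buffer from the discussion; the packing and inner loops are stand-ins, not the PR's actual code):

```julia
# Allocate (and GC.@preserve) the A-pack buffer inside each spawned task, so
# tasks no longer race on a single shared Amem.
function thread_loop3(M, M_c, K_c)
    @sync for ic in 1:M_c:M                  # loop 3
        Threads.@spawn begin
            Amem = Vector{Float64}(undef, M_c * K_c)  # per-task buffer
            GC.@preserve Amem begin
                # pack A[ic:ic+M_c-1, pc:pc+K_c-1] into Amem, then run loops 2/1
                # and the microkernel against the shared packed B
            end
        end
    end
end
```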

DilumAluthge commented 3 years ago

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed B. We define Amem at the top, so there's only one buffer for all the threads, and they all try to read/write it at the same time, just like the bug with the packed B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.

Hmmm, I've done that, but I am still getting test failures.

chriselrod commented 3 years ago

The code is currently still threading loop 5, so you still have the Bptr problem.

DilumAluthge commented 3 years ago

Should hopefully be working now.

DilumAluthge commented 3 years ago

[screenshot]

DilumAluthge commented 3 years ago

Maybe we should make some of the test matrices a little smaller again 😂