STOR-i / GaussianProcesses.jl

A Julia package for Gaussian Processes
https://stor-i.github.io/GaussianProcesses.jl/latest/
Other
308 stars 53 forks source link

implement multi-threading for covariance and gradient matrices (WIP) #129

Open maximerischard opened 4 years ago

maximerischard commented 4 years ago

Single-threaded benchmark (JULIA_NUM_THREADS=1):

│ Row │ kernel                         │ times   │
│     │ String                         │ Float64 │
├─────┼────────────────────────────────┼─────────┤
│ 1   │ (se+mat12)*rq                  │ 2532.37 │
│ 2   │ fix(se, σ)                     │ 519.113 │
│ 3   │ mask(se, [1])                  │ 622.843 │
│ 4   │ mask(se, [1])+mask(rq, [2:10]) │ 1949.92 │
│ 5   │ mat12                          │ 612.091 │
│ 6   │ rq                             │ 1613.57 │
│ 7   │ se                             │ 609.192 │
│ 8   │ se*mat12                       │ 955.519 │
│ 9   │ se*rq                          │ 2169.75 │
│ 10  │ se+mat12                       │ 846.811 │
│ 11  │ se+mat12+rq                    │ 2108.69 │
│ 12  │ se+rq                          │ 2001.41 │

Multi-threaded benchmark (JULIA_NUM_THREADS=10):

│ Row │ kernel                         │ times   │
│     │ String                         │ Float64 │
├─────┼────────────────────────────────┼─────────┤
│ 1   │ (se+mat12)*rq                  │ 881.223 │
│ 2   │ fix(se, σ)                     │ 413.357 │
│ 3   │ mask(se, [1])                  │ 425.865 │
│ 4   │ mask(se, [1])+mask(rq, [2:10]) │ 724.22  │
│ 5   │ mat12                          │ 415.498 │
│ 6   │ rq                             │ 639.111 │
│ 7   │ se                             │ 440.963 │
│ 8   │ se*mat12                       │ 482.739 │
│ 9   │ se*rq                          │ 761.098 │
│ 10  │ se+mat12                       │ 486.406 │
│ 11  │ se+mat12+rq                    │ 787.688 │
│ 12  │ se+rq                          │ 656.913 │

I would be quite keen to get feedback from anyone who's worked with multi-threading in julia before.

Red-Portal commented 4 years ago

I think there should be a threshold that decides whether to use simd + threads or only simd. I doubt that threads will give good performance even with a small number of observations.

Also, did you consider using tasks? The task scheduling system is not yet in Julia but we could consider if there are some performance benefits there.

maximerischard commented 4 years ago

Hi @Red-Portal, thank you for having a look into this. You're right to point out that there should be a threshold to determine whether to use threads if there is some overhead in doing so. I'll do some more benchmarks to check if it's necessary. The benchmarks above show that threads do make a big difference to performance.

My understanding of tasks (coroutines) is that they solve a different problem: parallelising IO-bound tasks. Here multithreading is used to simultaneously compute multiple elements of the covariance matrix (and gradients). There is no communication or coordination between threads, so the overhead is low.

Red-Portal commented 4 years ago

Hi @maximerischard , yes as you said, tasks are used for IO bound operations. However they are also used to implement fine-grained parallelism. Julia's planned depth-first scheduling is targeted towards parallelism (not concurrency).

The performance improvements are wonderful BTW.

maximerischard commented 4 years ago

I've been reading a bit more about tasks and multithreading in julia. What I've done in this PR is use multithreading for the functions that iterate over the entries of an array. It's straightforwardly parallelisable, with no communication between threads, so low overhead. The overhead does seem to be non-zero, though when I benchmark it it's drowned in noise. Roughly speaking it seems to add 50ms to the benchmarks that take 1-2 seconds.

You had two suggestions @Red-Portal:

  1. use Tasks. I've been reading about tasks, and I don't see an obvious way to deploy them here, or an obvious benefit. You seem to have something in mind that I don't really understand. If you have a bit of time, it would be great if you could take a stab at implementing multithreading using tasks, or explain what you have in mind.
  2. set a threshold for multithreading. I tried this and ran into some type instability that made everything much slower. The overhead of using threads is fairly low, so I'm hoping it won't be necessary. I'd like to find a way to have no overhead if NTHREADS=1, but so far I haven't figured it out.
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 568


Changes Missing Coverage Covered Lines Changed/Added Lines %
src/covariance/covariance.jl 17 23 73.91%
src/covariance/multithreaded.jl 0 72 0.0%
<!-- Total: 17 95 17.89% -->
Files with Coverage Reduction New Missed Lines %
src/kernels/distance.jl 1 87.5%
src/kernels/mat32_ard.jl 1 90.0%
src/kernels/mat32_iso.jl 1 90.0%
src/kernels/mat52_ard.jl 1 90.0%
src/kernels/mat52_iso.jl 1 90.0%
src/kernels/noise.jl 1 92.86%
src/kernels/periodic.jl 1 89.47%
src/kernels/rq_ard.jl 1 94.12%
src/kernels/rq_iso.jl 1 88.24%
src/likelihoods/bernoulli.jl 1 44.44%
<!-- Total: 111 -->
Totals Coverage Status
Change from base Build 563: -8.4%
Covered Lines: 1371
Relevant Lines: 2022

💛 - Coveralls
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 625


Changes Missing Coverage Covered Lines Changed/Added Lines %
src/covariance/covariance.jl 17 23 73.91%
src/covariance/multithreaded.jl 0 72 0.0%
<!-- Total: 17 95 17.89% -->
Files with Coverage Reduction New Missed Lines %
src/kernels/poly.jl 1 80.65%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 617: -3.1%
Covered Lines: 1460
Relevant Lines: 1987

💛 - Coveralls