investigate multiple core usage for the local BPS

alan-turing-institute / PDSampler.jl

Piecewise Deterministic Sampler library (Bouncy particle sampler, Zig Zag sampler, ...)

Other

33 stars 8 forks source link

investigate multiple core usage for the local BPS #15

Closed tlienart closed 7 years ago

tlienart commented 7 years ago

In Julia:

Sys.CPU_CORES

may return something like 4 (on my machine). In which case julia can be run with julia -p 4.

In theory when one factor is updated, all adjacent factors should be updated in that their respective bounce time should be updated. All of those updates are independent and so we should be able to use a pmap to do it.

tlienart commented 7 years ago

In the initialisation, the loop is:

for (fidx, factor) in enumerate(sim.fg.factors)
        vars   = assocvariables(sim.fg, fidx)
        xf, vf = sim.x0[vars], sim.v0[vars]
        g      = factor.gll(vcat(xf...))
        pq     = ls_updatepq!(pq, sim.fg, fidx, xf, vf, g, 0.0)
    end

A very similar loop happens in the body of the main function. All the operations are independent. Essentially:

a. look at current state or nodes around factors b. compute a bouncing time for that factor c. store the bouncing time in PQ

(a&b) should be done in //. The returned collection of times can then be fed in the PQ.

action

write a function that will ultimately replace ls_updatepq! by a similar function that just returns the bouncing time
collects bouncing time and store them in PQ.

tlienart commented 7 years ago

Since Julia takes every processor to potentially be on an independent machine, one also needs to send all of the relevant data to it so that it can do its computation and then send it back. So we have to modify the function a bit more and send all the relevant stuff with it. This will probably make the code somewhat uglier.

tlienart commented 7 years ago

So apart from wasting a lot of time on resources with poor documentation I didn't get very far.

Inherently there's a requirement of moving stuff around to the other physical cores. It's unclear to me whether the speedup of doing things "in parallel" is not completely overwhelmed by having to copy data structures around.

I've tried using a distributedarray with a @parallel for but didn't get anywhere and did not find good examples of similar use online. I'm putting this on the side for now waiting to get feedback from Leonard who has some experience with that kind of stuff.

tlienart commented 7 years ago

Actually there seems to be capacity for multithreading with shared memory using the @threads macro and related Threads etc. Links to look up:

I need to figure out a simple example with a clear speedup first then test it in our environment.

tlienart commented 7 years ago

In bash

export JULIA_NUM_THREADS=4

as per https://docs.julialang.org/en/latest/manual/parallel-computing#Multi-Threading-(Experimental)-1

then in julia

f(x) = sin(x^2)
function trial(N::Int64)
   r = rand(N)
   x = similar(r)

   println("basic")
   @time for k in eachindex(r) 
       x[k] = f(r[k])
   end

#   @time x = f.(r)  # broadcasting
#   @time x .= f.(r)  # in-place broadcasting

   println("threaded")
   y = similar(r)  # have to make a new array here for some reason
   @time Threads.@threads for k in eachindex(r) 
       y[k] = f(r[k])
   end

   return x == y
end

trial(1000)
trial(10^8)
trial(10^8)

Observed result

julia> trial(1000)
basic
  0.000025 seconds
threaded
  0.024379 seconds (6.79 k allocations: 293.707 KB)
true

julia> trial(10^8)
basic
  1.843685 seconds
threaded
  0.788258 seconds (2 allocations: 64 bytes)
true

julia> trial(10^8)
basic
  1.794225 seconds
threaded
  0.760491 seconds (2 allocations: 64 bytes)
true

So a non-negligible 2+ times speedup...

tlienart commented 7 years ago

So I've implemented and tested this in the "first branch" section

function ls_firstbranch!(fg::FactorGraph, fidx::Int, all_evlist::AllEventList,
                         pq::PriorityQueue, t::Float
                         )::Tuple{AllEventList,PriorityQueue}
    # retrieve xf, vf corresponding to factor
    (xf, vf, g, vars) = ls_retrieve(fg, fidx, all_evlist, t, true)
    ls_saveupdate!(all_evlist, vars, xf, vf, t)
    ls_updatepq!(pq, fg, fidx, xf, vf, g, t)
    # same story for linked factors (fp)
    Threads.@threads for fpidx in linkedfactors(fg, fidx)
        # we don't need to retrieve `vars` here
        (xfp, vfp, gp) = ls_retrieve(fg, fpidx, all_evlist, t)
        ls_updatepq!(pq, fg, fpidx, xfp, vfp, gp, t)
    end
    (all_evlist, pq)
end

This "works". It remains to be tested extensively to see whether it brings a speedup.

tlienart commented 7 years ago

randn locks threading, can circumvent with seeded randn
allocations locks threading, harder to circumvent especially when calling a stack of functions

-> making multithreading truly work here would require significant thinking, dropping the issue for now.