In the initialisation, the loop is:

```julia
for (fidx, factor) in enumerate(sim.fg.factors)
    vars   = assocvariables(sim.fg, fidx)
    xf, vf = sim.x0[vars], sim.v0[vars]
    g      = factor.gll(vcat(xf...))
    pq     = ls_updatepq!(pq, sim.fg, fidx, xf, vf, g, 0.0)
end
```
A very similar loop appears in the body of the main function. All the operations are independent. Essentially:

a. look at the current state of the nodes around a factor
b. compute a bouncing time for that factor
c. store the bouncing time in the PQ

Steps (a) and (b) should be done in parallel. The returned collection of times can then be fed into the PQ.
Action: replace `ls_updatepq!` by a similar function that just returns the bouncing time (see the sketch below).

Since Julia treats every processor as potentially being on an independent machine, one also needs to send all of the relevant data to it so that it can do its computation and then send the result back. So we have to modify the function a bit more and ship all the relevant data along with it. This will probably make the code somewhat uglier.
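As a rough illustration, here is a minimal sketch of that refactor applied to the initialisation loop. The helper `ls_bouncetime` is a hypothetical name for a side-effect-free counterpart of `ls_updatepq!`, and I assume the queue supports `setindex!`-style insertion (as `DataStructures.PriorityQueue` does):

```julia
# Hypothetical sketch: compute all initial bouncing times first, then feed
# them into the PQ serially. `ls_bouncetime` is a made-up name for a variant
# of `ls_updatepq!` that returns the time instead of mutating the queue.
times = map(1:length(sim.fg.factors)) do fidx
    vars   = assocvariables(sim.fg, fidx)
    xf, vf = sim.x0[vars], sim.v0[vars]
    g      = sim.fg.factors[fidx].gll(vcat(xf...))
    ls_bouncetime(sim.fg, fidx, xf, vf, g, 0.0)
end
# feed the returned collection of times into the PQ
for (fidx, tau) in enumerate(times)
    pq[fidx] = tau
end
```

Swapping `map` for `pmap` is where the ugliness comes in: `sim` and the helper functions then have to be made available on every worker (e.g. with `@everywhere`).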
So, apart from wasting a lot of time on poorly documented resources, I didn't get very far.
Inherently, data has to be moved around to the other physical cores. It's unclear to me whether the speedup from doing things "in parallel" isn't completely overwhelmed by having to copy data structures around.
I've tried using a `DistributedArray` with an `@parallel for` (the `@parallel` part is sketched below) but didn't get anywhere, and I did not find good examples of similar use online. I'm putting this aside for now while waiting for feedback from Leonard, who has some experience with that kind of stuff.
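For reference, the kind of pattern I was attempting looks roughly like this; a sketch only, using the pre-0.7 `@parallel`-with-reducer syntax from Base, with a placeholder `bouncetime` and leaving the `DistributedArray` part out:

```julia
addprocs(3)  # 3 worker processes next to the master

@everywhere bouncetime(fidx) = fidx + rand()  # placeholder for the real computation

# `@parallel` with a reducer (`vcat`) gathers the per-iteration results back
# on the master, which could then feed them into the PQ.
times = @parallel (vcat) for fidx in 1:1000
    bouncetime(fidx)
end
```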
Actually, there seems to be support for multithreading with shared memory using the `Threads.@threads` macro and related `Threads` functionality. Links to look up:
I need to figure out a simple example with a clear speedup first then test it in our environment.
In bash:

```bash
export JULIA_NUM_THREADS=4
```

as per https://docs.julialang.org/en/latest/manual/parallel-computing#Multi-Threading-(Experimental)-1, then in Julia:
```julia
f(x) = sin(x^2)

function trial(N::Int64)
    r = rand(N)
    x = similar(r)
    println("basic")
    @time for k in eachindex(r)
        x[k] = f(r[k])
    end
    # @time x = f.(r)  # broadcasting
    # @time x .= f.(r) # in-place broadcasting
    println("threaded")
    y = similar(r) # have to make a new array here for some reason
    @time Threads.@threads for k in eachindex(r)
        y[k] = f(r[k])
    end
    return x == y
end

trial(1000)
trial(10^8)
trial(10^8)
```
Observed results:

```
julia> trial(1000)
basic
  0.000025 seconds
threaded
  0.024379 seconds (6.79 k allocations: 293.707 KB)
true

julia> trial(10^8)
basic
  1.843685 seconds
threaded
  0.788258 seconds (2 allocations: 64 bytes)
true

julia> trial(10^8)
basic
  1.794225 seconds
threaded
  0.760491 seconds (2 allocations: 64 bytes)
true
```
So a non-negligible 2+ times speedup...
So I've implemented and tested this in the "first branch" section:

```julia
function ls_firstbranch!(fg::FactorGraph, fidx::Int, all_evlist::AllEventList,
                         pq::PriorityQueue, t::Float
                         )::Tuple{AllEventList,PriorityQueue}
    # retrieve xf, vf corresponding to factor
    (xf, vf, g, vars) = ls_retrieve(fg, fidx, all_evlist, t, true)
    ls_saveupdate!(all_evlist, vars, xf, vf, t)
    ls_updatepq!(pq, fg, fidx, xf, vf, g, t)
    # same story for linked factors (fp)
    Threads.@threads for fpidx in linkedfactors(fg, fidx)
        # we don't need to retrieve `vars` here
        (xfp, vfp, gp) = ls_retrieve(fg, fpidx, all_evlist, t)
        ls_updatepq!(pq, fg, fpidx, xfp, vfp, gp, t)
    end
    (all_evlist, pq)
end
```
This "works". It remains to be tested extensively to see whether it brings a speedup.
-> making multithreading truly work here would require significant thinking, dropping the issue for now.
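One direction that might make the loop safe, sketched only and untested: have the threaded loop fill a preallocated array of times (assuming `ls_retrieve` is read-only) and update the shared queue serially afterwards, reusing the hypothetical `ls_bouncetime` from above.

```julia
# Sketch: no shared mutation inside the threaded loop; each thread writes
# only to its own slots of `times`. The queue is then updated serially.
fpidxs = linkedfactors(fg, fidx)
times  = Vector{Float}(length(fpidxs))  # uninitialised, 0.5/0.6-style
Threads.@threads for i in eachindex(fpidxs)
    (xfp, vfp, gp) = ls_retrieve(fg, fpidxs[i], all_evlist, t)
    times[i] = ls_bouncetime(fg, fpidxs[i], xfp, vfp, gp, t)
end
for i in eachindex(fpidxs)
    pq[fpidxs[i]] = times[i]  # serial, race-free queue updates
end
```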
In Julia, querying the number of cores (e.g. with `Sys.CPU_CORES`) may return something like `4` (on my machine), in which case Julia can be run with `julia -p 4`.

In theory, when one factor is updated, all adjacent factors should be updated as well, in the sense that their respective bounce times must be recomputed. All of those updates are independent, so we should be able to use a `pmap` to do it.
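A minimal sketch of that idea, again with the hypothetical `ls_bouncetime`; `pmap` distributes the work over the `-p` workers and returns the collected times to the master (all of `fg`, `all_evlist`, and the helpers would have to be available on the workers):

```julia
# Sketch: recompute the bounce times of all factors adjacent to `fidx`
# on the workers, then update the queue serially on the master.
linked = linkedfactors(fg, fidx)
times  = pmap(linked) do fpidx
    (xfp, vfp, gp) = ls_retrieve(fg, fpidx, all_evlist, t)
    ls_bouncetime(fg, fpidx, xfp, vfp, gp, t)
end
for (fpidx, tau) in zip(linked, times)
    pq[fpidx] = tau
end
```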