tkf opened this issue 4 years ago
@tkf - cool plot. If you don't mind, I'll use a similar concept for the ThreadPools logger in addition to the textual activity log I have now. (If you do mind, no worries.)
IIUC, @spawn schedules the task on a random thread and puts it into that thread's queue (at the C-level) before execution. So if 100 @spawns are called, all 100 tasks are scheduled across threads before any of them has started. They will likely end up evenly spread, but in theory they could all land on a single thread. (Update: the assignment is based on the modulus of the current time, so while the starting thread is random, the tasks should divide evenly as time progresses. This might explain why the jobs are assigned to threads in pairs in your example.)
I'd love for the C-level implementation of @spawn to have the queued behavior we've been discussing. Then you'd have the pre-divided @threads option and the queued @spawn option to choose from. Add an option to prevent operation on the primary thread for either, and we have it all.
As for the pairing, I have no clue. I see a similar pairing on my machine, but I have a far worse grouping for some reason:
If it helps, I rewired your demo to use the activity log, and here are the results from a couple of runs (times are in ns, columns are threads 1:4 from left to right, the number is the index of the active job, and * means more than one job is active)
julia> include("script.jl")
0.000 - - - -
100000.000 1 - - -
200000.000 1 33 17 -
300000.000 2 34 17 -
400000.000 - 34 17 -
500000.000 35 9 18 -
600000.000 36 9 18 -
700000.000 36 10 49 -
800000.000 - 51 49 -
900000.000 5 51 50 -
1000000.000 6 52 50 -
1100000.000 6 52 50 -
1200000.000 41 25 3 -
1300000.000 41 25 3 -
1400000.000 42 26 - -
1500000.000 - 45 4 -
1600000.000 47 45 4 -
1700000.000 47 46 19 29
1800000.000 48 46 19 30
1900000.000 - 57 20 30
2000000.000 13 58 21 59
2100000.000 13 58 21 59
2200000.000 14 37 22 60
2300000.000 - 37 22 31
2400000.000 39 38 15 31
2500000.000 40 - 16 32
2600000.000 40 23 16 32
2700000.000 - 24 43 61
2800000.000 27 24 43 62
2900000.000 28 63 44 62
3000000.000 28 63 53 7
3100000.000 11 64 53 7
3200000.000 11 55 54 8
3300000.000 12 55 54 -
3400000.000 12 56 - -
3500000.000 - 56 - -
3600000.000 - - - -
3700000.000 - - - -
julia> include("script.jl")
0.000 - - - -
100000.000 - 33 49 41
200000.000 - 33 49 41
300000.000 - 33 *53 42
400000.000 - 34 *53 42
500000.000 - 34 *39 61
600000.000 - 59 *40 62
700000.000 - 59 *40 62
800000.000 - 60 *54 45
900000.000 - 60 *54 45
1000000.000 - 60 *57 46
1100000.000 - 35 *58 46
1200000.000 - 35 *58 63
1300000.000 - 35 *50 64
1400000.000 - 36 *50 64
1500000.000 - 36 *37 55
1600000.000 1 47 *38 55
1700000.000 2 47 *38 56
1800000.000 2 48 51 17
1900000.000 - 9 51 17
2000000.000 43 9 52 18
2100000.000 44 10 - 18
2200000.000 44 10 25 27
2300000.000 - 5 26 -
2400000.000 - 6 26 28
2500000.000 13 6 21 3
2600000.000 14 23 21 3
2700000.000 14 23 22 4
2800000.000 - 24 7 4
2900000.000 29 31 7 15
3000000.000 30 31 8 16
3100000.000 30 32 8 16
3200000.000 - 32 11 19
3300000.000 - - 12 19
3400000.000 - - 12 20
3500000.000 - - - -
3600000.000 - - - -
3700000.000 - - - -
@tro3 Please go ahead and use the concept or the code snippet or whatever. Having more visualizations in ThreadPools would be great.
Then you'd have the pre-divided @threads option
Just to be clear, I don't want to do scheduling manually and/or statically. I just want to know the characteristics of the current/future scheduler and whether there is a way to "nudge" the scheduler so that my program works nicely with it. For example, I don't want static scheduling because I'd like to have a threaded reduce that can be nested.
(But I agree that it'd be great to have a customizable scheduler interface, as discussed in RFC: Make an abstraction for custom task parallel schedulers (GSoC proposal) - Internals & Design - JuliaLang.)
If it helps, I rewired your demo to use the activity log,
Interesting. Do you mind sharing the script? Also, does it do I/O while logging? I specifically try not to yield to the scheduler so that it does not affect the task scheduling too much.
So it looks like I can encourage the scheduler to process earlier elements first by scheduling "bottom first":
function tmapreduce(f, op, xs; init, _task = nothing)
    if length(xs) <= 2
        _task === nothing || schedule(_task)
        return mapreduce(f, op, xs; init = init)
    end
    mid = length(xs) ÷ 2
    left = @view xs[1:mid]
    right = @view xs[mid+1:end]
    task = @task begin
        _task === nothing || schedule(_task)
        tmapreduce(f, op, right; init = init)
    end
    task.sticky = false
    return op(tmapreduce(f, op, left; init = init, _task = task), fetch(task))
end
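For reference, a toy call like the following (heavy is a made-up stand-in for an expensive f, not part of the profiling script) exercises it; run with several threads (e.g. JULIA_NUM_THREADS=4) to see the leaves start roughly in index order:

# Toy usage of the "bottom first" tmapreduce above.
function heavy(x)
    a = 1.0
    for _ in 1:10_000
        a = sin(a)
    end
    return x
end

tmapreduce(heavy, +, 1:64; init = 0)  # == sum(1:64) == 2080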
You can see that the bars are aligned somewhat diagonally and also there is not much loss of parallelism (though I'm just eyeballing):
Just to be clear, I don't want to do scheduling manually and/or statically.
Totally get it. You want (I think) a queueing scheduler, but at the C-level. This is how I wish @spawn worked, rather than pre-assigning.
Here is the hacked code using the ThreadPools logger.
import ThreadPools: showactivity, LogItem

function tmapreduce(f, op, xs; init)
    if length(xs) <= 2
        return mapreduce(f, op, xs; init = init)
    end
    mid = length(xs) ÷ 2
    left = @view xs[1:mid]
    right = @view xs[mid+1:end]
    task = Threads.@spawn tmapreduce(f, op, right; init = init)
    return op(tmapreduce(f, op, left; init = init), fetch(task))
end

function tmapreduce_profile(f, op, xs; init)
    starts = resize!(typeof(time_ns())[], length(xs))
    stops = resize!(typeof(time_ns())[], length(xs))
    threadids = zeros(Int, length(xs))
    io = open("tkflog.txt", "w")
    logger = Channel{LogItem}(16 * 1024) do c
        for item in c
            job, tid, st, t = item
            write(io, "$job $tid $st $t\n")
        end
    end
    t0 = time_ns()
    value = tmapreduce(op, 1:length(xs); init = init) do i
        threadids[i] = Threads.threadid()
        put!(logger, (i, threadids[i], 'S', time_ns() - t0))
        y = f(xs[i])
        put!(logger, (i, threadids[i], 'P', time_ns() - t0))
        return y
    end
    close(io)
    return (value = value, starts = starts, stops = stops, threadids = threadids)
end

demo(; workload = 10_000, length = 64) =
    tmapreduce_profile(+, 1:length; init = 0) do x
        a = 1.0
        for _ in 1:workload
            a = sin(a)
        end
        x
    end
demo()
result = demo()
showactivity("tkflog.txt", 100000)
I see that all of the bars occur as two consecutive bars like this
| |
with the same color (= thread).
Actually this has nothing to do with the scheduler; it's due to length(xs) <= 2, i.e., the base case processes two elements.
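For concreteness, a tiny helper (leafranges, made up just for illustration) that mimics only the splitting logic shows that every leaf covers a consecutive pair of indices, which is exactly the pairing seen in the plot:

# Mimic only the splitting logic of tmapreduce to see which index ranges
# the recursion bottoms out on; each leaf is what one task processes.
function leafranges(r)
    length(r) <= 2 && return [r]
    mid = length(r) ÷ 2
    return vcat(leafranges(r[1:mid]), leafranges(r[mid+1:end]))
end

leafranges(1:8)  # => [1:2, 3:4, 5:6, 7:8] -- consecutive pairs, hence the paired bars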
This makes scheduling perfectly "diagonal":
function tmapreduce(f, op, xs; init, _task = nothing)
    if length(xs) <= 2
        _task === nothing || schedule(_task)
        return mapreduce(f, op, xs; init = init)
    end
    mid = length(xs) ÷ 2
    left = @view xs[1:mid]
    right = @view xs[mid+1:end]
    task = @task begin
        if _task === nothing
            waiter = nothing
        else
            waiter = @task schedule(_task)
            waiter.sticky = false
        end
        tmapreduce(f, op, right; init = init, _task = waiter)
    end
    task.sticky = false
    return op(tmapreduce(f, op, left; init = init, _task = task), fetch(task))
end
It seems that the scheduler can still do load balancing with an uneven workload:
The last figure is generated with
@@ -36,7 +36,7 @@ end
 demo(; workload = 10_000, length = 64) =
     tmapreduce_profile(+, 1:length; init = 0) do x
         a = 1.0
-        for _ in 1:workload
+        for _ in 1:rand(1:workload)
             a = sin(a)
         end
         x
@tro3 Thanks for sharing the script. You have code like put!(logger, (i, threadids[i], 'S', time_ns()-t0)). Did you notice if it changes the scheduling? IIUC put! can yield to the scheduler, which may decide to start a new task.
@tkf - very cool results! Yes, I'm sure it can mess with the scheduling, but I am not using @spawn and the jobs I am tracking are not as fast individually as the ones you have here. I'm going to fold in your cool graphical tool - that will give better insight than the textual graph I have now. Then I can try a with/without logging comparison and see the impact. If it does show a big difference, I'll have to move to an in-memory version of the same.
No question about the logger's effect on the scheduler. With the logger, about 25% extra time is added, plus some clear noise that varies from run to run. Here is the same script without the logger, then with:
I'll have to look at an in-memory implementation, if possible. Otherwise I'll just have to live with the overhead; since it is just for tuning anyway, comparisons between strategies should still be believable.
but I am not using @spawn and the jobs I am tracking
Ah, I see. I forgot that you are controlling the scheduling, so there is less to worry about.
With the logger, there is a 25% extra time added, and some clear noise that varies from run to run.
Thanks for trying it out with and without the logger. Nice to see that it skews the result (nice, in the sense that my caution was meaningful).
I'll have to look at an in-memory implementation, if possible.
I guess you can do something like
struct Record
    time::UInt
    ...  # things you want to record
end

const RECORDS = Vector{Record}[]

function __init__()
    Threads.resize_nthreads!(RECORDS, Record[])
end

function pushrecord(r::Record)
    push!(RECORDS[Threads.threadid()], r)
end
? Although you can't look at the log records online (i.e., while the run is in progress). I guess sizehint!ing the Vector{Record}s can also be useful.
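A possible usage pattern for that sketch (hypothetical; it assumes Record ends up with just the time field and that __init__ has run, e.g. because the code lives in a package):

# Reserve capacity up front so recording doesn't allocate in the hot path,
# record from any thread without locking, then merge after the run.
foreach(v -> sizehint!(v, 10_000), RECORDS)

pushrecord(Record(time_ns()))        # assumes Record has only the `time` field here

all_records = reduce(vcat, RECORDS)  # offline analysis once the run is over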
BTW, making the waiter task sticky like this
@@ -11,7 +11,6 @@ function tmapreduce(f, op, xs; init, _task = nothing)
             waiter = nothing
         else
             waiter = @task schedule(_task)
-            waiter.sticky = false
         end
         tmapreduce(f, op, right; init = init, _task = waiter)
     end
"breaks" the parallelism; i.e., there only two parallel tasks even though I have four threads:
Is it expected? As waiter
is not doing any computation, I thought sticky task was enough.
@tkf thanks for the in-memory tips. I put a Vector{LogItem} in the pool and it seems to work. I can just write to the file on closing the pool. I need to check the overhead, but it can’t be as bad as the Channel.
On sticky, I have no clue. It is interesting, though, that you only have two "blue" tasks. I bet those are the primary thread. What I can't explain is why all three background threads are running after that, but only two at a time.
Yeah, the blue one is the primary thread because the first items are always processed by the primary thread. It'd be nice if the core devs could demystify the two-at-a-time scheduling (and why it has to be non-sticky).
Actually, I don't know why I put the waiter thing there. This is enough:
function tmapreduce(f, op, xs; init, _task = nothing)
    if length(xs) <= 2
        _task === nothing || schedule(_task)
        return mapreduce(f, op, xs; init = init)
    end
    mid = length(xs) ÷ 2
    left = @view xs[1:mid]
    right = @view xs[mid+1:end]
    task = @task begin
        tmapreduce(f, op, right; init = init, _task = _task)
    end
    task.sticky = false
    return op(tmapreduce(f, op, left; init = init, _task = task), fetch(task))
end
I implemented this trick in Transducers.jl. Indeed, it makes early termination with reduce much faster. Quoting https://github.com/tkf/Transducers.jl/pull/183#issuecomment-577040873:
| ID | time ratio | memory ratio |
|----|------------|--------------|
| ... | ... | ... |
| ["findfirst", "n=1000", "reduce", "basesize=128"] | 0.48 (5%) :white_check_mark: | 0.45 (1%) :white_check_mark: |
| ["findfirst", "n=1000", "reduce", "basesize=256"] | 0.70 (5%) :white_check_mark: | 0.57 (1%) :white_check_mark: |
| ["findfirst", "n=1000", "reduce", "basesize=512"] | 0.45 (5%) :white_check_mark: | 0.37 (1%) :white_check_mark: |
| ["findfirst", "n=400", "reduce", "basesize=128"] | 0.96 (5%) | 0.93 (1%) :white_check_mark: |
| ["findfirst", "n=400", "reduce", "basesize=256"] | 0.92 (5%) :white_check_mark: | 0.86 (1%) :white_check_mark: |
| ["findfirst", "n=400", "reduce", "basesize=512"] | 0.85 (5%) :white_check_mark: | 0.70 (1%) :white_check_mark: |
| ["findfirst", "n=500", "reduce", "basesize=128"] | 0.58 (5%) :white_check_mark: | 0.41 (1%) :white_check_mark: |
| ["findfirst", "n=500", "reduce", "basesize=256"] | 0.22 (5%) :white_check_mark: | 0.17 (1%) :white_check_mark: |
| ["findfirst", "n=500", "reduce", "basesize=512"] | 0.32 (5%) :white_check_mark: | 0.22 (1%) :white_check_mark: |
| ... | ... | ... |
@tkf - interestingly enough, if you pull sticky=false from that last solution, it never leaves the primary thread. It seems like a sticky task stays on the same thread. In your earlier version, you made waiter sticky, but the main task was not, so you got two threads. (Each thread tried to keep one job local and spawn the other elsewhere, but a sticky task can't go elsewhere, so it just queued up behind the original.) If both waiter and the main task were sticky, it would stay on the primary thread. Just a theory.
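That theory is easy to poke at with a tiny experiment (a rough sketch, not from either script; results vary with machine and thread count):

# Sketch: a plain @task is sticky by default and runs on the thread that
# schedules it; clearing .sticky hands it to the multi-threaded scheduler.
sticky_task = @task Threads.threadid()
free_task = @task Threads.threadid()
free_task.sticky = false

schedule(sticky_task)
schedule(free_task)

fetch(sticky_task)  # same id as the scheduling thread (typically thread 1)
fetch(free_task)    # may be any thread id, chosen by the scheduler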
I tried to analyze this super simple threaded map-reduce (see below for the full code).
When I run this function with an expensive f and four threads (my laptop has 4 physical cores), I see this profile. The vertical bars represent the interval spent executing f, the x axis is the array index, and the colors are Threads.threadid().
I see that all of the bars occur as two consecutive bars with the same color (= thread). Is this because the task is popped off from the head of the local task deque, or something? However, other than this property, I don't see any structure in the way the tasks are scheduled. (It's just an artifact: https://github.com/JuliaLang/julia/issues/34470#issuecomment-576904686)
Is this expected behavior? From the description of "depth-first scheduling", I was (maybe too naively) expecting something like this
This behavior would be very useful for implementing parallel findfirst and the like, if it is possible. I think it also helps cache locality for reading the data (though it could be bad for writing due to false sharing). I can try to encourage (or even enforce) scheduling like this by using Event to order the scheduling, but I wondered if improvements in this respect are planned for Julia's scheduler.
I'm using Julia 1.5.0-DEV.67. Julia 1.3.1 produces similar pictures.
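For illustration, one rough way the Event idea could look (ordered_spawn is a made-up helper, not part of the full code below; it only orders the task starts, not their completions):

# Each chunk waits on its own Event; a chunk releases the next chunk's Event
# as soon as it starts, so tasks begin roughly in index order while still
# running in parallel.
function ordered_spawn(f, chunks)
    events = [Base.Event() for _ in chunks]
    notify(events[1])                  # the first chunk may start immediately
    tasks = map(eachindex(chunks)) do i
        Threads.@spawn begin
            wait(events[i])            # hold this chunk until the previous one has started
            i < length(events) && notify(events[i + 1])
            f(chunks[i])
        end
    end
    return fetch.(tasks)
end

ordered_spawn(sum, [1:10, 11:20, 21:30, 31:40])  # => [55, 155, 255, 355]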
Full code