carstenbauer opened this issue 10 months ago
> Overall, IIUC, Polyester really just creates, manages, and re-uses a bunch of sticky tasks and just runs them on existing Julia threads (similar to `@threads :static`).

`@threads :static` does not reuse tasks, which is much of why Polyester should have lower overhead. I don't think the README examples have been rerun in a long time. It might be that they haven't been re-run since before the `:static` to `:dynamic` switch.
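For concreteness, the two forms under comparison look like this. A minimal sketch, assuming Polyester.jl is installed and Julia was started with multiple threads (e.g. `julia -t 4`); the output order will vary:

```julia
using Polyester        # provides @batch
using Base.Threads     # provides @threads, threadid

# Both loops partition iterations statically across the existing Julia
# threads; @batch additionally reuses its sticky tasks across calls,
# which is where its lower per-call overhead comes from.
@threads :static for i in 1:8
    println("@threads :static, iteration $i on thread $(threadid())")
end

@batch for i in 1:8
    println("@batch, iteration $i on thread $(threadid())")
end
```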
> If you're open to it, I might make a suggestion for an improved README.md in a PR later this week. Let me know what you think!
Certainly!
"3." is also definitely a major pain point.
> BTW, apart from the `stride` option, Polyester.jl doesn't implement any form of load-balancing, right?
It does not. Static scheduling is much easier to implement. ;) Perhaps it could be fun to try implementing something like a work stealing scheduler at some point, but at that point, it may also be worth looking at Base Julia's scheduler.
> `@threads :static` does not reuse tasks, which is much of why Polyester should have lower overhead.
I know, but that's not clear from the README :)
> I don't think the README examples have been rerun in a long time. It might be that they haven't been re-run since before the `:static` to `:dynamic` switch.
Makes sense, I will rerun the benchmarks when I update the README.
> It does not. Static scheduling is much easier to implement. ;) Perhaps it could be fun to try implementing something like a work stealing scheduler at some point, but at that point, it may also be worth looking at Base Julia's scheduler.
Yes, a work stealing scheduler would be nice. But I agree, it would be best if we could just have one smart scheduler instead of several (FWIW, Dagger.jl also implements a scheduler, etc.)
I'm rewriting the README.md and have a question about the following example
```julia
julia> let
           @batch threadlocal=rand(10:99) for i in 0:9
               println("index $i, thread $(Threads.threadid()), local storage $threadlocal")
               threadlocal += 1
           end
           println(threadlocal)
       end
index 8, thread 1, local storage 30
index 3, thread 3, local storage 49
index 9, thread 1, local storage 31
index 6, thread 4, local storage 33
index 0, thread 2, local storage 14
index 4, thread 3, local storage 50
index 7, thread 4, local storage 34
index 1, thread 2, local storage 15
index 5, thread 3, local storage 51
index 2, thread 2, local storage 16
Any[32, 17, 52, 35]
```
The storage doesn't seem to be thread-local: Julia thread 1 appears twice, once with local storage 30, another time with local storage 31. It also makes sense to me that it isn't thread-local. I would expect it to be "Polyester task"-local. Is that what's happening here? If that's the case, I think `threadlocal` is a big misnomer.
BTW, is there a function like `Polyester.taskid()` that, similar to `threadid()` for Julia threads, returns a unique ID of a Polyester task? If not, would it make sense to add this?
Eh, I suspect it's a bit broken. Performance also isn't great. It would be great if I (or someone else) could get around to reimplementing it.
I described the approach I'd take a few times, which is to actually make it Polyester-task local, which is already how LoopVectorization.jl does it. This lets `@tturbo` multithread accumulations like sums/dot products without allocations.
`threadlocal` in Polyester is implemented differently.
I'd have to check the source to see what it's actually doing (I should remember because I reviewed/approved the PR...alas), but I'd just discourage its use until it gets fixed. Performance won't be good.
I'm not sure what the best API to actually expose to users is, either. E.g., we could go with something like letting users provide a function `f`, which we then call `map(f, tasklocal_iterator)` or `reduce(f, tasklocal_iterator)` with, to avoid exposing the user to this directly.
That'd make it easier for us to safely avoid allocations.
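To make the idea concrete: a rough sketch of what such a map/reduce-style use could enable. Note that `tasklocal_iterator` and any `map`/`reduce` interface are only hypothetical here; this sketch just emulates per-task partial results by chunking the range manually with `@batch`:

```julia
using Polyester        # provides @batch
using Base.Threads     # provides nthreads

# Hypothetical sketch: sum `xs` with one accumulator per chunk, then
# reduce the partial results serially. A real map/reduce API in
# Polyester could hide this boilerplate from users.
function chunked_sum(xs)
    n = min(nthreads(), length(xs))
    partials = zeros(eltype(xs), n)        # one slot per chunk/task
    @batch for t in 1:n
        lo = div((t - 1) * length(xs), n) + 1
        hi = div(t * length(xs), n)
        acc = zero(eltype(xs))
        for i in lo:hi
            acc += xs[i]                   # task-local accumulation
        end
        partials[t] = acc
    end
    sum(partials)                          # final serial reduction
end

chunked_sum(collect(1.0:100.0))            # 5050.0
```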
@carstenbauer, I was just trying to make sense of that `threadlocal` example and I agree it’s a bit confusing, but it seems to be consistent: the thread-local storage is initialized to a random value for each thread (here 30, 14, 49, and 33) and each iteration prints & increments the local value, so thread 1 prints 30 & 31, thread 2 prints 14, 15 & 16, and so on.
Hi @chriselrod,
I think that the current README.md doesn't do a particularly good job at explaining how Polyester.jl works with respect to three (crucial) points:

1. Polyester doesn't change the number of threads; it just uses the already existing Julia threads.
2. Polyester doesn't pin threads to CPU-cores/threads.
3. `@batch` schedules the workload statically, so it should be compared to `@threads :static` rather than the default `@threads :dynamic`.
I think that sentences like the following are insufficient and potentially misleading:
It suggests that Polyester potentially changes the number of threads (1) and also pins them to CPU-cores/threads (2). Of course, neither is the case. To be fair, that Polyester doesn't do (2) is mentioned in the README.md but only in a comment in a code example, which can easily be overlooked. The fact that Polyester just uses the already existing Julia threads (1) is also mentioned in the README (at a more prominent place) but is potentially misleading in the context of everything else that follows.
As for point (3) above, the README only compares to `@threads`, which (as of now) defaults to `:dynamic` scheduling. This seems somewhat strange/misleading because (IIUC) `@batch` creates sticky tasks and schedules the workload statically. Hence, it should really be compared to `@threads :static`. Of course, one can also compare to `@threads :dynamic`, but the conceptual differences should be highlighted then IMHO.

Overall, IIUC, Polyester really just creates, manages, and re-uses a bunch of sticky tasks and just runs them on existing Julia threads (similar to `@threads :static`). However, IMHO, this isn't clear from the README.md, which doesn't contain the words "sticky", "task(s)" or, most importantly, "static" at all.

If you're open to it, I might make a suggestion for an improved README.md in a PR later this week. Let me know what you think!
(BTW, apart from the `stride` option, Polyester.jl doesn't implement any form of load-balancing, right?)
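A small sketch of the comparison argued for above, assuming a multi-threaded Julia session with Polyester.jl installed: both `@batch` and `@threads :static` assign each iteration to a fixed thread up front, unlike the default `@threads :dynamic`:

```julia
using Polyester
using Base.Threads

# With :static scheduling, iteration i is bound to one thread for the
# whole loop; @batch behaves analogously via its sticky tasks.
a = zeros(Int, 2 * nthreads())
@threads :static for i in eachindex(a)
    a[i] = threadid()
end

b = zeros(Int, 2 * nthreads())
@batch for i in eachindex(b)
    b[i] = threadid()
end

@show a b   # records which thread ran each iteration; layout depends on nthreads()
```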