JuliaSIMD / Polyester.jl


Unclear/misleading README.md #124

Open carstenbauer opened 10 months ago

carstenbauer commented 10 months ago

Hi @chriselrod,

I think that the current README.md doesn't do a particularly good job at explaining how Polyester.jl works with respect to three (crucial) points:

  1. How many "Polyester threads" are there in total (and does it vary dynamically)?
  2. Are "Polyester threads" pinned to CPU-cores/CPU-threads?
  3. How do "Polyester threads" relate to / cooperate with / interfere with Julia's task-based multithreading?

I think that sentences like the following are insufficient and potentially misleading:

> Note that @batch defaults to using up to one thread per physical core, instead of 1 thread per CPU thread.

It suggests that Polyester potentially changes the number of threads (1) and also pins them to CPU-cores/threads (2). Of course, neither is the case. To be fair, the README.md does mention that Polyester doesn't do (2), but only in a comment inside a code example, which is easily overlooked. That Polyester just uses the already existing Julia threads (1) is also mentioned in the README (in a more prominent place) but is potentially misleading in the context of everything else that follows.
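To check both points empirically, here's a minimal sketch (assuming a session started with several threads, e.g. julia -t 4; ThreadPinning.jl appears only as an example of an explicit pinning tool and is not part of Polyester):

using Polyester
using Base.Threads: nthreads, threadid

println(nthreads())  # fixed at startup via `julia -t N`; Polyester cannot exceed it

ids = zeros(Int, 32)
@batch for i in eachindex(ids)
    ids[i] = threadid()  # only IDs in 1:nthreads() can ever appear here
end
println(sort!(unique(ids)))  # a subset of 1:nthreads(); no new threads spawned

# Pinning is a separate, explicit step (e.g. ThreadPinning.pinthreads(:cores));
# Polyester itself never pins Julia threads to CPU-cores/CPU-threads.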

As for point (3) above, the README only compares to @threads, which (as of now) defaults to :dynamic scheduling. That comparison seems somewhat misleading because (IIUC) @batch creates sticky tasks and schedules the workload statically, so it should really be compared to @threads :static. Of course, one can also compare to @threads :dynamic, but then the conceptual differences should be highlighted, IMHO.

Overall, IIUC, Polyester creates, manages, and re-uses a pool of sticky tasks and runs them on the existing Julia threads (similar to @threads :static). However, IMHO, this isn't clear from the README.md, which doesn't contain the words "sticky", "task(s)", or, most importantly, "static" at all.
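To make the suggested comparison concrete, here is a minimal side-by-side sketch (assuming a multi-threaded Julia session) of the two constructs I think the README should compare:

using Polyester
using Base.Threads: @threads, threadid

ids = zeros(Int, 8)

# Polyester: statically chunked iterations executed by re-used sticky tasks
@batch for i in eachindex(ids)
    ids[i] = threadid()
end
println(ids)

# Base Julia: also statically chunked and sticky, but with fresh tasks per call
@threads :static for i in eachindex(ids)
    ids[i] = threadid()
end
println(ids)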

If you're open to it, I might make a suggestion for an improved README.md in a PR later this week. Let me know what you think!

(BTW, apart from the stride option, Polyester.jl doesn't implement any form of load-balancing, right?)
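For context, my mental model of stride (and I may well be wrong about the exact semantics): instead of one contiguous chunk per task, the range is dealt out in interleaved slices, which crudely balances loops whose cost varies smoothly with the index. Purely as an illustration, not Polyester's actual code:

# Hypothetical helper, illustration only: task `tid` of `nt` takes every nt-th index.
strided_slice(tid, nt, N) = tid:nt:N

strided_slice(1, 4, 10)  # indices 1, 5, 9
strided_slice(2, 4, 10)  # indices 2, 6, 10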

chriselrod commented 10 months ago

> Overall, IIUC, Polyester creates, manages, and re-uses a pool of sticky tasks and runs them on the existing Julia threads (similar to @threads :static).

@threads :static does not reuse tasks, which is much of why Polyester should have lower overhead. I don't think the README examples have been rerun in a long time.

It might be that they haven't been re-run since before the :static to :dynamic switch.
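Rerunning should be easy; a sketch of the kind of microbenchmark that shows the task-reuse difference (assuming BenchmarkTools and a multi-threaded session; exact numbers will vary by machine and thread count):

using Polyester, BenchmarkTools
using Base.Threads: @threads

axpy_batch!(y, a, x) = @batch for i in eachindex(y, x)
    y[i] += a * x[i]
end

axpy_static!(y, a, x) = @threads :static for i in eachindex(y, x)
    y[i] += a * x[i]
end

x = rand(1_000); y = rand(1_000);
@btime axpy_batch!($y, 2.0, $x)   # re-used sticky tasks: low per-call overhead
@btime axpy_static!($y, 2.0, $x)  # fresh tasks every call: spawn cost dominates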

> If you're open to it, I might make a suggestion for an improved README.md in a PR later this week. Let me know what you think!

Certainly!

"3." is also definitely a major pain point.

> BTW, apart from the stride option, Polyester.jl doesn't implement any form of load-balancing, right?

It does not. Static scheduling is much easier to implement. ;) It could be fun to try implementing something like a work-stealing scheduler at some point, but then it may also be worth just looking at Base Julia's scheduler.
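For concreteness, "static scheduling" here just means splitting the iteration space once, up front, with no coordination afterwards. A minimal sketch of such a partition (not Polyester's actual code):

# Split 1:N into one contiguous chunk per worker, handing the
# remainder to the first `rem` workers.
function static_partition(N, nworkers)
    chunk, rem = divrem(N, nworkers)
    ranges = UnitRange{Int}[]
    start = 1
    for w in 1:nworkers
        len = chunk + (w <= rem ? 1 : 0)
        push!(ranges, start:start+len-1)
        start += len
    end
    return ranges
end

static_partition(10, 4)  # [1:3, 4:6, 7:8, 9:10]

A work-stealing scheduler, by contrast, has to maintain per-worker queues and coordinate concurrent steals, which is where the extra complexity comes from.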

carstenbauer commented 10 months ago

> @threads :static does not reuse tasks, which is much of why Polyester should have lower overhead.

I know, but that's not clear from the README :)

> I don't think the README examples have been rerun in a long time.
>
> It might be that they haven't been re-run since before the :static to :dynamic switch.

Makes sense, I will rerun the benchmarks when I update the README.

> It does not. Static scheduling is much easier to implement. ;) It could be fun to try implementing something like a work-stealing scheduler at some point, but then it may also be worth just looking at Base Julia's scheduler.

Yes, a work-stealing scheduler would be nice. But I agree, ideally we'd have one smart scheduler instead of several (FWIW, Dagger.jl also implements its own scheduler, etc.).

carstenbauer commented 10 months ago

I'm rewriting the README.md and have a question about the following example:

julia> let
           @batch threadlocal=rand(10:99) for i in 0:9
               println("index $i, thread $(Threads.threadid()), local storage $threadlocal")
               threadlocal += 1
           end
           println(threadlocal)
       end

index 8, thread 1, local storage 30
index 3, thread 3, local storage 49
index 9, thread 1, local storage 31
index 6, thread 4, local storage 33
index 0, thread 2, local storage 14
index 4, thread 3, local storage 50
index 7, thread 4, local storage 34
index 1, thread 2, local storage 15
index 5, thread 3, local storage 51
index 2, thread 2, local storage 16
Any[32, 17, 52, 35]

The storage doesn't seem to be thread-local: Julia thread 1 appears twice, once with local storage 30, another time with local storage 31. It also makes sense to me that it isn't thread-local; I would expect it to be "Polyester task"-local. Is that what's happening here? If that's the case, I think threadlocal is a big misnomer.
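In case it helps others in the meantime, the workaround I'd reach for is a manually pre-allocated per-thread buffer, sketched below; to be clear, this is not an official Polyester API. Indexing by threadid() should be safe under @batch precisely because its tasks are sticky:

using Polyester
using Base.Threads: nthreads, threadid

function sum_perthread(x)
    acc = zeros(eltype(x), nthreads())  # one slot per Julia thread
    @batch for i in eachindex(x)
        acc[threadid()] += x[i]  # sticky tasks => threadid() is stable per iteration
    end
    return sum(acc)
end

sum_perthread(rand(1_000))  # ≈ sum(x), up to floating-point reordering

Note that adjacent acc slots share a cache line (false sharing), so this is a correctness sketch, not a performance recommendation.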

BTW, is there a function like Polyester.taskid() that, similar to threadid() for Julia threads, returns a unique ID of a Polyester task? If not, would it make sense to add one?

chriselrod commented 10 months ago

Eh, I suspect it's a bit broken. Performance also isn't great.

It would be great if I (or someone else) could get around to reimplementing it. I've described the approach I'd take a few times: make the storage actually Polyester-task-local, which is already how LoopVectorization.jl does it. That's what lets @tturbo multithread accumulations like sums/dot products without allocations.

threadlocal in Polyester is implemented differently. I'd have to check the source to see what it's actually doing (I should remember because I reviewed/approved the PR...alas), but I'd just discourage its use until it gets fixed. Performance won't be good.

I'm not sure what the best API to expose to users is, either. E.g., we could let users provide a function f that we then call as map(f, tasklocal_iterator) or reduce(f, tasklocal_iterator), to avoid exposing users to the storage directly. That would make it easier for us to safely avoid allocations.
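Something in the spirit of the following, say. This is entirely hypothetical (batch_mapreduce and its arguments are made-up names; nothing like it exists in Polyester today), but it shows how the task-local storage could stay an internal detail:

using Polyester
using Base.Threads: nthreads, threadid

# Hypothetical sketch: the user supplies `f`, `op`, and `init`; the
# per-worker accumulators never leak out to user code.
function batch_mapreduce(f, op, init, xs)
    acc = [init for _ in 1:nthreads()]
    @batch for i in eachindex(xs)
        tid = threadid()
        acc[tid] = op(acc[tid], f(xs[i]))
    end
    return reduce(op, acc)
end

batch_mapreduce(abs2, +, 0.0, rand(1_000))  # ≈ sum of squares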

mfsch commented 5 months ago

@carstenbauer, I was just trying to make sense of that threadlocal example and I agree it’s a bit confusing, but it seems to be consistent: The thread-local storage is initialized to a random value for each thread (here 30, 14, 49, and 33) and each iteration prints & increments the local value, so thread 1 prints 30 & 31, thread 2 prints 14, 15 & 16, and so on.
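Grouping the output above by thread makes the bookkeeping explicit; the final Any[32, 17, 52, 35] is just each thread's last printed value plus one increment:

thread 1 (indices 8, 9):    prints 30, 31     → final 32
thread 2 (indices 0, 1, 2): prints 14, 15, 16 → final 17
thread 3 (indices 3, 4, 5): prints 49, 50, 51 → final 52
thread 4 (indices 6, 7):    prints 33, 34     → final 35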