JuliaSIMD / Polyester.jl

The cheapest threads you can find!
MIT License
240 stars 13 forks source link

Workaround for `@cfunction` closures on unsupported platforms #131

Open Keluaa opened 7 months ago

Keluaa commented 7 months ago

Fixes #88 by storing the BatchClosure object in the thread local storage of each task.

Architectures which do not support closures are detected in the same way as in this line, which should cover all affected platforms.

Tested on Nvidia Grace, not on Mac. Comparing this method with the @cfunction closure on x86, the overhead seems to be almost the same, but with additional allocations when using threadlocal.

Keluaa commented 7 months ago

I might have PR'ed too soon, as I see some #undefs with when using threadlocal for threads when using more than 65 threads, here on Nvidia Grace, with julia -t 72:

julia> using Polyester

julia> let
           @batch threadlocal=rand(10:99) for i in 0:Threads.nthreads()
               threadlocal += 1
           end
           findall(i -> !isassigned(threadlocal, i), 1:length(threadlocal))
       end
7-element Vector{Int64}:
 65
 66
 67
 68
 69
 70
 71

All threads after the 64-th are #undef, appart from the very last one. Always happens with more than 65 threads.

Keluaa commented 7 months ago

Well... I might have forgotten to commit a single line from my PR a year ago, sorry about that. Somehow I didn't stumble upon this sooner.

Now everything is working fine on my side. CI is failing but apparently it is only because of Aqua.jl, all multithreading tests are passing.

Keluaa commented 7 months ago

I found that while this works, it does allocate a bit more than what I would like (which is zero), and it seems to be proportional to the number of threads, which is too detrimental for performance. More info tomorrow.

Keluaa commented 7 months ago

Polyester is supposed to try and avoid closures closing over objects by identifying the objects that would be captured, and then passing them as function arguments instead.

But currently users are oblivious of this optimization on x86, even when it fails and falls back to a closure instead, as performance is only slightly altered. I think that having a working fallback on all platforms might be preferable over the current unhelpful error cfunction: closures are not supported on this platform. Maybe adding a warning would help? Something such as some captured variables couldn't be turned into function arguments, consider changing how those variables are used by the @batch loop?

I found that while this works, it does allocate a bit more than what I would like

It seems that the allocation issue is unrelated to my changes, and only happen on ARM when using --check-bounds=no, because of a type-inference issue (#132).