carstenbauer / ThreadPinning.jl

Readily pin Julia threads to CPU-threads
https://carstenbauer.github.io/ThreadPinning.jl/
MIT License

OpenBLAS thread pinning re-assigns one Julia thread to wrong CPU thread #105

Open oschulz opened 1 month ago

oschulz commented 1 month ago

With JULIA_NUM_THREADS=6 and OPENBLAS_NUM_THREADS=6 and

julia> using ThreadPinning
julia> using ThreadPinning: cpuids

julia> threadinfo()
Hostname:       ...
CPU(s):         1 x 13th Gen Intel(R) Core(TM) i9-13900H
CPU target:     goldmont
Cores:          14 (20 CPU-threads due to 2-way SMT)
Core kinds:     8 "efficiency cores", 6 "performance cores".
NUMA domains:   1 (14 cores each)

Julia threads:  6

CPU socket 1
  0,1, 2,3, 4,5, 6,7, 8,9, 10,11, 12, 13, 14, 15, 
  16, 17, 18, 19

julia> perf_cpus = filter(i -> !isefficiencycore(i), cpuids()); string(perf_cpus)
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]"

julia> non_ht_threads = filter(!ishyperthread, perf_cpus); string(non_ht_threads)
"[0, 2, 4, 6, 8, 10]"

julia> ht_threads = filter(ishyperthread, perf_cpus); string(ht_threads)
"[1, 3, 5, 7, 9, 11]"

pinning the Julia threads to the "non-HT" performance CPU-threads works as expected:

julia> pinthreads(non_ht_threads)

julia> string(getcpuids())
"[0, 2, 4, 6, 8, 10]"

julia> string(ThreadPinning.openblas_getcpuids())
ERROR: The affinity mask of OpenBLAS thread 1 includes multiple CPU threads. This likely indicates that this OpenBLAS hasn't been pinned yet.

But after pinning the OpenBLAS threads to the "other half", i.e. the "HT" CPU-threads of the performance cores,

julia> ThreadPinning.openblas_pinthreads(ht_threads)

the Julia thread on CPU 0 gets reassigned to CPU 11, so it now shares that CPU thread with an OpenBLAS thread (and the underlying core with the Julia thread on CPU 10):

julia> string(getcpuids())
"[11, 2, 4, 6, 8, 10]"

julia> string(ThreadPinning.openblas_getcpuids())
"[1, 3, 5, 7, 9, 11]"

which is obviously not what we want. When trying to fix this by re-pinning the Julia threads

julia> pinthreads(non_ht_threads)

julia> string(getcpuids())
"[0, 2, 4, 6, 8, 10]"

julia> string(ThreadPinning.openblas_getcpuids())
"[1, 3, 5, 7, 9, 0]"

we end up with an OpenBLAS thread (on CPU 0) that shares its CPU thread with a Julia thread (and its core with another OpenBLAS thread).

(ThreadPinning v1.0.2, SysInfo v0.3.0).
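
As an aside, a quick way to check for this kind of oversubscription programmatically (just a sketch, not part of the session above; it only uses the query functions already shown):

using ThreadPinning

# The sets of CPU IDs occupied by the Julia threads and by the OpenBLAS threads
# should be disjoint; after the re-pinning above they are not.
overlap = intersect(getcpuids(), openblas_getcpuids())
isempty(overlap) || @warn "Julia and OpenBLAS threads share CPU threads" overlap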

carstenbauer commented 1 month ago

I will look into the fundamental issue when I find the time for it.

In the meantime, why are you using SysInfo directly? ThreadPinning.jl should be enough. It re-exports all the functions you're using (if not, that's probably an oversight on my end). Also, I think that threadinfo is better than sysinfo, so I wonder why you use the latter in combination with explicit getcpuids() calls instead.

oschulz commented 1 month ago

ThreadPinning.jl should be enough. It re-exports all the functions you're using

Had overlooked that - thanks, I've updated the example above.

As for sysinfo() I just wanted to use the shiny new functionality. :-) But you're right, threadinfo() is more detailed.

carstenbauer commented 1 month ago

I could reproduce this on Perlmutter (no efficiency cores):

crstnbr@login22 ThreadPinning.jl git:(main)
➜ OPENBLAS_NUM_THREADS=6 julia --project -t 6 -q
julia> using ThreadPinning

julia> pinthreads(:cores)

julia> getcpuids() |> print
[0, 1, 2, 3, 4, 5]
julia> openblas_pinthreads([128, 129, 130, 131, 132, 133]) # hyperthreads in the same cores

julia> openblas_getcpuids() |> print
[128, 129, 130, 131, 132, 133]
julia> getcpuids() |> print
[133, 1, 2, 3, 4, 5]

carstenbauer commented 1 month ago

However, my gut feeling tells me that it is not a problem with ThreadPinning.jl but a fundamental/upstream issue. Will investigate.

oschulz commented 1 month ago

Will investigate

Thanks!

carstenbauer commented 1 month ago

Goes both ways...

crstnbr@login22 ThreadPinning.jl git:(main)
➜ OPENBLAS_NUM_THREADS=6 julia --project -t 6 -q
julia> using ThreadPinning

julia> openblas_pinthreads([128, 129, 130, 131, 132, 133]) # hyperthreads only

julia> openblas_getcpuids() |> print
[128, 129, 130, 131, 132, 133]
julia> pinthreads(:cores)

julia> getcpuids() |> print
[0, 1, 2, 3, 4, 5]
julia> openblas_getcpuids() |> print
[128, 129, 130, 131, 132, 0]

and isn't related to hyperthreads and/or efficiency cores.

crstnbr@login22 ThreadPinning.jl git:(main)
➜ OPENBLAS_NUM_THREADS=6 julia --project -t 6 -q
julia> using ThreadPinning

julia> pinthreads(cores(1:6))

julia> getcpuids() |> print
[0, 1, 2, 3, 4, 5]
julia> openblas_pinthreads(cores(7:12))

julia> openblas_getcpuids() |> print
[6, 7, 8, 9, 10, 11]
julia> getcpuids() |> print
[11, 1, 2, 3, 4, 5]

oschulz commented 1 month ago

Update: I've added to the example above what happens to the OpenBLAS threads when re-pinning the Julia threads.

oschulz commented 1 month ago

Indeed, it doesn't seem to have anything to do with HT/non-HT or the specific CPU numbers chosen: trying to pin the Julia threads and the OpenBLAS threads to non-overlapping sets of CPU threads seems to always result in the behavior above.
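
To make the pattern explicit, here is a minimal generic reproducer sketch (my reading of the sessions above; it assumes OPENBLAS_NUM_THREADS == Threads.nthreads() and a machine with at least twice that many CPU threads; the index ranges are arbitrary):

using ThreadPinning
using ThreadPinning: cpuids

N = Threads.nthreads()
julia_set = cpuids()[1:N]       # first N CPU threads for the Julia threads
blas_set  = cpuids()[N+1:2N]    # next N CPU threads for OpenBLAS, disjoint from julia_set

pinthreads(julia_set)
openblas_pinthreads(blas_set)

getcpuids() |> print            # first entry seems to end up on last(blas_set)
openblas_getcpuids() |> print
isempty(intersect(getcpuids(), openblas_getcpuids()))  # should be true, but returns false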

carstenbauer commented 1 month ago

Speculation: Maybe the reason for this issue lies in https://github.com/OpenMathLib/OpenBLAS/blob/d92cc96978c17a35355101a1901981970dec25b6/driver/others/blas_server.c#L357-L359. Maybe the call to pthread_self() is problematic because the calling Julia thread is also a pthread.
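
If that is what happens, it should be observable from the Julia side: pinning the OpenBLAS threads would silently re-pin the Julia thread that makes the underlying openblas_setaffinity call (thread 1 by default). A rough sketch to probe this, under that assumption (not a confirmed mechanism):

using ThreadPinning
using ThreadPinning: cpuids

pinthreads(:cores)                        # make the calling thread's CPU deterministic
before = getcpuid()                       # CPU of the calling Julia thread
openblas_pinthreads(cpuids()[end-5:end])  # assumes OPENBLAS_NUM_THREADS=6; the chosen IDs are arbitrary
after = getcpuid()
# If the hypothesis is right, `after` is the last CPU handed to OpenBLAS, not `before`.
before == after || @info "calling Julia thread was re-pinned" before after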

carstenbauer commented 1 month ago

Varying which Julia thread makes the openblas_setaffinity call:

julia> using ThreadPinning

julia> openblas_pinthreads([128, 129, 130, 131, 132, 133]; juliathreadid=2) # hyperthreads only

julia> openblas_getcpuids() |> print
ERROR: The affinity mask of OpenBLAS thread 6 includes multiple CPU threads. This likely indicates that this OpenBLAS hasn't been pinned yet.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] openblas_getcpuid(; threadid::Int64, juliathreadid::Int64)
   @ ThreadPinningCore.Internals /pscratch/sd/c/crstnbr/.julia/packages/ThreadPinningCore/fdkhT/src/openblas.jl:70
 [3] openblas_getcpuid
   @ /pscratch/sd/c/crstnbr/.julia/packages/ThreadPinningCore/fdkhT/src/openblas.jl:59 [inlined]
 [4] openblas_getcpuids(; kwargs::@Kwargs{})
   @ ThreadPinningCore.Internals /pscratch/sd/c/crstnbr/.julia/packages/ThreadPinningCore/fdkhT/src/openblas.jl:80
 [5] openblas_getcpuids
   @ /pscratch/sd/c/crstnbr/.julia/packages/ThreadPinningCore/fdkhT/src/openblas.jl:76 [inlined]
 [6] openblas_getcpuids()
   @ ThreadPinning.Querying /pscratch/sd/c/crstnbr/ThreadPinning.jl/src/querying.jl:317
 [7] top-level scope
   @ REPL[3]:1

julia> pinthreads(:cores)

julia> getcpuids() |> print
[0, 1, 2, 3, 4, 5]
julia> openblas_getcpuids() |> print
[128, 129, 130, 131, 132, 0]

Going to a lower level, trying to isolate the issue:

julia> using ThreadPinning

julia> import ThreadPinningCore: LibCalls

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([0]));

julia> LibCalls.openblas_setaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([1]));

julia> LibCalls.openblas_getaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref
Base.RefValue{ThreadPinningCore.LibCalls.Ccpu_set_t}(Ccpu_set_t(1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000))

# restart session (to be safe)

julia> using ThreadPinning

julia> import ThreadPinningCore: LibCalls

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([0]));

julia> ThreadPinning.@fetchfrom 1 LibCalls.openblas_setaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([1]));

julia> LibCalls.openblas_getaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref
Base.RefValue{ThreadPinningCore.LibCalls.Ccpu_set_t}(Ccpu_set_t(1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000))

# restart session (to be safe)

julia> using ThreadPinning

julia> import ThreadPinningCore: LibCalls

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([0]));

julia> ThreadPinning.@fetchfrom 2 LibCalls.openblas_setaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref = Ref{LibCalls.Ccpu_set_t}(LibCalls.Ccpu_set_t([1]));

julia> LibCalls.openblas_getaffinity(5, sizeof(cpuset_ref[]), cpuset_ref)
0

julia> cpuset_ref
Base.RefValue{ThreadPinningCore.LibCalls.Ccpu_set_t}(Ccpu_set_t(1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000))

carstenbauer commented 1 month ago

@vchuravy: Do you have any ideas what might be going on here?