JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

BLAS threads should default to physical not logical core count? #33409

Open. ChrisRackauckas opened this issue 4 years ago

ChrisRackauckas commented 4 years ago

On an i7-8550u, OpenBLAS defaults to 8 threads. I was comparing against RecursiveFactorization.jl and saw performance like this:

using BenchmarkTools
import LinearAlgebra, RecursiveFactorization
# query the current OpenBLAS thread count (defaults to 8 here)
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
# set the BLAS thread count; re-run with 1, 4, and 8 for the plots below
LinearAlgebra.BLAS.set_num_threads(4)
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 0.5

# flop count estimate for LU factorization of an m×n matrix
luflop(m, n) = n^3÷3 - n÷3 + m*n^2
luflop(n) = luflop(n, n)

bas_mflops = Float64[]
rec_mflops = Float64[]
ns = 50:50:800
for n in ns
    A = rand(n, n)
    bt = @belapsed LinearAlgebra.lu!($(copy(A)))
    rt = @belapsed RecursiveFactorization.lu!($(copy(A)))
    push!(bas_mflops, luflop(n)/bt/1e9)  # GFLOPS, despite the variable names
    push!(rec_mflops, luflop(n)/rt/1e9)
end

using Plots
plt = plot(ns, bas_mflops, legend=:bottomright, lab="OpenBLAS", title="LU Factorization Benchmark", marker=:auto)
plot!(plt, ns, rec_mflops, lab="RecursiveFactorization", marker=:auto)
xaxis!(plt, "size (N x N)")
yaxis!(plt, "GFLOPS")

1 thread:

[plot: lubenchbnumthreads1]

4 threads:

[plot: lubenchbnumthreads4]

8 threads:

[plot: lubench]

Conclusion: the default that Julia chooses, 8 threads, is the worst; even 1 thread does better. Using the number of physical cores, 4, is best.

There have been a lot of threads on Discourse and in Slack #gripes saying essentially "setting BLAS threads to 1 is better than the default!", and it looks like the cause is that the default should be the number of physical, not logical, cores. I am actually very surprised it's not set that way, so I'm wondering why, and also where this default is set (I couldn't find it).

ChrisRackauckas commented 4 years ago

Interestingly, on Linux I checked our benchmarking server and see:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
stepping        : 0
microcode       : 0x3d
cpu MHz         : 2394.455
cache size      : 35840 KB
physical id     : 30
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 30
initial apicid  : 30
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips        : 4788.91
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

[crackauckas@pumas ~]$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

and it's correct there, so this may be an issue with Windows.

YingboMa commented 4 years ago

This doesn't seem like a Windows-only problem.

julia> versioninfo(verbose=true)
Julia Version 1.3.0-rc2.0
Commit a04936e3e0 (2019-09-12 19:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 19.04
  uname: Linux 5.0.0-29-generic #31-Ubuntu SMP Thu Sep 12 13:05:32 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:
              speed         user         nice          sys         idle          irq
       #1  2700 MHz     384385 s       2674 s     128341 s    7458186 s          0 s
       #2  2700 MHz     381396 s       3631 s     125158 s    1466384 s          0 s
       #3  2700 MHz     304236 s       3488 s      99122 s    1470836 s          0 s
       #4  2700 MHz     421380 s       4702 s     137860 s    1454895 s          0 s
       #5  2700 MHz     192528 s       2218 s      83784 s    1472282 s          0 s
       #6  2700 MHz     267616 s       1661 s      82347 s    1470275 s          0 s
       #7  2700 MHz     309143 s       1778 s      98877 s    1478079 s          0 s
       #8  2700 MHz     287978 s       3788 s      85621 s    1470172 s          0 s

  Memory: 15.530269622802734 GB (4015.9765625 MB free)
  Uptime: 238530.0 sec
  Load Avg:  0.39208984375  0.30615234375  0.16015625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  HOME = /home/scheme
  OPTFLAGS = -load=/home/scheme/julia/usr/lib/libjulia.so
  PATH = /home/scheme/.cargo/bin:/home/scheme/Downloads/VSCode-linux-x64/code-insiders:/sbin:/home/scheme/.cargo/bin:/home/scheme/Downloads/VSCode-linux-x64/code-insiders:/sbin:/home/scheme/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  TERM = screen-256color
  WINDOWPATH = 2

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

My Linux laptop has four physical cores.

tkf commented 4 years ago

Isn't 8 coming from here?

https://github.com/JuliaLang/julia/blob/318affa294e8d493b1b3edcc3e06df26c97eb4bb/base/Base.jl#L395-L403

See also

https://github.com/JuliaLang/julia/blob/318affa294e8d493b1b3edcc3e06df26c97eb4bb/base/sysinfo.jl#L56-L63

Ref: #27856
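
For illustration only (the actual code is at the links above): Sys.CPU_THREADS reports logical CPUs via libuv's uv_cpu_info, so any default derived from it counts hyperthreads. A minimal sketch of that kind of logic, not the real Base code:

# hypothetical sketch: honor the environment variable if set, else fall
# back to the logical CPU count (8, not 4, on a 4-core SMT machine)
blas_threads = something(
    tryparse(Int, get(ENV, "OPENBLAS_NUM_THREADS", "")),
    Sys.CPU_THREADS,
)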

macd commented 4 years ago

Here is a comparison of the 4-thread case on an Intel i7-6770HQ (Skull Canyon NUC) against Intel MKL:

[plot]

ChrisRackauckas commented 4 years ago

What does that show? It seems RecursiveFactorization is only faster when the BLAS is not well-tuned, but I don't understand how your plot relates, @macd. Could you show the 8-thread version with MKL, to see whether MKL is also not smart about the default being the logical core count?

macd commented 4 years ago

Here are the MKL results as a function of the number of threads on the Skull Canyon NUC, which has 4 cores and 8 threads. There is clearly a stumble with 8 threads around n = 500, but it is still better than 4. Apparently Intel knows how to use the hyperthreads more effectively.

[plot: mkl_threads]

ChrisRackauckas commented 4 years ago

That explains a lot. Thanks

macd commented 4 years ago

So I have egg all over my face on this one. I just looked at the code in the window I left open this morning and saw that I had incorrectly used an index rather than the indexed value for the number of threads. The graph above is therefore for 1, 2, 3, and 4 threads. The following graph shows no significant difference between 4 and 8 threads. (I never was an early-morning person.)

[plot: mkl_threads_v2]

ChrisRackauckas commented 4 years ago

Ah yes, so it looks like MKL is just smart and doesn't actually run more threads than are useful. Since OpenBLAS doesn't do that, this would account for a good chunk of the difference under the default settings.

StefanKarpinski commented 4 years ago

Can we easily detect the number of physical cores versus logical cores? Fortunately, we renamed the old Sys.CPU_CORES to Sys.CPU_THREADS, so the name Sys.CPU_CORES is available for that value, but I'm not sure how best to compute it.
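
For concreteness, the only count currently exposed is the logical one (illustrative REPL session on the 4-core/8-thread machines discussed above):

julia> Sys.CPU_THREADS  # logical CPUs, i.e. hyperthreads included
8

A hypothetical Sys.CPU_CORES would report 4 here.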

chriselrod commented 4 years ago

Is lscpu reliably available on Linux machines? It correctly identified 10 cores per socket and 2 threads per core on mine:

> lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
Stepping:            4
CPU MHz:             3980.712
CPU max MHz:         4500.0000
CPU min MHz:         1200.0000
BogoMIPS:            6600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-19
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

> lscpu | grep "Core(s) per socket"
Core(s) per socket:  10

StefanKarpinski commented 4 years ago

Some possibilities for figuring out the number of physical cores:

We may be able to borrow something from https://github.com/m-j-w/CpuId.jl

StefanKarpinski commented 4 years ago

Another possible library for figuring out the number of cores (from @RoyiAvital):

KristofferC commented 4 years ago

Just as a note, we already do a bunch of cpuid stuff to support picking functions optimized for a feature-set at runtime e.g.

https://github.com/JuliaLang/julia/blob/cf544e6afab38351eeac049af1956153e24ec99c/src/processor_x86.cpp#L5-L28.

Perhaps extending that to pick out the number of cores, instead of bundling a whole other library, would be "cleaner".

tkf commented 4 years ago

I'm not very familiar with things this close to hardware, but IIUC, CPUID is not the full story, and I suppose you need to run the instruction on each core/socket. Does Julia have a facility to do that on all supported platforms?

Also, I noticed there is an issue in libuv (https://github.com/libuv/libuv/issues/1967) tracking support for physical core counts in uv_cpu_info (which is called via Sys.cpu_info). I guess it'd be ideal if this were supported by libuv.

BTW, it looks like uv_cpu_info just queries the OS's API (e.g., /proc/cpuinfo on Linux): https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/unix/linux-core.c https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/unix/darwin.c https://github.com/JuliaLang/libuv/blob/julia-uv2-1.29.1/src/win/util.c Maybe another option is to call such APIs directly from Julia? For example, parsing /proc/cpuinfo sounds very easy.
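
A minimal sketch of that parsing (Linux-only; counts unique (physical id, core id) pairs, which identify physical cores; assumes the usual /proc/cpuinfo field layout):

function physical_cores_procfs()
    pairs = Set{Tuple{String,String}}()
    phys = ""
    for line in eachline("/proc/cpuinfo")
        if startswith(line, "physical id")
            phys = strip(split(line, ':')[2])
        elseif startswith(line, "core id")
            # each (socket, core) pair is one physical core
            push!(pairs, (phys, strip(split(line, ':')[2])))
        end
    end
    return max(length(pairs), 1)  # fall back if the fields are absent
end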

FYI: meanwhile, I wrote a simple helper Python script to launch Julia with an appropriate JULIA_CPU_THREADS: https://github.com/tkf/julia-t. It does something equivalent to lscpu -p | grep -v '^#' | sort --unique --field-separator=, --key=2 | wc -l to count the physical cores. This is useful together with Distributed.addprocs.
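
A rough Julia equivalent of that shell pipeline (assumes lscpu is installed; field 2 of lscpu -p is the physical core ID):

function ncores_lscpu()
    cpus = filter(l -> !isempty(l) && !startswith(l, '#'),
                  readlines(`lscpu -p`))
    # one entry per unique core ID == number of physical cores online
    length(unique(l -> split(l, ',')[2], cpus))
end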

KristofferC commented 4 years ago

It seems MKL limits the maximum number of threads to the number of physical cores (on an i9-9900K, which has 8 cores):

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(16)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
8

julia> ccall((:mkl_set_num_threads, "libmkl_rt"), Cvoid, (Ptr{Int32},), Ref(Int32(4)))

julia> ccall((:mkl_get_max_threads, "libmkl_rt"), Int32, ())
4
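
For contrast, the OpenBLAS shipped with Julia does no such clamping to physical cores: it accepts whatever you set, up to its build-time maximum (a sketch; the output assumes a build whose maximum thread count is at least 16):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(16)  # not clamped to the 8 physical cores

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
16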

jlperla commented 4 years ago

Is this stuff done automatically in Julia 1.5?

StefanKarpinski commented 4 years ago

Not that I'm aware of. No one ever figured out how to get the right number of threads.

jlperla commented 4 years ago

I am sure that this has already been suggested, but what if there were a temporary solution of just taking half the current number, since that would be the right calculation in most cases?

Even if AMD processors count logical cores differently than Intel (maybe they don't?), it seems better to choose a default matching the higher market share for now, so that MKL vs. OpenBLAS speed is more comparable. Others can always increase the number of threads if they wish.

tkf commented 4 years ago

No one ever figured out how to get the right number of threads.

I think it just has to be handled via whatever API the OS provides. For example, on Linux, you'd want to make it cgroup-aware rather than using the hardware information directly.

$ sudo docker run --cpuset-cpus=0-2 -i -t --rm julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.4) pkg> add CpuId
...

julia> using CpuId
[ Info: Precompiling CpuId [adafc99b-e345-5852-983c-f28acb93d879]

julia> cpucores()
4

julia> cputhreads()
8

shell> cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-2

julia> Sys.CPU_THREADS
8

Here, launching with julia -t so that Threads.nthreads() == 3 sounds like the best approach.

Parsing /proc and /sys sounds easy to do for Linux. I'm not sure about other OSes, though.

(Edit: fix the example to use --cpuset-cpus=0-2 instead of --cpuset-cpus=0-3)
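
A sketch of a cgroup-aware count for the Linux case (assumes the cgroup-v1 cpuset path used above; a list like "0-2,4" expands to 4 CPUs):

function cgroup_cpus(path = "/sys/fs/cgroup/cpuset/cpuset.cpus")
    spec = strip(read(path, String))  # e.g. "0-2" in the container above
    n = 0
    for part in split(spec, ',')
        bounds = split(part, '-')
        n += length(bounds) == 1 ? 1 :
             parse(Int, bounds[2]) - parse(Int, bounds[1]) + 1
    end
    return n
end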

StefanKarpinski commented 4 years ago

We (probably) don't want to make the whole CpuId package a dependency of Julia, so to use it we would have to extract a minimal piece that lets us get the number of cores.

Short of that, defaulting to half the number of CPU threads is probably better than saturating the logical cores. Of course, there are computations other than BLAS, and they generally do benefit from hyperthreads, so maybe not. And if anyone has gone to the trouble of disabling hyperthreading, they're suddenly going to see their BLAS-intensive Julia code get half the FLOPS, so it's not great.
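
In code, that stopgap would amount to something like this (a sketch of the heuristic, not anything currently in Base):

using LinearAlgebra

# assume 2-way SMT and halve the logical CPU count; wrong on
# machines without hyperthreading, as noted above
BLAS.set_num_threads(max(Sys.CPU_THREADS ÷ 2, 1))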

StefanKarpinski commented 4 years ago

Having libuv support this would be nice, but that is a feature-request issue with no PR to implement it, whereas we already have Julia code that uses CPUID to detect the number of cores and threads, so it seems better to do that for now. If libuv ever adds the feature, we can always switch to it.

StefanKarpinski commented 4 years ago

@KristofferC's observation that we already do a bunch of CPU detection in C is actually the most promising avenue, imo: write C code that figures out the right number of CPU cores and threads, expose that number as a C global or function, and use it to set the relevant Julia globals. So someone "just" needs to write C code that does this.

tkf commented 4 years ago

My point was that CPUID does not give us the correct information. I think it's better to parse /proc and/or /sys at least for Linux.

StefanKarpinski commented 4 years ago

How does it not agree? cpucores() == 4, and cat /sys/fs/cgroup/cpuset/cpuset.cpus showed 0-3, i.e. the CPUs with IDs 0, 1, 2, and 3.

tkf commented 4 years ago

Ah, sorry, my example was completely misleading (I've used Julia too much and can't think in 0-origin any more...). I should have used something other than --cpuset-cpus=0-3. If I use --cpuset-cpus=0-2 (as in the updated example), cpucores() still returns 4 while cat /sys/fs/cgroup/cpuset/cpuset.cpus shows 0-2. I can see that the julia process uses only 3 CPUs if I run something like LinearAlgebra.peakflops().

StefanKarpinski commented 4 years ago

We can certainly handle Linux specially, but we need to do something on all platforms.

ViralBShah commented 4 years ago

Maybe we should ship https://github.com/JuliaParallel/Hwloc.jl with Julia?

StefanKarpinski commented 4 years ago

Hwloc is pretty heavyweight. Let's not add more binary dependencies for this. It seems like extracting a very small piece of CpuId as a generic fallback for the number of physical cores, with special detection logic for specific platforms (e.g. catting /proc or /sys on Linux), is the way to go. In the future, as people propose better ways to figure out the right core counts on other platforms, we can add special cases for them.

oschulz commented 4 years ago

Not sure if this is of any help - I just ran into this issue on CoCalc. They use KVM virtual machines (in my case with 4 cores and 8 hyperthreads), and Julia uses 8 OpenBLAS threads. On other Linux systems, I've seen Julia using the number of physical cores for OpenBLAS, but not on all systems.

It seems coupled to the output of nproc. On the CoCalc VM, for example, nproc is the same as nproc --all; on other systems I've seen nproc give the number of cores and nproc --all the number of hyperthreads. But not on all systems: I've also seen nproc == nproc --all on some standard, non-virtualized Linux boxes with hyperthreading. I haven't been able to figure out why.

Btw., would it be possible to increase the maximum number of OpenBLAS threads Julia uses? On my Epyc 2 server (64 physical cores, 128 hyperthreads, single socket), I get ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()) == 32. I would love to see how OpenBLAS+Julia scales to the full 64 cores.

ViralBShah commented 4 years ago

I thought @staticfloat had a high-thread-count OpenBLAS build in BinaryBuilder. I'm not sure how to use it, but we should put those instructions in a prominent place.

ViralBShah commented 4 years ago

Clearly the current default of 16 is too small.

oschulz commented 4 years ago

Clearly the current default of 16 is too small.

Is it really 16? I do get 32 BLAS threads with Julia v1.4/v1.5 on that 64-core machine.

ViralBShah commented 4 years ago

Yeah, it is 32 in BinaryBuilder but different in deps/blas.jl. The https://github.com/JuliaPackaging/Yggdrasil/tree/master/O/OpenBLAS collection also has the high-core-count version built. Not sure how to use it.

oschulz commented 4 years ago

Is there a technical necessity for a hard upper limit (I don't know the technical depths here), or could the user be allowed to set any OPENBLAS_NUM_THREADS value they like in the future?

ViralBShah commented 4 years ago

It is a build-time option for OpenBLAS, and we try to be conservative because a default that is too high leads to significant memory consumption. I don't know whether recent versions have fixed this, and whether we can set the number to something big, like 256.

oschulz commented 4 years ago

Ah, I see! Grrr, why oh why does OpenBLAS need to know that at build time ... :-)

ViralBShah commented 4 years ago

Relevant: https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md

While we should eventually integrate OpenBLAS threading with Julia's, I wonder if we can solve the current issue by moving to OpenMP. That would have the added benefit of playing nicely with the existing OpenMP-compiled libraries in BinaryBuilder.

yuyichao commented 4 years ago

Which thread scheduler to use does not seem to have anything to do with buffer allocation. That should be fixed separately.

oschulz commented 4 years ago

I wonder if OpenBLAS actually does some code generation based on the maximum number of threads? I assume it's not a compile-time option on a whim.

yuyichao commented 4 years ago

No code generation. AFAICT it uses stack buffers for cheap and reentrant allocation. The easiest first step would be to just use VLAs (variable-length arrays) where supported.

jlperla commented 3 years ago

With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice to tell users to copy/paste into their code to pick a better default value?

chriselrod commented 3 years ago

With Julia 1.6, I assume this is still an issue? Is there a current idiot-proof best practice to tell users to copy/paste into their code to pick a better default value?

import Hwloc
ncores = Hwloc.num_physical_cores()
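
and then, presumably, applying that count via the standard LinearAlgebra API:

using LinearAlgebra
import Hwloc

# one OpenBLAS thread per physical core, per the benchmarks above
BLAS.set_num_threads(Hwloc.num_physical_cores())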

jlperla commented 3 years ago

@ChrisRackauckas Didn't you sometimes find that even a lower value was typically better or is my memory foggy?

ChrisRackauckas commented 3 years ago

That works. The issue is that Hwloc isn't what Julia uses. @ViralBShah's response was: shouldn't we just ship Hwloc with Julia, then? @StefanKarpinski's response was that it's pretty heavyweight, so it would be good to pull out just what we need. And that's where the issue has sat since. Someone just needs to put the work in.

chriselrod commented 3 years ago

OpenBLAS doesn't always ramp up well, especially for operations other than gemm (like LU), so if your matrices aren't that large, it's often faster to set it to a single thread. MKL doesn't have this problem, while the ramp-up problem is pretty extreme in BLIS, even for gemm. Asymptotically in matrix size, you want 1 thread per physical core.
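
As a rough illustration of that advice (the cutoff of 200 is an arbitrary placeholder, not a tuned value):

using LinearAlgebra
import Hwloc

# one thread for small problems, one per physical core for large ones
blas_threads_for(n; cutoff = 200) = n < cutoff ? 1 : Hwloc.num_physical_cores()

BLAS.set_num_threads(blas_threads_for(500))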

For x86, we could also strip the code from CpuId.jl:

#
#   TODO:
#   The following llvmcall routines fail when being inlined!
#   Hence the @noinline.
#

# Low level cpuid call, taking eax=leaf and ecx=subleaf,
# returning eax, ebx, ecx, edx as NTuple(4,UInt32)
@noinline function cpuid_llvm(leaf::UInt32, subleaf::UInt32)
    Base.llvmcall("""
        ; leaf = %0, subleaf = %1, %2 is some label
        ; call 'cpuid' with arguments loaded into registers EAX = leaf, ECX = subleaf
        %3 = tail call { i32, i32, i32, i32 } asm sideeffect "cpuid",
            "={ax},={bx},={cx},={dx},{ax},{cx},~{dirflag},~{fpsr},~{flags}"
            (i32 %0, i32 %1) #2
        ; retrieve the result values and convert to vector [4 x i32]
        %4 = extractvalue { i32, i32, i32, i32 } %3, 0
        %5 = extractvalue { i32, i32, i32, i32 } %3, 1
        %6 = extractvalue { i32, i32, i32, i32 } %3, 2
        %7 = extractvalue { i32, i32, i32, i32 } %3, 3
        ; return the values as a new tuple
        %8  = insertvalue [4 x i32] undef, i32 %4, 0
        %9  = insertvalue [4 x i32]   %8 , i32 %5, 1
        %10 = insertvalue [4 x i32]   %9 , i32 %6, 2
        %11 = insertvalue [4 x i32]  %10 , i32 %7, 3
        ret [4 x i32] %11
    """
    # llvmcall requires actual types, rather than the usual (...) tuple
    , NTuple{4,UInt32}, Tuple{UInt32,UInt32}, leaf, subleaf)
end
function cpuid(leaf=0, subleaf=0)
    # for some reason, we need a dedicated local
    # variable of UInt32 for llvmcall to succeed
    l, s = UInt32(leaf), UInt32(subleaf)
    cpuid_llvm(l, s) ::NTuple{4,UInt32}
end

@inline function hasleaf(leaf::UInt32) ::Bool
    eax, ebx, ecx, edx = cpuid(leaf & 0xffff_0000)
    eax >= leaf
end
function cpucores() ::Int

    leaf = 0x0000_000b
    hasleaf(leaf) || return zero(UInt32)

    # The number of cores reported by cpuid is actually already the total
    # number of cores at that level, including all of the lower levels.
    # Thus, we need to find the highest level...which is 0x02 == "Core"
    # on ecx[15:08] per the specs, and divide it by the number of
    # 0x01 == "SMT" logical cores.

    sl = zero(UInt32)
    nc = zero(UInt32) # "Core" count
    nl = zero(UInt32) # "SMT" count
    while (true)
        # ebx[15:0] must be non-zero according to manual
        eax, ebx, ecx, edx = cpuid(leaf, sl)
        ebx & 0xffff == 0x0000 && break
        sl += one(UInt32)
        lt = ((ecx >> 8) & 0xff) & 0x03
        # Let's assume for now there's only one valid entry for each level
        lt == 0x01 && ( nl = ebx & 0xffff; continue )
        lt == 0x02 && ( nc = ebx & 0xffff; continue )
        # others are invalid and shouldn't be considered..
    end

    return iszero(nc) ? # no cores detected? then maybe its AMD?
        # AMD
        ((cpuid(0x8000_0008)[3] & 0x00ff)+1) :
        # Intel, we need nonzero values of nc and nl
        (iszero(nl) ? nc : nc ÷ nl)
end

This works:

julia> cpucores()
4

julia> versioninfo() # the 1165G7 has 4 cores
Julia Version 1.7.0-DEV.581
Commit d524f21917* (2021-02-19 22:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

I think Hwloc is clever about only reporting the number of cores a VM is given (IIRC Hwloc.num_physical_cores() returns 2 on GitHub CI, even though those servers have many more cores).

I haven't looked at it closely, but here is Hwloc's x86 code. Probably more rigorous and better tested than the above. The trick to being a cross-platform library like Hwloc is to have different code for every platform.

StefanKarpinski commented 3 years ago

I'm marking this for triage so that we can discuss whether @chriselrod's implementation here is something we can go ahead with. This would be very nice to finally fix, as the current default is a rather unfortunate performance trap. While we're at it, perhaps we can discuss whether Julia should default to having as many threads as cores.

oschulz commented 3 years ago

A good default that would also suit larger/shared systems might be to set the default nthreads to the number of physical cores in a single NUMA domain. I'm not sure how easy that is to get cross-platform without additional libraries, though.

JeffBezanson commented 3 years ago

Triage says :+1: but we would like to implement this inside processor.cpp.

jlperla commented 2 years ago

Where did this end up in 1.7? Do we still need custom logic to choose processors?