lkuper opened this issue 9 years ago
+1 to shipping with hwloc. It's a pretty small and very useful library.
https://github.com/JuliaParallel/hwloc.jl for now (there was discussion of integrating it, in #6649 and elsewhere)
On Fri, Nov 6, 2015 at 3:44 PM, Lindsey Kuper notifications@github.com wrote:
Base.CPU_CORES reports the number of cores including hyperthreading, aka hardware threads. Looking at the discussion on the original commit (https://github.com/JuliaLang/julia/commit/5733c5149c69ea5375e381b38bc21c67258e9c5e), @StefanKarpinski wondered if it's even possible to get the number of cores without hyperthreading. Indeed, this is hard to do in a reasonably cross-platform way. But one option that works on "almost all" machines and OSes is to use the hwloc library (http://www.open-mpi.org/projects/hwloc/). (An example is here: http://stackoverflow.com/a/12486105.) Would it be possible for Julia to integrate this? Perhaps there could be a CPU_PHYSICAL_CORES constant -- I don't want to mess up CPU_CORES for anyone who's relying on the current behavior.
Probably no library can be fully accurate cross-platform, but it's got to be better than writing Base.CPU_CORES/2!
The package works as is. My vote is to keep it where it is, but consider it for inclusion in future bundles of default packages.
@ihnorton Nice -- I didn't know about Hwloc.jl.
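For reference, a minimal sketch of what querying this through the Hwloc.jl package can look like. It assumes a recent Hwloc.jl release; the num_physical_cores/num_virtual_cores names may not exist in older versions of the package.

    using Hwloc  # wraps the hwloc C library

    # Physical cores vs. logical (hyperthreaded) processing units,
    # as reported by the hwloc topology.
    nphys = Hwloc.num_physical_cores()
    nlog  = Hwloc.num_virtual_cores()
    println("physical cores: $nphys, logical processors: $nlog")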
I don't see why we need this in base. It could be helpful for the multiprocessing library, but that is also slated to exist as a package at some point.
One of the main applications is setting the default number of threads to use for multithreading. For that it would probably need to be part of the runtime system.
I can't comment on Windows, but isn't it just reading "cpu cores" from /proc/cpuinfo on Linux and sysctl -n hw.physicalcpu on OS X?
It would be great to have some sort of CPU_PHYSICAL_CORES shipped with Julia. I was having a big performance issue on Windows because of my incorrect interpretation of CPU_CORES: https://github.com/JuliaLang/julia/issues/16570
Here's a function that works on macOS:

    function physicalcores()
        p = Ref{Int32}()
        # query the "hw.physicalcpu" sysctl (physical core count) via sysctlbyname
        err = ccall(:sysctlbyname, Cint, (Cstring, Ref{Int32}, Ref{Csize_t}, Ptr{Void}, Csize_t),
                    "hw.physicalcpu", p, sizeof(Cint), C_NULL, 0)
        err < 0 && throw(SystemError("failed to retrieve hw.physicalcpu"))
        return Int(p[])
    end
Here's a function that works on Linux:
    function cpuinfo_physicalcores()
        maxcore = -1
        for line in eachline("/proc/cpuinfo")
            if startswith(line, "core id")
                # track the highest "core id" seen; cores are numbered from 0
                maxcore = max(maxcore, parse(Int, split(line, ':')[2]))
            end
        end
        maxcore < 0 && error("failed to read core ids from /proc/cpuinfo")
        return maxcore + 1
    end
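One caveat with the sketch above: core id restarts at 0 on each socket, so taking the maximum core id undercounts physical cores on multi-socket machines. A hedged variant (the function name here is made up) counts unique (physical id, core id) pairs instead, though it is still subject to the /proc/cpuinfo reliability concerns raised below:

    function cpuinfo_physicalcores_by_socket()
        socket = -1
        seen = Set{Tuple{Int,Int}}()
        for line in eachline("/proc/cpuinfo")
            if startswith(line, "physical id")
                # the socket ("physical id") line precedes "core id" in each stanza
                socket = parse(Int, split(line, ':')[2])
            elseif startswith(line, "core id")
                core = parse(Int, split(line, ':')[2])
                push!(seen, (socket, core))
            end
        end
        isempty(seen) && error("failed to read core ids from /proc/cpuinfo")
        return length(seen)
    end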
Reading from /proc/cpuinfo is very unreliable. Even on x86, I've just seen a linode server report:

    % cat /proc/cpuinfo | grep core
    core id         : 0
    cpu cores       : 1
    core id         : 0
    cpu cores       : 1
    core id         : 0
    cpu cores       : 1
    core id         : 0
    cpu cores       : 1

And the format is very different on non-x86.
@yuyichao, is there a better way? Every source I've googled suggests /proc/cpuinfo. We could use lscpu -p, but I looked at the lscpu source code and it seems to parse /proc/cpuinfo also.
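For what it's worth, the lscpu -p route could look roughly like the sketch below (the function name is made up; it assumes util-linux's lscpu is installed and uses its machine-parseable output, where each non-comment line is a core,socket pair for one logical CPU):

    function lscpu_physicalcores()
        out = read(`lscpu -p=CORE,SOCKET`, String)  # throws if lscpu is missing
        cores = Set{Tuple{Int,Int}}()
        for line in split(out, '\n')
            (isempty(line) || startswith(line, "#")) && continue  # skip header comments
            core, socket = parse.(Int, split(line, ','))
            push!(cores, (core, socket))
        end
        return length(cores)
    end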
I believe the following works on Windows:
    immutable _SYSTEM_LOGICAL_PROCESSOR_INFORMATION
        ProcessorMask::UInt
        Relationship::Cint
        data::UInt128
    end

    function physical_cores()
        # first call with a NULL buffer to get the required buffer length (in bytes)
        len = Ref{Int32}(0)
        ccall(:GetLogicalProcessorInformation, Cint, (Ptr{Void}, Ref{Int32}), C_NULL, len)
        s = sizeof(_SYSTEM_LOGICAL_PROCESSOR_INFORMATION)
        procs = Array{_SYSTEM_LOGICAL_PROCESSOR_INFORMATION}(len[] ÷ s)
        ok = ccall(:GetLogicalProcessorInformation, Cint,
                   (Ptr{_SYSTEM_LOGICAL_PROCESSOR_INFORMATION}, Ref{Int32}),
                   procs, sizeof(procs))
        ok == 0 && error("failed to fetch processor info")
        nprocs = 0
        for i in eachindex(procs)
            if procs[i].Relationship == 0 # RelationProcessorCore
                nprocs += procs[i].data & 0xff == 1 ? 1 : count_ones(procs[i].ProcessorMask)
            end
        end
        return nprocs
    end
I've only tested it on a Windows VM, though, so if someone could test it on an actual Windows machine on which physical cores ≠ logical cores, that would be good.
lscpu -p seems to work better (it might be parsing cpuinfo, but at least it works on non-x86). hwloc-info also seems to always work.
Can we rely on lscpu and/or hwloc-info always being installed on Linux systems? Are they part of the LSB?
I don't think we should call external commands for this. Bundling hwloc or copying its logic is the way to go IMHO. It also seems that we should just make Hwloc.jl a default package since we are already planning to move Parallel into a default package.
@stevengj the Windows code works properly (reports 4; have 8 logical).
I looked at the hwloc source code, and hwloc_linux_parse_cpuinfo seems to be the key routine. The logic for pulling out the number of physical cores doesn't seem too bad; the only architecture-specific stuff seems to be the code for parsing the CPU model.
Also see https://github.com/m-j-w/CpuId.jl
cpucores() and cpucores_total() to determine the number of physical and logical cores on the currently executing CPU, which typically share L3 caches and main memory bandwidth. If the result of both functions is equal, then the CPU does not make use of hyperthreading.
cc @m-j-w
CpuId.jl

Note that the package is x86-only and has a lot of wrong performance statements in its README. Using cpuid at runtime is very slow.
@yuyichao What do you consider "very slow" ?
A few hundred cycles on bare metal (plus it's serializing). Much worse in virtualized environments.
I measured around 190 cycles for a single cpuid instruction, including moving stuff in and out of registers, on a Skylake Xeon (average of 1 million repetitions). Agner Fog's instruction tables give similar cycle counts.
In virtualized environments there's no reliable way anyway to detect the number and kind of CPU cores, since the hypervisor may pretend anything it wants (or anything the admin has configured). The virtualized operating system is often bound to the same restrictions.
Either way, for something to be included in Julia Base, hwloc seems to be a good solution, particularly since maintaining something this low-level with the necessary rigor is quite some effort.
(and having re-read my statements in the readme of CpuId.jl, and having repeated some measurements, I'm still convinced they are correct.)
I didn't give much more detail since I didn't feel like randomly commenting on a package I don't use. But since the package author is here and this thread is somewhat related, here's a more detailed explanation.
The x86 CPUID instruction is slow because it's serializing (and maybe more complicated under virtualization). It seems to be this way for historical reasons and is "regrettable since it's often needed when generating code that exploits version-specific features". Therefore, basically any serious use of CPUID only queries it at initialization time and caches the result thereafter, instead of calling it every time. It's kind of funny that a cache hit is faster than querying for CPU info that should be readily available on the CPU itself.
As for the misleading part of the CpuId.jl README: it correctly acknowledges that the feature query itself takes hundreds of cycles, but the "wrong performance statement" actually refers to the operations it compares against. These include:
For comparison, 100..200 CPU cycles is roughly loading one integer from main memory
This is roughly the latency of a cache miss, which will be hidden to a large extent on an out-of-order core. You should also be able to get a high cache hit rate (unless you're writing a GC, for example). A typical cache hit takes a few cycles to tens of cycles of latency, depending on which cache you hit. Also, this is the time to load one 64-byte cacheline, which usually contains several of the integers being used by the code.
or one or two integer divisions.
That's the combined latency of roughly 5-10 integer divisions, or the reciprocal throughput of 10-20 of them.
Calling any external library function is at least one order more cycles.
If it's the first time you are calling a function and the symbol needs to be resolved, it can take this much. A normal function call is at least an order of magnitude faster. Comparing against symbol resolution isn't fair, since that effectively includes JIT time.
Having CPUID available for hacking is certainly useful, and it's certainly one way to compute a cached value for base Julia if needed (though there's already jl_cpuid and we already use it for various things). However, it shouldn't be used directly for runtime checks, and especially not for "moving such feature checks much closer or even directly in a hot zone". For any serious use, it needs to provide a cached interface (basically hwloc-like).
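To make the "cache it at initialization" point concrete, here is a minimal sketch of the pattern; Sys.CPU_THREADS (Base.CPU_CORES on the Julia versions discussed above) is only a stand-in for whatever expensive CPUID/hwloc query would actually back it:

    # Do the expensive query once, then reuse the stored value on later calls.
    const _NCORES = Ref(0)

    function cached_cores()
        if _NCORES[] == 0
            _NCORES[] = Sys.CPU_THREADS  # stand-in for the expensive CPUID/hwloc query
        end
        return _NCORES[]
    end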
Well, to put your criticism a little bit into context: you're referring to the readme of a 17-day-old package, as of yesterday at v0.1.9, where, after a screen-page-long discussion and list of downsides of and alternatives to the presented approach, the middle section of the following paragraph offends you:
CpuId takes a different approach in that it talks directly to the CPU. For instance, asking the CPU for its number of cores or whether it supports AVX2 can be achieved in probably 250..500 CPU cycles, thanks to Julia's JIT-compilation approach and inlining. For comparison, 100..200 CPU cycles is roughly loading one integer from main memory, or one or two integer divisions. Calling any external library function is at least one order more cycles. This allows moving such feature checks much closer or even directly in a hot zone (which, however, might also hint towards a questionable coding pattern). Also, CpuId gives additional feature checks, such as whether your executing on a virtual machine, which again may or may not influence how you set up your high performance computing tasks in a more general way. Finally, the cpuid(...) function exposes this low-level interface to the users, enabling them to make equally fast and reliable run-time feature checks on new or other hardware.
Interested readers may e.g. refer to Agner Fog's instruction tables (pdf) for ample information on a wide variety of CPU architectures and to check the validity of the statements made; or, until such a feature as requested by the OP is available in Julia Base, simply check whether the package CpuId.jl solves their actual problem at hand, caching or not caching the results as they see fit.
Happy to respond to issues raised over there.
You're referring to the readme of a 17 day old package, since yesterday at v0.1.9, where after a screen page long discussion and list of downsides and alternatives to the presented approach the middle section of the following paragraph offends you:
The paragraph with the wrong information is a third of the "Alternative" section and the only paragraph discussing the "advantage" of CpuId. A young age doesn't mean the README must have wrong info in it, or that no one should be allowed to point it out.
Interested readers may e.g. refer to Agner Fog's instruction tables (pdf) for ample information on a wide variety of cpu architectures and to check the validity of the statements made; or, until such a feature as requested by the OP is available in Julia Base, to simply check whether the package CpuId.jl solves their actual problem at hand, caching or not caching the results as they see fit.
The point is that none of this is mentioned in the README, and in fact the README effectively recommends against caching by giving the wrong performance figures. As I said, being able to use cpuid is certainly useful, but that doesn't justify putting wrong info in the README.
Happy to respond to issues raised over there.
Sure. I'll open an issue there since it'll be CpuId.jl-specific and not related to this thread anymore.
In the meantime, what do you think of renaming the misleading Sys.CPU_CORES to Sys.CPU_LOGICAL_CORES?
I am not the first person, and won't be the last, to complain about performance issues when the actual issue is that everyone interprets Sys.CPU_CORES to mean physical cores by default. I had this package working beautifully on my Linux laptop, unaware of the issue, until another user on Windows came to me complaining that the simulation was taking a whole day.
If this change has to take place, it better happen now before Julia v0.7 with a big warning in the NEWS file.
What do you think of the proposal CPU_CORES ---> CPU_LOGICAL_CORES?
Also, is there any chance that we can get the number of physical cores in Base without relying on third-party packages? Hwloc.jl is currently broken for some versions of hwloc released on Mac computers.
So the situation has evolved slightly now, with the Sys module, which exposes some things including Sys.CPU_THREADS (which appears to be equivalent to the earlier Base.CPU_CORES), Sys.CPU_NAME, Sys.cpu_info, Sys.cpu_summary, and Sys.total_physical_memory.
On Intel, both Sys.CPU_THREADS and length(Sys.cpu_info()) return the number of virtual cores (or 2x the number of physical cores for the CPU I was testing on). Interestingly, on an Apple M1 with 8 "performance" cores plus 2 "efficiency" cores, Sys.CPU_THREADS == 8 while length(Sys.cpu_info()) == 10.
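For anyone landing here now, the queries mentioned above look like this on Julia 1.x; note that neither value distinguishes physical from logical cores on a typical x86 machine with SMT enabled:

    # On an SMT-enabled x86 machine both of these report logical processors,
    # not physical cores.
    @show Sys.CPU_THREADS
    @show length(Sys.cpu_info())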
Libraries like LoopVectorization that need more detailed information AFAIU use a mix of CpuId.jl and Hwloc.jl.
I don't know if this current state of affairs is sufficient to close the issue, or if there is still demand for more info than Base.Sys provides, but figured an update was in order.
I thought I saw somewhere that we may be bringing Hwloc in as a dependency, in which case we can depend on it.
using Hwloc segfaults Julia when run under wine, which is why LoopVectorization uses CpuId instead.
AFAIK, supporting wine is no longer a priority, so I think I could switch back, but it's not been worth the effort.
CpuId.jl only supports AMD and Intel. Hwloc is much more cross platform.