JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

Precompilation throws: `InexactError: check_top_bit(UInt64, -3141633)` #177

Open nathanaelbosch opened 1 year ago

nathanaelbosch commented 1 year ago

I wanted to use a package that relies on Octavian, but the precompilation step fails in the following way:

julia> using Octavian
[ Info: Precompiling Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4]
ERROR: LoadError: InexactError: check_top_bit(UInt64, -3141633)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt64}, val::Int64)
    @ Core ./boot.jl:634
  [2] check_top_bit
    @ ./boot.jl:648 [inlined]
  [3] toUInt64
    @ ./boot.jl:759 [inlined]
  [4] UInt64
    @ ./boot.jl:789 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] cconvert
    @ ./essentials.jl:492 [inlined]
  [7] malloc
    @ ./libc.jl:355 [inlined]
  [8] valloc
    @ ~/.julia/packages/VectorizationBase/0dXyA/src/alignment.jl:36 [inlined]
  [9] init_bcache
    @ ~/.julia/packages/Octavian/XhL0C/src/init.jl:19 [inlined]
 [10] __init__()
    @ Octavian ~/.julia/packages/Octavian/XhL0C/src/init.jl:3
 [11] macro expansion
    @ ~/.julia/packages/Octavian/XhL0C/src/Octavian.jl:80 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/SnoopPrecompile/1XXT1/src/SnoopPrecompile.jl:119 [inlined]
 [13] top-level scope
    @ ~/.julia/packages/Octavian/XhL0C/src/Octavian.jl:77
 [14] include
    @ ./Base.jl:457 [inlined]
 [15] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
    @ Base ./loading.jl:2010
 [16] top-level scope
    @ stdin:2
in expression starting at /home/me/.julia/packages/Octavian/XhL0C/src/Octavian.jl:1
in expression starting at stdin:2
ERROR: Failed to precompile Octavian [6fd5a793-0b7e-452c-907f-f8bfe9c57db4] to "/home/me/julia/compiled/v1.9/Octavian/jl_1m6rbd".

This is on a compute cluster, so the error might be linked to the setup I suppose, but I don't know enough about such things to figure out how to solve this. Any pointers?

EDIT: This happens both on the new Julia 1.9.0, as well as on 1.8.5

chriselrod commented 1 year ago

> This is on a compute cluster, so the error might be linked to the setup I suppose

It's probably not reading the cache sizes directly.

nathanaelbosch commented 1 year ago

Any idea on how I can fix this?

chriselrod commented 1 year ago

What do you get for

julia> using CPUSummary

julia> CPUSummary.cache_size(Val(1))
static(32768)

julia> CPUSummary.cache_size(Val(2))
static(1048576)

julia> CPUSummary.cache_size(Val(3))
static(1441792)

julia> using Hwloc

julia> Hwloc.cachesize()
(L1 = 32768, L2 = 1048576, L3 = 20185088)

nathanaelbosch commented 1 year ago

julia> CPUSummary.cache_size(Val(1))
static(32768)

julia> CPUSummary.cache_size(Val(2))
static(4194304)

julia> CPUSummary.cache_size(Val(3))
static(1048576)

julia> Hwloc.cachesize()
(L1 = 32768, L2 = 4194304, L3 = 16777216)

chriselrod commented 1 year ago

That all looks correct, except -- you have 4 MiB of L2 cache? What CPU are you on?

julia> versioninfo()
Julia Version 1.10.0-DEV.1254
Commit b9b8b38ec0 (2023-05-09 20:47 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 41 on 28 virtual cores
Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/
  JULIA_NUM_THREADS = 28

julia> ccall(:jl_getpagesize, Int, ())
4096

and let's confirm jl_getpagesize is returning correctly.

chriselrod commented 1 year ago

It'd be easier for you to debug these yourself and tell me what is wrong. Why do we have an invalid call to malloc?

  [7] malloc
    @ ./libc.jl:355 [inlined]
  [8] valloc
    @ ~/.julia/packages/VectorizationBase/0dXyA/src/alignment.jl:36 [inlined]
  [9] init_bcache
    @ ~/.julia/packages/Octavian/XhL0C/src/init.jl:19 [inlined]

https://github.com/JuliaLinearAlgebra/Octavian.jl/blob/00d50b3fb270f23d7f94dedd261cd95a3fb25af3/src/init.jl#LL16C1-L27C4

function init_bcache()
  if bcache_count() ≢ Zero()
    if BCACHEPTR[] == C_NULL
      BCACHEPTR[] = VectorizationBase.valloc(
        Threads.nthreads() * second_cache_size() * bcache_count(),
        Cvoid,
        ccall(:jl_getpagesize, Int, ())
      )
    end
  end
  nothing
end

calls

function valloc(
  N::Union{Integer,StaticInt},
  ::Type{T} = Float64,
  a = max(register_size(), cache_linesize())
) where {T}
  # We want alignment to both vector and cacheline-sized boundaries
  size_T = max(1, sizeof(T))
  reinterpret(
    Ptr{T},
    align(reinterpret(UInt, Libc.malloc(size_T * N + a - 1)), a)
  )
end

https://github.com/JuliaSIMD/VectorizationBase.jl/blob/9174dcca731144935e438d44ba07f4e4ec3a66c6/src/alignment.jl#L29-L40

So

N = Threads.nthreads() * Octavian.second_cache_size() * Octavian.bcache_count()
a = ccall(:jl_getpagesize, Int, ())
N + a - 1

Seems to be negative.

You can copy paste the definitions of Octavian.bcache_count and Octavian.second_cache_size.
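To illustrate why a negative byte count surfaces as an `InexactError` rather than as a failed allocation, here is a minimal sketch. It uses the size from the error message at the top of this issue and reproduces only the integer conversion visible in frames [3] to [6] of the stack trace; it does not call `malloc` itself. `Csize_t` is `UInt64` on a 64-bit build.

```julia
# Byte count taken from the error message in this issue.
requested = -3141633

# Libc.malloc passes its argument to C as a Csize_t (unsigned),
# so a negative Int cannot be converted and the conversion throws.
err = try
    convert(Csize_t, requested)
    nothing
catch e
    e
end

println(err isa InexactError)
```

So the allocation is never attempted; the failure happens while the argument is being converted.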

nathanaelbosch commented 1 year ago

Thanks a lot for your help, I really appreciate it.

> That all looks correct, except -- you have 4 MiB of L2 cache? What CPU are you on?

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 2 on 64 virtual cores
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_STACKTRACE_MINIMAL = true

> and let's confirm jl_getpagesize is returning correctly.

julia> ccall(:jl_getpagesize, Int, ())
4096

> It'd be easier for you to debug these yourself and tell me what is wrong. So [...] Seems to be negative.

It is negative indeed! I get

julia> N + a - 1
-6287361

This line seems to be the issue: https://github.com/JuliaLinearAlgebra/Octavian.jl/blob/00d50b3fb270f23d7f94dedd261cd95a3fb25af3/src/global_constants.jl#L70 According to CPUSummary I have

julia> (CPUSummary.cache_size(second_cache()), CPUSummary.cache_size(first_cache()))
(static(1048576), static(4194304))

so the former minus the latter gives a negative number.

You mentioned that the results I wrote earlier (https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/177#issuecomment-1542093390) looked correct, but were they? The L3 numbers reported by CPUSummary and Hwloc differ, and in particular, if CPUSummary reported the Hwloc number, then the result would not be negative (but again, I really don't know anything about hardware, so this might make no sense).
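For reference, the arithmetic can be reproduced with plain integers. This is only a sketch: `malloc_request` is a hypothetical helper, not Octavian code, and it assumes that `bcache_count()` evaluates to 1 on this machine. The thread does not state that value, but it is consistent with both negative numbers reported here.

```julia
# Values reported in this thread for the affected machine.
second_cache_bytes = 1048576   # CPUSummary.cache_size(second_cache())
first_cache_bytes  = 4194304   # CPUSummary.cache_size(first_cache())
page_size          = 4096      # ccall(:jl_getpagesize, Int, ())
bcache_count       = 1         # ASSUMPTION: not stated in the thread

# Hypothetical helper mirroring N + a - 1 from the comment above,
# with the second cache size taken as "second minus first".
malloc_request(nthreads) =
    nthreads * (second_cache_bytes - first_cache_bytes) * bcache_count + page_size - 1

println(malloc_request(1))   # -3141633, the value in the original error
println(malloc_request(2))   # -6287361, the value seen with 2 threads
```

Under that assumption, the one-thread result matches the precompilation error and the two-thread result matches the REPL session with `Threads: 2`.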

chriselrod commented 1 year ago

Cascadelake has 1 MiB of L2 cache/core. So the 4 MiB reported is wrong. Furthermore

julia> sc = Octavian.second_cache()
static(3)

julia> Octavian.cache_inclusive(sc)
static(false)

the cache is not inclusive either, so it shouldn't be subtracting.

CPUSummary is supposed to report per-core sizes, hence the discrepancy for the L3 cache vs hwloc. Octavian also shouldn't be trying to use a greater allotment of cache than the number of threads it has, as it can't assume the other threads aren't busy working on something else (if they weren't, Octavian itself could/should've been multithreaded).
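One way to picture the two points about subtraction is the following sketch. `usable_second_cache_size` is a hypothetical function and not part of Octavian or CPUSummary; it subtracts the first cache only for an inclusive cache and rejects a non-positive result instead of passing it on to an allocation.

```julia
# Hypothetical guard, NOT Octavian's implementation.
function usable_second_cache_size(second::Integer, first::Integer, inclusive::Bool)
    # Only an inclusive cache duplicates the contents of the smaller cache.
    usable = inclusive ? second - first : second
    usable > 0 || throw(ArgumentError(
        "inconsistent cache sizes: second=$second, first=$first"))
    return usable
end

# Sizes reported for the affected machine in this thread.
println(usable_second_cache_size(1048576, 4194304, false))

# Treating the same sizes as inclusive is caught early.
caught = try
    usable_second_cache_size(1048576, 4194304, true)
    false
catch e
    e isa ArgumentError
end
println(caught)
```

With a check of this kind, misreported sizes would produce a descriptive error at initialization rather than an `InexactError` deep inside `malloc`.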

sloede commented 1 year ago

I am getting the same errors on my machine. It is not a compute cluster but a virtual machine by a large German cloud provider (Hetzner). I also get the reported numbers

julia> (CPUSummary.cache_size(second_cache()), CPUSummary.cache_size(first_cache()))
(static(2097152), static(4194304))

That is, the first cache is reported as much larger than the second one, and thus N is already negative.

@nathanaelbosch did you find a way to fix this issue for you? @chriselrod If I reach out to their support regarding this, what exactly should I tell them (preferably without having to rely on Julia terminology)? That their setup reports the wrong cache sizes for the L2 and L3 caches?

nathanaelbosch commented 1 year ago

@sloede Unfortunately I did not find a way to fix this. But I would be very interested in a solution to this issue.

chriselrod commented 1 year ago

As a workaround, we could hardcode values for certain architectures, e.g. check for

julia> Sys.CPU_NAME
"cascadelake"

sloede commented 1 year ago

Is the amount of cache fixed for certain architectures? In my case, they identify as Skylake, as far as I can tell

chriselrod commented 1 year ago

Yes. CPUSummary reports cache per core, so that Octavian can assume the total L3 cache it can use is proportional to the number of cores it is using. (If other cores are doing something else, they're likely to use/want some chunk of the L3 themselves.)
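The per-core convention can be checked against the numbers posted earlier for the i9-9940X. This is a small sketch that assumes 14 physical cores for that CPU (the `versioninfo()` output above lists 28 virtual cores).

```julia
total_l3_bytes = 20185088   # Hwloc.cachesize().L3 posted above
physical_cores = 14         # ASSUMPTION: 28 virtual cores, 2 per physical core

per_core_l3 = total_l3_bytes ÷ physical_cores

println(per_core_l3)          # 1441792, same as CPUSummary.cache_size(Val(3)) above
println(per_core_l3 / 2^20)   # 1.375 (MiB)
```

So on that machine the Hwloc total and the CPUSummary per-core value describe the same L3 cache.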

Skylake-avx512 and cascadelake are almost the same thing. They both have the same amount of L1 and L2 cache per core: 32 KiB L1d, 32 KiB L1i, 1024 KiB L2.

They also have something like a 1.375 MiB L3 slice per core, shared among all cores. I'm assuming your server is skylake-avx512 rather than skylake?
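A hardcoded fallback along the lines of the workaround suggested earlier could look like the sketch below. The table and the function `hardcoded_cache_sizes` are hypothetical and exist in neither Octavian nor CPUSummary; the sizes are the per-core figures quoted in this comment, with the approximate 1.375 MiB L3 slice written as 1408 KiB.

```julia
const KiB = 1024

# Hypothetical lookup table keyed by Sys.CPU_NAME; per-core sizes in bytes.
const HARDCODED_CACHE_SIZES = Dict(
    "skylake-avx512" => (L1d = 32 * KiB, L2 = 1024 * KiB, L3_per_core = 1408 * KiB),
    "cascadelake"    => (L1d = 32 * KiB, L2 = 1024 * KiB, L3_per_core = 1408 * KiB),
)

# Returns the hardcoded sizes, or `nothing` for an architecture not in the table.
hardcoded_cache_sizes(cpu_name::AbstractString = Sys.CPU_NAME) =
    get(HARDCODED_CACHE_SIZES, cpu_name, nothing)

sizes = hardcoded_cache_sizes("cascadelake")
println(sizes.L2)            # 1048576
println(sizes.L3_per_core)   # 1441792
```

A caller would fall back to the detected values whenever the lookup returns `nothing`, so only the architectures listed explicitly are overridden.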

sloede commented 1 year ago

> I'm assuming your server is skylake-avx512 rather than skylake?

Yes, it looks like it - `julia -e 'using InteractiveUtils; versioninfo(verbose=true)'` gives me the following:

Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 20.04.6 LTS
  uname: Linux 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64
  CPU: Intel Xeon Processor (Skylake, IBRS): 
              speed         user         nice          sys         idle          irq
       #1  2099 MHz      10207 s          0 s        968 s      67491 s          0 s
       #2  2099 MHz      10011 s          0 s       1001 s      67651 s          0 s
       #3  2099 MHz       9902 s          0 s        940 s      67830 s          0 s
       #4  2099 MHz      10310 s          0 s        969 s      67400 s          0 s
       #5  2099 MHz        242 s          0 s        210 s      78168 s          0 s
       #6  2099 MHz        250 s          0 s        200 s      78101 s          0 s
       #7  2099 MHz        235 s          0 s        212 s      78164 s          0 s
       #8  2099 MHz        265 s          8 s        210 s      78115 s          0 s
  Memory: 8.0 GB (7221.390625 MB free)
  Uptime: 7880.56 sec
  Load Avg:  0.54  0.12  0.04
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 8 virtual cores
Environment:
  GITHUB_PATH = /_work/github-runner-1-3/_temp/_runner_file_commands/add_path_cec3c7ec-60a7-4b35-842d-3fdcb4ffc5f2
  HOME = /root
  GITHUB_EVENT_PATH = /_work/github-runner-1-3/_temp/_github_workflow/event.json
  PATH = /opt/hostedtoolcache/julia/1.9.0/x64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/actions-runner