JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Madvise HUGEPAGE for large arrays #56521

Open artemsolod opened 2 weeks ago

artemsolod commented 2 weeks ago

While playing around with numpy arrays from julia (discourse thread), I noticed that code like copy_np outperforms copy_jl (full benchmark here):

using PythonCall
np = pyimport("numpy")

function copy_jl(arr)
    sum(copy(arr))
end

function copy_np(arr)
    pymem = np.empty(length(arr))
    pyarr = PyArray(pymem)
    pyarr .= arr
    ans = sum(pyarr)
    return ans
end

Further exploration revealed that disabling hugepages in numpy removes this gap: ENV["NUMPY_MADVISE_HUGEPAGE"] = "0". Apparently numpy uses hugepages by default for large arrays (docs and source). If I understand correctly, /sys/kernel/mm/transparent_hugepage/enabled is normally not set to always by default but rather to madvise, which requires the underlying code to explicitly request this functionality; I have checked this on Ubuntu 24.10 and 22.04.

Can we have this in julia as well? Given that people routinely work with large arrays (and given the adoption in numpy), this would likely be useful at least as an option. My experiments indicate that manually constructing arrays from memory that was madvise'd for hugepages shows a good improvement:
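For reference, here is a small helper (my addition, not part of the benchmark; it assumes the Linux sysfs layout) that reports the system-wide transparent hugepage policy, which on the Ubuntu versions mentioned above shows madvise:

```julia
# Minimal helper (Linux only): report the transparent hugepage policy.
# The kernel marks the active mode with brackets, e.g. "always [madvise] never".
function thp_mode(path = "/sys/kernel/mm/transparent_hugepage/enabled")
    isfile(path) || return "unknown"  # non-Linux, or THP compiled out
    m = match(r"\[(\w+)\]", read(path, String))
    return m === nothing ? "unknown" : String(m.captures[1])
end
```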

timings

julia +release -t 1 copyadd_hugepage.jl 
... precomps
  0.043465 seconds (3 allocations: 76.294 MiB) # pure julia
  0.043485 seconds (3 allocations: 76.294 MiB) # malloc
  0.021253 seconds (3 allocations: 76.294 MiB) # malloc + madvise(MADV_HUGEPAGE)

source

import Mmap: MADV_HUGEPAGE

function malloc_hugepage_vector_allocator(nels; hugepage=true)
    nbytes = nels * sizeof(Float64)
    mlcd = @ccall pvalloc(nbytes::Csize_t)::Ptr{Cvoid} # pvalloc is deprecated, but convenient for page-aligned memory
    if hugepage
        ret = @ccall madvise(mlcd::Ptr{Cvoid}, nbytes::Csize_t, MADV_HUGEPAGE::Cint)::Cint
        @assert iszero(ret)
    end
    arr_cons = @ccall jl_ptr_to_array_1d(
        Vector{Float64}::Any,
        mlcd::Ptr{Cvoid},
        nels::Csize_t,
        1::Cint # own_buffer = 1: Julia's GC will free the memory
    )::Vector{Float64}
    return arr_cons
end

function copyadd_jl(arr)
    arr1 = Vector{Float64}(undef, length(arr))
    arr1 .= arr
    return sum(arr1)
end

function copyadd_hugepage(arr; hugepage=true)
    arr1 = malloc_hugepage_vector_allocator(length(arr); hugepage)
    arr1 .= arr
    return sum(arr1)
end

GC.enable(false)
arr_test = ones(10_000_000)

@time copyadd_jl(arr_test)
@time copyadd_hugepage(arr_test; hugepage=false)
@time copyadd_hugepage(arr_test; hugepage=true)

@time copyadd_jl(arr_test)
@time copyadd_hugepage(arr_test; hugepage=false)
@time copyadd_hugepage(arr_test; hugepage=true)
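As a side note, the deprecated pvalloc above can be avoided. The sketch below (my variant, not benchmarked against the timings above; Linux only) gets page-aligned memory from an anonymous mmap and uses the stdlib's Mmap.madvise! to request hugepages. The result is mmap-backed rather than malloc'd, so it is not a drop-in replacement for GC-managed array storage, but it is enough for this experiment:

```julia
using Mmap

# Allocate a page-aligned Float64 vector via anonymous mmap and, optionally,
# advise the kernel to back it with transparent hugepages (Linux only).
function mmap_hugepage_vector(nels::Integer; hugepage::Bool = true)
    v = Mmap.mmap(Mmap.Anonymous(), Vector{Float64}, nels)
    hugepage && Mmap.madvise!(v, Mmap.MADV_HUGEPAGE)
    return v
end
```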

system

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × INTEL(R) XEON(R) GOLD 6548N
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-server)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

d-netto commented 2 weeks ago

While there may be some performance wins from fewer TLB misses and the like by using huge pages, the cost is the potential of increased fragmentation for some workloads.

We actually went to the point of passing MADV_NOHUGEPAGE in a few places in our Julia fork at RelationalAI to avoid this: https://github.com/RelationalAI/julia/commit/2fbcf5ad301f04b1489ac3f62fa1c7447d184cf0.

artemsolod commented 2 weeks ago

While there may be some performance wins from fewer TLB misses and the like by using huge pages, the cost is the potential of increased fragmentation for some workloads.

Can this be mitigated somewhat by ensuring that madvise(_, _, MADV_HUGEPAGE) is only called when the requested allocation is known to be large?
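One possible shape for that policy, purely as a sketch (the 4 MiB cutoff and the function name are my assumptions for illustration, not anything numpy or Julia actually uses): gate the madvise call on the allocation size, so small allocations never pay the potential fragmentation cost.

```julia
using Mmap: MADV_HUGEPAGE  # Linux only

# Hypothetical policy sketch: only request hugepages for allocations at or
# above some threshold. The 4 MiB value is an assumed cutoff, not tuned.
const HUGEPAGE_THRESHOLD = 4 * 1024 * 1024  # bytes

function maybe_madvise_hugepage(ptr::Ptr{Cvoid}, nbytes::Integer)
    nbytes < HUGEPAGE_THRESHOLD && return false  # too small: do nothing
    rc = @ccall madvise(ptr::Ptr{Cvoid}, nbytes::Csize_t, MADV_HUGEPAGE::Cint)::Cint
    return rc == 0  # true if the kernel accepted the advice
end
```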