Also thought about defining e.g. const vec1156_f64 = zeros(Float64, 1156) and const vec1156_f32 = zeros(Float32, 1156) as global mutable state, but that sounds like a terrible idea too since it also prevents using these predicates in parallel. Maybe defining nthreads() copies of these states and indexing with threadid() would work, but I imagine there's something bad there too with tasks switching threadids.
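For reference, the per-thread idea would look something like this (names are hypothetical); the catch is that since Julia 1.7 tasks can migrate between threads at yield points, so threadid() isn't a stable index into such a table:

# Hypothetical sketch of the per-thread-copies idea. Unsafe in general:
# a task can migrate between threads mid-computation, so two tasks may
# observe the same threadid() and clobber each other's buffer.
const VEC1156_F64 = [zeros(Float64, 1156) for _ in 1:Threads.nthreads()]
thread_buffer_f64() = VEC1156_F64[Threads.threadid()]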
I could also just use this LargeNTuple approach for the adapt functions and use vectors for slow/exact, since those aren't intended to be used anyway... @large NTuple 1152 might still be rough to use but not too bad? Including 27648 leads to a 198 second using time, while including only up to 1152 gives a 3 second using time. No LargeNTuple at all gives a using time of 1 second.
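For context, the LargeNTuple idea is roughly a long isbits container built from small fixed-size tuple chunks instead of one huge NTuple; a hypothetical hand-written sketch (not the actual @large macro output) for length 1152:

# Hypothetical chunked "large NTuple": 36 chunks of 32 elements each
# give 1152 slots while every field stays a small NTuple, avoiding the
# allocation behavior of a single NTuple{1152}.
struct LargeNTuple1152{T}
    chunks::NTuple{36, NTuple{32, T}}
end
Base.getindex(t::LargeNTuple1152, i::Int) =
    t.chunks[(i - 1) ÷ 32 + 1][(i - 1) % 32 + 1]
Base.length(::LargeNTuple1152) = 1152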
Will implement a method using task_local_storage. See https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Avoiding.20large.20.60NTuple.60s.20and.20global.20mutable.20caches
The goal is to guarantee that these variables are stack allocated. Originally StaticArrays.jl was used. StaticArrays can handle sizes above 32, but of course at some point the sizes get too large. Do note, though, that the C code uses stack allocation exclusively.
I strongly doubt that playing any tricks like task local storage will be worth it. It is almost always faster and simpler to allocate memory.
I'll be setting it up in a way that'll make it easy to compare the approaches. It's straightforward to define something like
@static if isdefined(Base, :Memory)
    const Vec{T} = Memory{T}
else
    const Vec{T} = Vector{T}
end
const TASK_LOCAL_F64CACHE = Dict{Task, Dict{Tuple{UInt16, UInt8}, Vec{Float64}}}()
# could just as well do task_local_storage() with type annotations I guess
const TASK_LOCAL_F32CACHE = Dict{Task, Dict{Tuple{UInt16, UInt8}, Vec{Float32}}}()
TASK_LOCAL_CACHE(::Type{Float64}) = TASK_LOCAL_F64CACHE
TASK_LOCAL_CACHE(::Type{Float32}) = TASK_LOCAL_F32CACHE
task_local_cache(::Type{T}) where {T} = get!(TASK_LOCAL_CACHE(T), current_task()) do
    Dict{Tuple{UInt16, UInt8}, Vec{T}}()
end
function get_cache!(::Type{T}, size, id) where {T}
    tls = task_local_cache(T)
    return get!(tls, (size, id)) do
        Vec{T}(zeros(T, Int(size))) # Memory{T}(undef, Int(size)) has weird concurrency issues sometimes?
    end
end
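As the comment in the snippet notes, Base's task_local_storage could be used directly instead of a manual Dict keyed by current_task(); a minimal sketch of that variant (hypothetical key scheme, plain Vector for simplicity):

# Sketch using the task's built-in storage IdDict. The type assertion
# recovers a concrete return type from the untyped storage.
function get_cache_tls!(::Type{T}, size, id) where {T}
    return get!(task_local_storage(), (:predicate_cache, T, size, id)) do
        zeros(T, Int(size))
    end::Vector{T}
end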
Then e.g. an InsphereCache could be defined as follows (could unroll it a bit less, but it's sufficient):
struct InsphereCache{T} <: AbstractCache{T} # don't need so many fields if only using `insphere`, `inspherefast`, and `insphereadapt`
    h4::NTuple{4,T}
    h8::NTuple{8,T}
    h12::NTuple{12,T}
    h16::NTuple{16,T}
    h24::NTuple{24,T}
    h32::NTuple{32,T}
    h48_1::Vec{T}
    h48_2::Vec{T}
    h64_1::Vec{T}
    h64_2::Vec{T}
    h64_3::Vec{T}
    h96_1::Vec{T}
    h96_2::Vec{T}
    h96_3::Vec{T}
    h96_4::Vec{T}
    h96_5::Vec{T}
    h128::Vec{T}
    h192::Vec{T}
    h288_1::Vec{T}
    h288_2::Vec{T}
    h288_3::Vec{T}
    h288_4::Vec{T}
    h384_1::Vec{T}
    h384_2::Vec{T}
    h384_3::Vec{T}
    h384_4::Vec{T}
    h384_5::Vec{T}
    h384_6::Vec{T}
    h576_1::Vec{T}
    h576_2::Vec{T}
    h768_1::Vec{T}
    h768_2::Vec{T}
    h768_3::Vec{T}
    h768_4::Vec{T}
    h768_5::Vec{T}
    h768_6::Vec{T}
    h768_7::Vec{T}
    h768_8::Vec{T}
    h768_9::Vec{T}
    h1152_1::Vec{T}
    h1152_2::Vec{T}
    h1152_3::Vec{T}
    h1152_4::Vec{T}
    h1152_5::Vec{T}
    h1536_1::Vec{T}
    h1536_2::Vec{T}
    h1536_3::Vec{T}
    h2304_1::Vec{T}
    h2304_2::Vec{T}
    h2304_3::Vec{T}
    h3456::Vec{T}
    h4608::Vec{T}
    h5760::Vec{T}
    h6912_1::Vec{T}
    h6912_2::Vec{T}
    h6912_3::Vec{T}
    h6912_4::Vec{T}
    h13824_1::Vec{T}
    h13824_2::Vec{T}
    h27648::Vec{T}
end
@inline function InsphereCache{T}() where {T}
    h4 = ntuple(_ -> zero(T), Val(4))
    h8 = ntuple(_ -> zero(T), Val(8))
    h12 = ntuple(_ -> zero(T), Val(12))
    h16 = ntuple(_ -> zero(T), Val(16))
    h24 = ntuple(_ -> zero(T), Val(24))
    h32 = ntuple(_ -> zero(T), Val(32))
    h48_1 = get_cache!(T, 0x0030, 0x01)
    h48_2 = get_cache!(T, 0x0030, 0x02)
    h64_1 = get_cache!(T, 0x0040, 0x01)
    h64_2 = get_cache!(T, 0x0040, 0x02)
    h64_3 = get_cache!(T, 0x0040, 0x03)
    h96_1 = get_cache!(T, 0x0060, 0x01)
    h96_2 = get_cache!(T, 0x0060, 0x02)
    h96_3 = get_cache!(T, 0x0060, 0x03)
    h96_4 = get_cache!(T, 0x0060, 0x04)
    h96_5 = get_cache!(T, 0x0060, 0x05)
    h128 = get_cache!(T, 0x0080, 0x01)
    h192 = get_cache!(T, 0x00c0, 0x01)
    h288_1 = get_cache!(T, 0x0120, 0x01)
    h288_2 = get_cache!(T, 0x0120, 0x02)
    h288_3 = get_cache!(T, 0x0120, 0x03)
    h288_4 = get_cache!(T, 0x0120, 0x04)
    h384_1 = get_cache!(T, 0x0180, 0x01)
    h384_2 = get_cache!(T, 0x0180, 0x02)
    h384_3 = get_cache!(T, 0x0180, 0x03)
    h384_4 = get_cache!(T, 0x0180, 0x04)
    h384_5 = get_cache!(T, 0x0180, 0x05)
    h384_6 = get_cache!(T, 0x0180, 0x06)
    h576_1 = get_cache!(T, 0x0240, 0x01)
    h576_2 = get_cache!(T, 0x0240, 0x02)
    h768_1 = get_cache!(T, 0x0300, 0x01)
    h768_2 = get_cache!(T, 0x0300, 0x02)
    h768_3 = get_cache!(T, 0x0300, 0x03)
    h768_4 = get_cache!(T, 0x0300, 0x04)
    h768_5 = get_cache!(T, 0x0300, 0x05)
    h768_6 = get_cache!(T, 0x0300, 0x06)
    h768_7 = get_cache!(T, 0x0300, 0x07)
    h768_8 = get_cache!(T, 0x0300, 0x08)
    h768_9 = get_cache!(T, 0x0300, 0x09)
    h1152_1 = get_cache!(T, 0x0480, 0x01)
    h1152_2 = get_cache!(T, 0x0480, 0x02)
    h1152_3 = get_cache!(T, 0x0480, 0x03)
    h1152_4 = get_cache!(T, 0x0480, 0x04)
    h1152_5 = get_cache!(T, 0x0480, 0x05)
    h1536_1 = get_cache!(T, 0x0600, 0x01)
    h1536_2 = get_cache!(T, 0x0600, 0x02)
    h1536_3 = get_cache!(T, 0x0600, 0x03)
    h2304_1 = get_cache!(T, 0x0900, 0x01)
    h2304_2 = get_cache!(T, 0x0900, 0x02)
    h2304_3 = get_cache!(T, 0x0900, 0x03)
    h3456 = get_cache!(T, 0x0d80, 0x01)
    h4608 = get_cache!(T, 0x1200, 0x01)
    h5760 = get_cache!(T, 0x1680, 0x01)
    h6912_1 = get_cache!(T, 0x1b00, 0x01)
    h6912_2 = get_cache!(T, 0x1b00, 0x02)
    h6912_3 = get_cache!(T, 0x1b00, 0x03)
    h6912_4 = get_cache!(T, 0x1b00, 0x04)
    h13824_1 = get_cache!(T, 0x3600, 0x01)
    h13824_2 = get_cache!(T, 0x3600, 0x02)
    h27648 = get_cache!(T, 0x6c00, 0x01)
    return InsphereCache{T}(
        h4, h8, h12, h16, h24, h32, h48_1, h48_2,
        h64_1, h64_2, h64_3, h96_1, h96_2, h96_3, h96_4, h96_5,
        h128, h192, h288_1, h288_2, h288_3, h288_4,
        h384_1, h384_2, h384_3, h384_4, h384_5, h384_6,
        h576_1, h576_2, h768_1, h768_2, h768_3, h768_4, h768_5, h768_6, h768_7, h768_8, h768_9,
        h1152_1, h1152_2, h1152_3, h1152_4, h1152_5,
        h1536_1, h1536_2, h1536_3, h2304_1, h2304_2, h2304_3,
        h3456, h4608, h5760, h6912_1, h6912_2, h6912_3, h6912_4,
        h13824_1, h13824_2, h27648
    )
end
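A quick sketch of the intended semantics: buffers are keyed by task, so constructing the cache twice on the same task reuses memory, while each spawned task gets fresh copies. (Note the outer Dict itself isn't lock-protected, so concurrent first-time initialization from many tasks could still race.)

cache1 = InsphereCache{Float64}()
cache2 = InsphereCache{Float64}()
@assert cache1.h48_1 === cache2.h48_1    # same task => same buffer reused
t = Threads.@spawn InsphereCache{Float64}()
@assert fetch(t).h48_1 !== cache1.h48_1  # different task => its own buffer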
Some simplification might be possible to eliminate unnecessary caches. It'll be easy to check whether replacing get_cache!(...) with zeros(...) is faster, though I don't see why it would be. If it is faster, then for the large vectors I'll just allocate them on demand rather than defining caches, and use NTuples for sizes <= 32.
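That check could be as simple as the following, assuming BenchmarkTools.jl (sizes illustrative):

using BenchmarkTools
@btime get_cache!(Float64, 0x0480, 0x01);  # task-local cached lookup
@btime zeros(Float64, 1152);               # fresh allocation every call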
One problem with defining methods for these predicates using Tuples is that some of the methods need extremely large Tuples. For example, orient3d needs a Tuple of length 196, and insphere needs one of length 5760 (for exact), 27648 (for slow), and 1152 (for adapt). This is annoying since ntuple construction will start allocating for n >= 33. One approach is to define my own NTuple type (the LargeNTuple approach), but I think that might put me in precompilation hell... I could use vectors, but that would also hurt performance a bit. Maybe there is some other workaround that tries to avoid vectors for as long as possible, although this is not so possible in insphere, for example, since a lot of computation is done with these large Tuples before any return is even possible. A bit annoying since most of the components in these expansions are never even used (e.g. for the 1152 case, I think only <=20 of the components are even used and the rest are zero).

I'll have to try and get something working and then see what happens maybe. The full set of N we need NTuples for is: