Also thought about defining e.g. const vec1156_f64 = zeros(Float64, 1156) and const vec1156_f32 = zeros(Float32, 1156) as global mutable state, but that sounds like a terrible idea too since it also prevents using these predicates in parallel. Maybe defining nthreads() copies of these states and indexing with threadid() would work, but I imagine there's something bad there too with tasks switching threadids.
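For reference, the per-thread idea would look something like this (names are hypothetical); the catch is that since Julia 1.7 tasks can migrate between threads at yield points, so threadid() isn't a stable index into such a table:

# Hypothetical sketch of the per-thread-copies idea. Unsafe in general:
# a task can migrate between threads mid-computation, so two tasks may
# observe the same threadid() and clobber each other's buffer.
const VEC1156_F64 = [zeros(Float64, 1156) for _ in 1:Threads.nthreads()]
thread_buffer_f64() = VEC1156_F64[Threads.threadid()]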
I could also just use this LargeNTuple approach for the adapt functions and use vectors for slow/exact, since those aren't intended to be used anyway... @large NTuple 1152 might still be rough to use but not too bad? Including 27648 leads to a 198 second using time, while including only up to 1152 gives a 3 second using time. No LargeNTuple at all gives a using time of 1 second.
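For context, the LargeNTuple idea is roughly a long isbits container built from small fixed-size tuple chunks instead of one huge NTuple; a hypothetical hand-written sketch (not the actual @large macro output) for length 1152:

# Hypothetical chunked "large NTuple": 36 chunks of 32 elements each
# give 1152 slots while every field stays a small NTuple, avoiding the
# allocation behavior of a single NTuple{1152}.
struct LargeNTuple1152{T}
    chunks::NTuple{36, NTuple{32, T}}
end
Base.getindex(t::LargeNTuple1152, i::Int) =
    t.chunks[(i - 1) ÷ 32 + 1][(i - 1) % 32 + 1]
Base.length(::LargeNTuple1152) = 1152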
Will implement a method using task_local_storage. See https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Avoiding.20large.20.60NTuple.60s.20and.20global.20mutable.20caches
The goal is to guarantee that these variables are stack allocated. Originally StaticArrays.jl was used. StaticArrays can handle sizes above 32, but of course at some point the sizes get too large. Do note, though, that the C code uses stack allocation exclusively.
I strongly doubt that playing any tricks like task local storage will be worth it. It is almost always faster and simpler to allocate memory.
I'll be setting it up in a way that'll make it easy to compare the approaches. It's straightforward to define something like
@static if isdefined(Base, :Memory)
    const Vec{T} = Memory{T}
else
    const Vec{T} = Vector{T}
end
const TASK_LOCAL_F64CACHE = Dict{Task, Dict{Tuple{UInt16, UInt8}, Vec{Float64}}}()
# could just as well do task_local_storage() with type annotations I guess
const TASK_LOCAL_F32CACHE = Dict{Task, Dict{Tuple{UInt16, UInt8}, Vec{Float32}}}()
TASK_LOCAL_CACHE(::Type{Float64}) = TASK_LOCAL_F64CACHE
TASK_LOCAL_CACHE(::Type{Float32}) = TASK_LOCAL_F32CACHE
task_local_cache(::Type{T}) where {T} = get!(TASK_LOCAL_CACHE(T), current_task()) do
    Dict{Tuple{UInt16, UInt8}, Vec{T}}()
end
function get_cache!(::Type{T}, size, id) where {T}
    tls = task_local_cache(T)
    return get!(tls, (size, id)) do
        Vec{T}(zeros(T, Int(size))) # Memory{T}(undef, Int(size)) has weird concurrency issues sometimes?
    end
end
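As the comment in the snippet notes, Base's task_local_storage could be used directly instead of a manual Dict keyed by current_task(); a minimal sketch of that variant (hypothetical key scheme, plain Vector for simplicity):

# Sketch using the task's built-in storage IdDict. The type assertion
# recovers a concrete return type from the untyped storage.
function get_cache_tls!(::Type{T}, size, id) where {T}
    return get!(task_local_storage(), (:predicate_cache, T, size, id)) do
        zeros(T, Int(size))
    end::Vector{T}
end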
Then e.g. an InsphereCache could be defined as follows (could unroll it a bit less, but it's sufficient):
struct InsphereCache{T} <: AbstractCache{T} # don't need so many fields if only using `insphere`, `inspherefast`, and `insphereadapt`
    h4::NTuple{4,T}
    h8::NTuple{8,T}
    h12::NTuple{12,T}
    h16::NTuple{16,T}
    h24::NTuple{24,T}
    h32::NTuple{32,T}
    h48_1::Vec{T}
    h48_2::Vec{T}
    h64_1::Vec{T}
    h64_2::Vec{T}
    h64_3::Vec{T}
    h96_1::Vec{T}
    h96_2::Vec{T}
    h96_3::Vec{T}
    h96_4::Vec{T}
    h96_5::Vec{T}
    h128::Vec{T}
    h192::Vec{T}
    h288_1::Vec{T}
    h288_2::Vec{T}
    h288_3::Vec{T}
    h288_4::Vec{T}
    h384_1::Vec{T}
    h384_2::Vec{T}
    h384_3::Vec{T}
    h384_4::Vec{T}
    h384_5::Vec{T}
    h384_6::Vec{T}
    h576_1::Vec{T}
    h576_2::Vec{T}
    h768_1::Vec{T}
    h768_2::Vec{T}
    h768_3::Vec{T}
    h768_4::Vec{T}
    h768_5::Vec{T}
    h768_6::Vec{T}
    h768_7::Vec{T}
    h768_8::Vec{T}
    h768_9::Vec{T}
    h1152_1::Vec{T}
    h1152_2::Vec{T}
    h1152_3::Vec{T}
    h1152_4::Vec{T}
    h1152_5::Vec{T}
    h1536_1::Vec{T}
    h1536_2::Vec{T}
    h1536_3::Vec{T}
    h2304_1::Vec{T}
    h2304_2::Vec{T}
    h2304_3::Vec{T}
    h3456::Vec{T}
    h4608::Vec{T}
    h5760::Vec{T}
    h6912_1::Vec{T}
    h6912_2::Vec{T}
    h6912_3::Vec{T}
    h6912_4::Vec{T}
    h13824_1::Vec{T}
    h13824_2::Vec{T}
    h27648::Vec{T}
end
@inline function InsphereCache{T}() where {T}
    h4 = ntuple(_ -> zero(T), Val(4))
    h8 = ntuple(_ -> zero(T), Val(8))
    h12 = ntuple(_ -> zero(T), Val(12))
    h16 = ntuple(_ -> zero(T), Val(16))
    h24 = ntuple(_ -> zero(T), Val(24))
    h32 = ntuple(_ -> zero(T), Val(32))
    h48_1 = get_cache!(T, 0x0030, 0x01)
    h48_2 = get_cache!(T, 0x0030, 0x02)
    h64_1 = get_cache!(T, 0x0040, 0x01)
    h64_2 = get_cache!(T, 0x0040, 0x02)
    h64_3 = get_cache!(T, 0x0040, 0x03)
    h96_1 = get_cache!(T, 0x0060, 0x01)
    h96_2 = get_cache!(T, 0x0060, 0x02)
    h96_3 = get_cache!(T, 0x0060, 0x03)
    h96_4 = get_cache!(T, 0x0060, 0x04)
    h96_5 = get_cache!(T, 0x0060, 0x05)
    h128 = get_cache!(T, 0x0080, 0x01)
    h192 = get_cache!(T, 0x00c0, 0x01)
    h288_1 = get_cache!(T, 0x0120, 0x01)
    h288_2 = get_cache!(T, 0x0120, 0x02)
    h288_3 = get_cache!(T, 0x0120, 0x03)
    h288_4 = get_cache!(T, 0x0120, 0x04)
    h384_1 = get_cache!(T, 0x0180, 0x01)
    h384_2 = get_cache!(T, 0x0180, 0x02)
    h384_3 = get_cache!(T, 0x0180, 0x03)
    h384_4 = get_cache!(T, 0x0180, 0x04)
    h384_5 = get_cache!(T, 0x0180, 0x05)
    h384_6 = get_cache!(T, 0x0180, 0x06)
    h576_1 = get_cache!(T, 0x0240, 0x01)
    h576_2 = get_cache!(T, 0x0240, 0x02)
    h768_1 = get_cache!(T, 0x0300, 0x01)
    h768_2 = get_cache!(T, 0x0300, 0x02)
    h768_3 = get_cache!(T, 0x0300, 0x03)
    h768_4 = get_cache!(T, 0x0300, 0x04)
    h768_5 = get_cache!(T, 0x0300, 0x05)
    h768_6 = get_cache!(T, 0x0300, 0x06)
    h768_7 = get_cache!(T, 0x0300, 0x07)
    h768_8 = get_cache!(T, 0x0300, 0x08)
    h768_9 = get_cache!(T, 0x0300, 0x09)
    h1152_1 = get_cache!(T, 0x0480, 0x01)
    h1152_2 = get_cache!(T, 0x0480, 0x02)
    h1152_3 = get_cache!(T, 0x0480, 0x03)
    h1152_4 = get_cache!(T, 0x0480, 0x04)
    h1152_5 = get_cache!(T, 0x0480, 0x05)
    h1536_1 = get_cache!(T, 0x0600, 0x01)
    h1536_2 = get_cache!(T, 0x0600, 0x02)
    h1536_3 = get_cache!(T, 0x0600, 0x03)
    h2304_1 = get_cache!(T, 0x0900, 0x01)
    h2304_2 = get_cache!(T, 0x0900, 0x02)
    h2304_3 = get_cache!(T, 0x0900, 0x03)
    h3456 = get_cache!(T, 0x0d80, 0x01)
    h4608 = get_cache!(T, 0x1200, 0x01)
    h5760 = get_cache!(T, 0x1680, 0x01)
    h6912_1 = get_cache!(T, 0x1b00, 0x01)
    h6912_2 = get_cache!(T, 0x1b00, 0x02)
    h6912_3 = get_cache!(T, 0x1b00, 0x03)
    h6912_4 = get_cache!(T, 0x1b00, 0x04)
    h13824_1 = get_cache!(T, 0x3600, 0x01)
    h13824_2 = get_cache!(T, 0x3600, 0x02)
    h27648 = get_cache!(T, 0x6c00, 0x01)
    return InsphereCache{T}(
        h4, h8, h12, h16, h24, h32, h48_1, h48_2,
        h64_1, h64_2, h64_3, h96_1, h96_2, h96_3, h96_4, h96_5,
        h128, h192, h288_1, h288_2, h288_3, h288_4,
        h384_1, h384_2, h384_3, h384_4, h384_5, h384_6,
        h576_1, h576_2, h768_1, h768_2, h768_3, h768_4, h768_5, h768_6, h768_7, h768_8, h768_9,
        h1152_1, h1152_2, h1152_3, h1152_4, h1152_5,
        h1536_1, h1536_2, h1536_3, h2304_1, h2304_2, h2304_3,
        h3456, h4608, h5760, h6912_1, h6912_2, h6912_3, h6912_4,
        h13824_1, h13824_2, h27648
    )
end
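A quick sketch of the intended semantics: buffers are keyed by task, so constructing the cache twice on the same task reuses memory, while each spawned task gets fresh copies. (Note the outer Dict itself isn't lock-protected, so concurrent first-time initialization from many tasks could still race.)

cache1 = InsphereCache{Float64}()
cache2 = InsphereCache{Float64}()
@assert cache1.h48_1 === cache2.h48_1    # same task => same buffer reused
t = Threads.@spawn InsphereCache{Float64}()
@assert fetch(t).h48_1 !== cache1.h48_1  # different task => its own buffer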
Some simplification might be possible to eliminate unnecessary caches. It'll be easy to check whether replacing get_cache!(...) with zeros(...) is faster, though I don't see why it would be. If it is faster, then for the large vectors I'll just allocate them on demand rather than defining caches, and use NTuples for sizes <= 32.
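That check could be as simple as the following, assuming BenchmarkTools.jl (sizes illustrative):

using BenchmarkTools
@btime get_cache!(Float64, 0x0480, 0x01);  # task-local cached lookup
@btime zeros(Float64, 1152);               # fresh allocation every call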
One problem with defining methods for these predicates using Tuples is that some of the methods need extremely large Tuples. For example, orient3d needs a Tuple of length 196, and insphere needs one of length 5760 (for exact), 27648 (for slow), and 1152 (for adapt). This is annoying since ntuple construction will start allocating for n >= 33. One approach is to define my own NTuple type (the LargeNTuple approach), but I think that might put me in precompilation hell... I could use vectors, but that would also hurt performance a bit. Maybe there is some other workaround that tries to avoid vectors for as long as possible, although this is not so possible in insphere, for example, since a lot of computation is done with these large Tuples before any return is even possible. A bit annoying since most of the components in these expansions are never even used (e.g. for the 1152 case, I think only <=20 of the components are even used and the rest are zero).

I'll have to try and get something working and then see what happens maybe. The full set of N we need NTuples for is: