Tokazama opened this issue 3 years ago
Memory management is pretty central to performance, so it'd be great if we could find a way to make optimizations easy.
As a brief aside, StrideArrays.jl is implemented as:

`PtrArray` -- a struct with two fields, (a) a `VectorizationBase.stridedpointer` and (b) a size tuple. Together these support the entire stride-layout interface; that is, a `PtrArray` can represent arbitrary `PermutedDimsArray`s/transposes, `OffsetArray`s, views, etc., and all of this can be statically known.

`StrideArray` -- a struct with two fields, (a) a `PtrArray` and (b) memory. If the `PtrArray` is statically sized, the memory defaults to:
mutable struct MemoryBuffer{L,T} <: DenseVector{T}
    data::NTuple{L,T}
    @inline function MemoryBuffer{L,T}(::UndefInitializer) where {L,T}
        @assert isbitstype(T) "Memory buffers must point to bits types, but `isbitstype($T) == false`."
        new{L,T}()
    end
end
Julia's compiler will stack allocate this if it doesn't escape. If dynamically sized, the memory is a `Vector`. If you take any sort of view of a `StrideArray` (e.g., `view`, `adjoint`, `permutedims`), it creates a new `PtrArray`, and then a new `StrideArray` holds that `PtrArray` and the same memory.

The memory field could be anything, so long as it's memory we can get a pointer to and it preferably plays well with Julia's compiler and GC. A `mutable struct` and a `Vector` seemed like good defaults for static and dynamic arrays, respectively.
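As a rough sketch of how such a buffer gets used (the `unsafe_convert` method and `fill_and_sum` below are my own illustration, not code from the package): getting a pointer requires `GC.@preserve`-ing the buffer, and as long as the buffer never escapes, the compiler is free to keep it on the stack.

# Illustration only: raw pointer access into a MemoryBuffer.
Base.unsafe_convert(::Type{Ptr{T}}, m::MemoryBuffer{L,T}) where {L,T} =
    convert(Ptr{T}, Base.pointer_from_objref(m))

function fill_and_sum()
    buf = MemoryBuffer{4,Float64}(undef)
    s = 0.0
    GC.@preserve buf begin                      # keep `buf` rooted while raw pointers are live
        p = Base.unsafe_convert(Ptr{Float64}, buf)
        for i in 1:4
            unsafe_store!(p, Float64(i), i)     # write through the pointer
        end
        for i in 1:4
            s += unsafe_load(p, i)              # read back
        end
    end
    return s                                    # `buf` never escapes, so it may stay on the stack
end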
Something that is needed here, however, is the ability to extract the memory. If you `GC.@preserve` a `StrideArray`, it will allocate, but not if you `GC.@preserve` the `MemoryBuffer` above. Similarly, `GC.@preserve` on a `view` or any other struct wrapping a `MemoryBuffer` will probably allocate.

Hence, to expressly support stack allocation of statically sized mutables, we need to make extracting the minimal memory object a part of the API.
Something else that I'd really like -- though I'd need to spend some time with AbstractInterpreter or other tools to see what it'd take to make an API convenient -- is the ability to swap out memory allocators contextually.

For example, maybe we have hot code where we know arrays don't escape, but the arrays are fairly large and/or get passed to lots of functions we don't want to `@inline`. It'd be great if, rather than relying on the compiler to stack allocate, we could substitute our own allocator. Maybe we can use knowledge specific to the problem to make it much more efficient, or just do something super simple -- and super fast -- like making our own stack:
const MYSTACK = Libc.malloc(1 << 30); # 1 GiB

function myfunction_using_custom_stack(args...)
    mystack = MYSTACK
    # every allocation in this function and the functions it calls uses `mystack`,
    # incrementing it by the amount of memory used
    ...
end
This would make writing "in place" code much easier. I don't really know how to make this easy, though.
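A hedged sketch of what such a hand-rolled stack could look like (all names here are made up for illustration, not a proposed API):

mutable struct BumpStack
    base::Ptr{UInt8}
    offset::Int
    len::Int
end
const MYBUMPSTACK = BumpStack(convert(Ptr{UInt8}, Libc.malloc(1 << 30)), 0, 1 << 30)  # 1 GiB backing buffer

# Carve `n` elements of type `T` out of the stack; "freeing" is just resetting `offset`.
function bump_alloc!(stack::BumpStack, ::Type{T}, n::Integer) where {T}
    nbytes = sizeof(T) * n
    stack.offset + nbytes <= stack.len || error("custom stack exhausted")
    p = convert(Ptr{T}, stack.base + stack.offset)
    stack.offset += nbytes
    return p
end

function myfunction_using_custom_stack(n)
    mark = MYBUMPSTACK.offset                  # remember where we started
    p = bump_alloc!(MYBUMPSTACK, Float64, n)   # temporary workspace, no GC allocation
    s = 0.0
    for i in 1:n
        unsafe_store!(p, Float64(i), i)
    end
    for i in 1:n
        s += unsafe_load(p, i)
    end
    MYBUMPSTACK.offset = mark                  # pop everything this function "allocated"
    return s
end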
It would also be nice to finally get `SArray` support in LoopVectorization. The most obvious way, of course, is to convert to `MArray`, but perhaps a general way of talking about doing that can come from this. It's also important to differentiate things like `SArray` vs. ranges, which `CPUIndex` doesn't do currently; we want to handle the two differently.
> Something that is needed here, however, is the ability to extract the memory. If you `GC.@preserve` a `StrideArray`, it will allocate, but not if you `GC.@preserve` the `MemoryBuffer` above. Similarly, `GC.@preserve` on a `view` or any other struct wrapping a `MemoryBuffer` will probably allocate.
I was trying to solve this with `preserve(f, data)`, which allows different garbage collection and pointer constructors per type and device; but I haven't tested the performance of any of this b/c I'm just trying to get a sane design at this point.
> It would also be nice to finally get `SArray` support in LoopVectorization.
This is actually the big thing preventing me from getting generic `StaticArray` and StridedArrays indexing support. I have some code that nicely goes back and forth between Cartesian and linear indices, but array reconstruction is just a mess right now.
I've been playing around with a generic set of buffers based off of code like `MemoryBuffer` too (I think I saw that particular one in Octavian.jl). I imagined the pointer produced by `preserve` would be a lot like `PtrArray`. I haven't thrown up any related PRs b/c I'm not sure if we want to directly create these buffers in ArrayInterface or not. I'm mostly on the fence because supporting anything other than a tuple of bits types is very complicated. I'd love to figure it out, but it looks pretty complicated and someone more competent than me should probably implement that.
> I was trying to solve this with `preserve(f, data)`, which allows different garbage collection and pointer constructors per type and device
Yeah, we could define specific overloads for `preserve`.
FWIW, VectorizationBase defines:
@inline preserve_buffer(A::AbstractArray) = A
@inline preserve_buffer(A::SubArray) = preserve_buffer(parent(A))
@inline preserve_buffer(A::PermutedDimsArray) = preserve_buffer(parent(A))
@inline preserve_buffer(A::Union{LinearAlgebra.Transpose,LinearAlgebra.Adjoint}) = preserve_buffer(parent(A))
@inline preserve_buffer(x) = x
which a few other libraries, including LoopVectorization and CheapThreads, use for this.

I'd suggest having `preserve` use `preserve_buffer`.
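For instance, a `preserve` built on top of it might look roughly like this (a sketch, not an existing method):

using VectorizationBase: preserve_buffer

# Sketch: `preserve` roots only the minimal memory-owning object while `f` runs.
@inline function preserve(f, A, args...)
    buf = preserve_buffer(A)       # e.g. the parent Vector under a view or transpose
    GC.@preserve buf begin
        f(A, args...)
    end
end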
> This is actually the big thing preventing me from getting generic `StaticArray` and StridedArrays indexing support. I have some code that nicely goes back and forth between Cartesian and linear indices, but array reconstruction is just a mess right now.
I've always been split on these two approaches:
Approach 1:

A # ::SArray
rA = Ref(A);
GC.@preserve rA begin
    p = Base.unsafe_convert(Ptr{eltype(A)}, rA)
    # can build a `VectorizationBase.StridedPointer` from this
end
Approach 2:

A # ::SArray
struct PretendPointer{....all the parameters from ArrayInterface....} <: VectorizationBase.AbstractStridedPointer
    data::D
    offset::Int
end
# can add methods to support the interface, e.g. `vload` mapping to `getindex`
"2." has the advantage of being much easier to implement, in particular because it can be ignorant of all the details about what the data it's wrapping is, and doesn't need me to change any of the current code calling stridedpointer
.
"1." has the advantage of probably being much better, but the compiler does a bad job of optimizing both of these. I'd think "1." would be easier to optimize, but it for some reason really likes to actually copy A
's memory instead of just using the base pointer, and I don't know why. There's probably a way to tell LLVM "you don't have to", but I don't know what that is yet.
"1." will also require a pattern of
sp, b = stridedpointer_and_preserve(A)
GC.@preserve b begin
...
end
where if A::SArray
, then b = Ref(A)
, and sp
is the pointer to b
.
A pattern like:

sp = stridedpointer(A)
b = preserve_buffer(A)
GC.@preserve b begin
    ...
end

won't work, since we need the `Ref` we converted to a `Ptr`; neither would calling `stridedpointer(b)`, because if `A` was any sort of wrapper like a `view` or `PermutedDimsArray`, we want the `stridedpointer` of `A`.
I think this is fine, but something to keep in mind.
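A hedged sketch of what such a combined function could look like (my guess at the shape, not code from any of these packages):

using StaticArrays: SArray
using VectorizationBase: stridedpointer, preserve_buffer

# Generic arrays: the usual strided pointer plus the object that owns the memory.
@inline stridedpointer_and_preserve(A::AbstractArray) = (stridedpointer(A), preserve_buffer(A))

# SArrays: wrap in a Ref so there is addressable memory, and point at that Ref.
@inline function stridedpointer_and_preserve(A::SArray)
    r = Ref(A)
    p = convert(Ptr{eltype(A)}, Base.unsafe_convert(Ptr{Cvoid}, r))
    return (p, r)   # the caller must `GC.@preserve` the returned Ref while using `p`
end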
Maybe I should already add these combined functions (meaning both `stridedpointer_preserver` and `groupedstridedpointer_preserve` [as an aside, I have a branch of LoopVectorization that is now using `ArrayInterface.indices` to recognize when strides of arrays match]), and recommend those as the preferred approach, since people should already be calling both.
Besides getting it so I can load from `SArray`s in `@avx` loops, it'd be good if the broadcast code could use this as well, so that `@avx` on broadcasts could create `SArray`s.
> (I think I saw that particular one in Octavian.jl).
Yeah, I copy pasted the above from Octavian.jl. StrideArrays.jl itself just has:
using Octavian
using Octavian: MemoryBuffer
Octavian uses it raw to allocate memory on the stack, without actually wrapping it in an array.
We could probably look to `StaticArrays.MArray` for adding support for non-`isbits` types, and maybe the check should be `Base.allocatedinline` instead.
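For reference, a minimal sketch of that relaxed check, reusing the `MemoryBuffer` definition quoted above with only the assertion changed:

mutable struct MemoryBuffer{L,T} <: DenseVector{T}
    data::NTuple{L,T}
    @inline function MemoryBuffer{L,T}(::UndefInitializer) where {L,T}
        # `Base.allocatedinline` accepts any type Julia stores inline, not just isbits types.
        @assert Base.allocatedinline(T) "Memory buffers must hold inline-stored types, but `Base.allocatedinline($T) == false`."
        new{L,T}()
    end
end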
> I imagined the pointer produced by `preserve` would be a lot like `PtrArray`. I haven't thrown up any related PRs b/c I'm not sure if we want to directly create these buffers in ArrayInterface or not.
I'm not sure either. Octavian probably isn't the best place, because that'd be a heavy dependency to take on. LoopVectorization itself may also want to start using it eventually.
CheapThreads.jl also wants `PtrArray`, so the code for `PtrArray` is in StrideArraysCore.jl. So I do think it'd be nice to have some of the frequently used definitions in a fairly base package.
But I think we should also be cautious about turning ArrayInterface into a kitchen sink. We already have one issue about it being too heavy to take on as a dependency.
On non-`isbits` types, I agree that it's a low priority. We can just back them with regular `Vector`s.
I think we could make something agnostic to approach 1 vs. 2, so that when someone wants to go crazy with optimization they can dig in. I was thinking the first level of interaction for a lot of users would be calling `instantiate(f, A)` or something like you find with `ntuple` or `map`:

instantiate(A) do
    ...do stuff to a pointer/pseudo-pointer...
end

Internally, buffer allocation and pointer constructors would just have sensible defaults that can then be changed. I'm still thinking through how to make it both composable and flexible, though. I'll hack out some more ideas.
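For illustration, a default `instantiate` along those lines might look roughly like this (`allocate_memory`, `pseudo_pointer`, and `initialize` are hypothetical placeholders, not an existing API):

# Sketch of a possible default: allocate a matching buffer, root it, hand the
# do-block something pointer-like, then wrap the filled buffer in the final type.
function instantiate(f, A::AbstractArray)
    buffer = allocate_memory(A)        # hypothetical: buffer matching A's eltype and size
    GC.@preserve A buffer begin
        f(pseudo_pointer(buffer))      # "...do stuff to a pointer/pseudo-pointer..."
    end
    return initialize(buffer)          # hypothetical: wrap the buffer in the final array type
end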
It might be worth having something like CPUBuffers.jl for concrete implementations of buffers and then having some of the interface stick around here. I'm not sure if that means we'll just define a function head here (e.g., `function strided_pointer end`). Some of this is pretty difficult to discuss without concrete examples. For example, I had some code that approached the `SArray` thing by creating a buffer similar to `MemoryBuffer`, converting it to a strided pointer, and then using `initialize` to dereference the buffer and wrap it in the `SArray` struct.

For a first step to start experimenting with this and checking performance, it'd be nice to have a way to distinguish when we should take approach 1 vs. 2, which is what #131 is meant to indicate. Then I can go ahead and add support so it's easy to run benchmarks and look at codegen.
What is `instantiate` passing to `allocate_memory` here? And inside the anonymous function, what are `a` vs `A`? Should it only pass the temporary that eventually gets converted into the final type (e.g., mutable -> `SArray`)?
function rotr90(A)
    ind1, ind2 = axes(A)
    return instantiate(A, x -> allocate_memory(x, (ind2, ind1))) do a, b
        m = first(ind1) + last(ind1)
        for i = ind1, j = axes(A, 2)
            b[j, m - i] = a[i, j]
        end
    end
end
Here it looks like `a === A`, and that the function gets called once instead of `eachindex(A)` times like it would in `map`.

Being able to specify the actual looping pattern is important for something like `rotr90`, and harder to specify with `map`-like APIs.
> It might be worth having something like CPUBuffers.jl for concrete implementations of buffers and then having some of the interface stick around here.
I do think a combination of `MemoryBuffer` (for when we want to allocate a fixed amount of possibly stack-allocated memory) and `Base.Ref` (for when we want to convert an `isbits` struct into something we can get a pointer to) covers most use cases. But just making this a small library is fine.
We could probably add `strided_pointer` (to replace `VectorizationBase.stridedpointer`) and the current constructors there, leaving all the indexing code inside VectorizationBase since there are several thousand lines of it. Then people could define specialized methods for the types that need them. Maybe this is also related to something like adding support for Tropical numbers, which it would be good to specify a stable API for, to make that easier and not at risk of random breakage.
Should we ping anyone from StaticArrays to ask what they think about the memory allocation aspect, in terms of converting between mutable and static representations?
> What is `instantiate` passing to `allocate_memory` here? And inside the anonymous function, what are `a` vs `A`? Should it only pass the temporary that eventually gets converted into the final type (e.g., mutable -> `SArray`)?
TBH, `allocate_memory` was mostly just a lazy placeholder for a default memory-allocating function. I think the following actually makes more sense as the fully defined `instantiate` method:
function instantiate(f::Function, allocator::Function, preserver::Function, initializer::Function, args...)
    buffer = allocator()
    preserver(f, buffer, args...)
    return initializer(buffer)
end
So `allocator` could be `() -> Libc.malloc(1 << 30)`, `() -> Ref{NTuple{3,Int}}()`, etc.
The `preserver` function would be responsible for doing something like this:

buffer_ptr = maybe_pointer(buffer)
data_ptr = maybe_pointer(data)
GC.@preserve buffer data begin
    f(buffer_ptr, data_ptr)
end
And `initializer` would be responsible for dereferencing, plus any other final wrapping or processing that needs to happen before returning to the user.
BTW, I'm not married to any of the names.
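To make that concrete, here's a hedged sketch of one possible set of defaults plugged into the `instantiate` above to produce an `SArray` (`default_preserver` and `fillbuf` are made-up names for illustration):

using StaticArrays: SArray

# Hypothetical preserver: root the buffer and hand `f` a raw pointer to it.
function default_preserver(f, buffer::Ref, args...)
    GC.@preserve buffer begin
        f(Base.unsafe_convert(Ptr{Cvoid}, buffer), args...)
    end
end

# Fill a 2×2 SArray through `instantiate` as defined above.
fillbuf = ptr -> begin
    p = convert(Ptr{Float64}, ptr)
    for i in 1:4
        unsafe_store!(p, Float64(i), i)   # write the elements in column-major order
    end
end

A = instantiate(fillbuf,
                () -> Ref{NTuple{4,Float64}}(),                 # allocator: isbits scratch space
                default_preserver,                              # preserver: GC.@preserve + raw pointer
                buf -> SArray{Tuple{2,2},Float64,2,4}(buf[]))   # initializer: dereference into the SArray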
I still need to put more thought into how we get from `instantiate(f, data)` to the fully defined function. I assumed that the allocating function would create a buffer of the same memory size as the original `data`. I'm not sure how much room we need to leave for swapping out garbage collectors, but if Julia's garbage collector is flexible enough to be aware of different devices, then the default `preserver` could just be `preserve(f, args...) = @gc_preserve f(args...)`.
> Here it looks like `a === A`, and that the function gets called once instead of `eachindex(A)` times like it would in `map`.
It doesn't create a generator like `map`. My example with `rotr90` was an exact copy from Base, just to illustrate how this approach could be more ergonomic than requiring something like

rotr90(x) = instantiate(_rotr90, x)
_rotr90(x_ptr) = ...

I know it seems a bit silly to focus on avoiding that extra line of code with `_rotr90`, but people really hate writing an extra function (I think this is why SimpleTraits is so popular).
> I still need to put more thought into how we get from `instantiate(f, data)` to the fully defined function.
If we can get a size and eltype `T`, we could use that to pick `Vector{T}` if the size is dynamic, and `Ref{NTuple{N,T}}` if it's `StaticInt{N}`.
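A hedged sketch of that choice, dispatching on the length type (`allocate_memory` is the hypothetical helper from this thread, and `StaticInt` is assumed to come from ArrayInterface, where it lived at the time):

using ArrayInterface: StaticInt

# Static length: an isbits, stack-friendly buffer. Dynamic length: an ordinary heap Vector.
allocate_memory(::Type{T}, ::StaticInt{N}) where {T,N} = Ref{NTuple{N,T}}()
allocate_memory(::Type{T}, n::Int) where {T} = Vector{T}(undef, n)

# allocate_memory(Float64, StaticInt(4))  # -> Ref{NTuple{4,Float64}}
# allocate_memory(Float64, 4)             # -> 4-element Vector{Float64}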
> I know it seems a bit silly to focus on avoiding that extra line of code with `_rotr90`, but people really hate writing an extra function (I think this is why SimpleTraits is so popular).
I really would encourage people to write in-place functions though, and then wrap the in-place version to make the convenient-but-allocating version.
I asked @maleadt on Slack about memory allocation on GPUs and he pointed me to the code here:

Concerning garbage collection, he said "...we use Julia's GC and in kernels we currently don't really support dynamic allocations."
This makes me think that we could rely on the `@gc_preserve` macro you have (which I like a lot more than anything I've come up with). Then we could have the default be `gc_preserve(f, args...) = @gc_preserve f(args...)`, which leaves room for changing this based on the `f` call.
> If we can get a size and eltype `T`, we could use that to pick `Vector{T}` if the size is dynamic, and `Ref{NTuple{N,T}}` if it's `StaticInt{N}`.

That's exactly what I was thinking. I don't think it would be unreasonable to expect users to do something like `allocate_memory(x::AbstractArray, size::Integer)`.
I think we are in a good place to start working on array constructors (it would also make documenting examples a whole lot easier). This brings up stuff related to `similar`:

I'm not convinced there's a silver bullet for array constructors, but I think we could at least find a solution to what `similar` is often trying to do: allow generic method definitions without worrying about array-specific constructors. I think we actually need to break up what `similar` does into several pieces, though. I haven't worked out all the details, but here's some of the cleaner code I have so far that might support this.

The reason I think separating things out like this is helpful is that it turns this function from Base like this into this.
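As an illustration only (not the actual linked code), the split could look roughly like the following, with `allocate_memory` and `initialize` as the hypothetical pieces from earlier in the thread:

# Base-style: the constructor (`similar`) is baked into the algorithm.
function rotr90(A::AbstractMatrix)
    ind1, ind2 = axes(A)
    B = similar(A, (ind2, ind1))
    m = first(ind1) + last(ind1)
    for i in ind1, j in ind2
        B[j, m - i] = A[i, j]
    end
    return B
end

# Split version: allocation and final wrapping are pluggable; the loop itself is unchanged.
function rotr90_split(A::AbstractMatrix)
    ind1, ind2 = axes(A)
    buffer = allocate_memory(A, (ind2, ind1))   # hypothetical: pick the backing memory per array type
    m = first(ind1) + last(ind1)
    for i in ind1, j in ind2
        buffer[j, m - i] = A[i, j]
    end
    return initialize(buffer, A)                # hypothetical: wrap/convert into the final array type
end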
This means that new array types typically wouldn't need to change `rotr90`, but would just change their allocators and initializers. I'm not super familiar with how jagged arrays work, but we could have `allocate_memory` take in device and memory layout info for this.