Tokazama opened this issue 3 years ago
Memory management is pretty central to performance, so it'd be great if we could find a way to make optimizations easy.
As a brief aside, StrideArrays.jl is implemented as:

`PtrArray` -- a struct with two fields, (a) a `VectorizationBase.stridedpointer` and (b) a size tuple. Together these support the entire stride-layout interface; that is, a `PtrArray` can represent arbitrary `PermutedDimsArray`s/transposes, `OffsetArray`s, views, etc., and all of this can be statically known.

`StrideArray` -- a struct with two fields, (a) a `PtrArray` and (b) memory. If the `PtrArray` is statically sized, the memory defaults to:
mutable struct MemoryBuffer{L,T} <: DenseVector{T}
    data::NTuple{L,T}
    @inline function MemoryBuffer{L,T}(::UndefInitializer) where {L,T}
        @assert isbitstype(T) "Memory buffers must point to bits types, but `isbitstype($T) == false`."
        new{L,T}()
    end
end
Julia's compiler will stack allocate this if it doesn't escape. If dynamically sized, the memory is a `Vector`. If you take any sort of view of a `StrideArray` (e.g., `view`, `adjoint`, `permutedims`), it creates a new `PtrArray`, and then a new `StrideArray` holds that `PtrArray` and the same memory.

The memory field could be anything, so long as it's memory we can get a pointer to and it preferably plays well with Julia's compiler and GC. A `mutable struct` and a `Vector` seemed like good defaults for static and dynamic arrays, respectively.
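As a rough sketch of how such a buffer gets used (the `unsafe_convert` method and `fill_and_sum` below are my own illustration, not code from the package): getting a pointer requires `GC.@preserve`-ing the buffer, and as long as the buffer never escapes, the compiler is free to keep it on the stack.

# Illustration only: raw pointer access into a MemoryBuffer.
Base.unsafe_convert(::Type{Ptr{T}}, m::MemoryBuffer{L,T}) where {L,T} =
    convert(Ptr{T}, Base.pointer_from_objref(m))

function fill_and_sum()
    buf = MemoryBuffer{4,Float64}(undef)
    s = 0.0
    GC.@preserve buf begin                      # keep `buf` rooted while raw pointers are live
        p = Base.unsafe_convert(Ptr{Float64}, buf)
        for i in 1:4
            unsafe_store!(p, Float64(i), i)     # write through the pointer
        end
        for i in 1:4
            s += unsafe_load(p, i)              # read back
        end
    end
    return s                                    # `buf` never escapes, so it may stay on the stack
end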
Something that is needed here, however, is the ability to extract the memory. If you `GC.@preserve` a `StrideArray`, it will allocate, but not if you `GC.@preserve` the `MemoryBuffer` above. Similarly, `GC.@preserve` on a `view` or any other struct wrapping a `MemoryBuffer` will probably allocate.

Hence, to expressly support stack allocation of statically sized mutables, we need to make extracting the minimal memory object a part of the API.
Something else that I'd really like -- though I'd need to spend some time with AbstractInterpreter or other tools to see what it'd take to make an API convenient -- is the ability to swap out memory allocators contextually.

For example, maybe we have hot code where we know arrays don't escape, but the arrays are fairly large and/or get passed to lots of functions we don't want to `@inline`. It'd be great if, rather than relying on the compiler to stack allocate, we could substitute our own allocator. Maybe we can use knowledge specific to the problem to make it much more efficient, or just do something super simple -- and super fast -- like making our own stack:
const MYSTACK = Libc.malloc(1 << 30); # 1 GiB

function myfunction_using_custom_stack(args...)
    mystack = MYSTACK
    # every allocation in this function and the functions it calls uses `mystack`,
    # incrementing it by the amount of memory used
    ...
end
This would make writing "in place" code much easier. I don't really know how to make this easy, though.
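A hedged sketch of what such a hand-rolled stack could look like (all names here are made up for illustration, not a proposed API):

mutable struct BumpStack
    base::Ptr{UInt8}
    offset::Int
    len::Int
end
const MYBUMPSTACK = BumpStack(convert(Ptr{UInt8}, Libc.malloc(1 << 30)), 0, 1 << 30)  # 1 GiB backing buffer

# Carve `n` elements of type `T` out of the stack; "freeing" is just resetting `offset`.
function bump_alloc!(stack::BumpStack, ::Type{T}, n::Integer) where {T}
    nbytes = sizeof(T) * n
    stack.offset + nbytes <= stack.len || error("custom stack exhausted")
    p = convert(Ptr{T}, stack.base + stack.offset)
    stack.offset += nbytes
    return p
end

function myfunction_using_custom_stack(n)
    mark = MYBUMPSTACK.offset                  # remember where we started
    p = bump_alloc!(MYBUMPSTACK, Float64, n)   # temporary workspace, no GC allocation
    s = 0.0
    for i in 1:n
        unsafe_store!(p, Float64(i), i)
    end
    for i in 1:n
        s += unsafe_load(p, i)
    end
    MYBUMPSTACK.offset = mark                  # pop everything this function "allocated"
    return s
end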
It would also be nice to finally get `SArray` support in LoopVectorization. The most obvious way, of course, is to convert to `MArray`, but perhaps a general way of talking about doing that can come from this. It's also important to differentiate things like `SArray` vs. ranges, which `CPUIndex` doesn't do currently; we want to handle the two differently.
> Something that is needed here, however, is the ability to extract the memory. If you `GC.@preserve` a `StrideArray`, it will allocate, but not if you `GC.@preserve` the `MemoryBuffer` above. Similarly, `GC.@preserve` on a `view` or any other struct wrapping a `MemoryBuffer` will probably allocate.
I was trying to solve this with `preserve(f, data)`, which allows different garbage collection and pointer constructors per type and device; but I haven't tested the performance of any of this b/c I'm just trying to get a sane design at this point.
> It would also be nice to finally get `SArray` support in LoopVectorization.
This is actually the big thing preventing me from getting generic `StaticArray` and StridedArrays indexing support. I have some code that nicely goes back and forth between Cartesian and linear indices, but array reconstruction is just a mess right now.
I've been playing around with a generic set of buffers based off of code like `MemoryBuffer` too (I think I saw that particular one in Octavian.jl). I imagined the pointer produced by `preserve` would be a lot like `PtrArray`. I haven't thrown up any related PRs b/c I'm not sure if we want to directly create these buffers in ArrayInterface or not. I'm mostly on the fence because supporting anything other than a tuple of bits types is very complicated. I'd love to figure it out, but it looks pretty complicated and someone more competent than me should probably implement that.
> I was trying to solve this with `preserve(f, data)`, which allows different garbage collection and pointer constructors per type and device
Yeah, we could define specific overloads for `preserve`.
FWIW, VectorizationBase defines:
@inline preserve_buffer(A::AbstractArray) = A
@inline preserve_buffer(A::SubArray) = preserve_buffer(parent(A))
@inline preserve_buffer(A::PermutedDimsArray) = preserve_buffer(parent(A))
@inline preserve_buffer(A::Union{LinearAlgebra.Transpose,LinearAlgebra.Adjoint}) = preserve_buffer(parent(A))
@inline preserve_buffer(x) = x
which a few other libraries, including LoopVectorization and CheapThreads, use for this.

I'd suggest having `preserve` use `preserve_buffer`.
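For instance, a `preserve` built on top of it might look roughly like this (a sketch, not an existing method):

using VectorizationBase: preserve_buffer

# Sketch: `preserve` roots only the minimal memory-owning object while `f` runs.
@inline function preserve(f, A, args...)
    buf = preserve_buffer(A)       # e.g. the parent Vector under a view or transpose
    GC.@preserve buf begin
        f(A, args...)
    end
end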
> This is actually the big thing preventing me from getting generic `StaticArray` and StridedArrays indexing support. I have some code that nicely goes back and forth between Cartesian and linear indices, but array reconstruction is just a mess right now.
I've always been split on these two approaches:
Approach 1:

A # ::SArray
rA = Ref(A);
GC.@preserve rA begin
    p = Base.unsafe_convert(Ptr{eltype(A)}, rA)
    # can build a `VectorizationBase.StridedPointer` from this
end
Approach 2:

A # ::SArray
struct PretendPointer{....all the parameters from ArrayInterface....} <: VectorizationBase.AbstractStridedPointer
    data::D
    offset::Int
end
# can add methods to support the interface, e.g. `vload` mapping to `getindex`
"2." has the advantage of being much easier to implement, in particular because it can be ignorant of all the details about what the data it's wrapping is, and doesn't need me to change any of the current code calling stridedpointer
.
"1." has the advantage of probably being much better, but the compiler does a bad job of optimizing both of these. I'd think "1." would be easier to optimize, but it for some reason really likes to actually copy A
's memory instead of just using the base pointer, and I don't know why. There's probably a way to tell LLVM "you don't have to", but I don't know what that is yet.
"1." will also require a pattern of
sp, b = stridedpointer_and_preserve(A)
GC.@preserve b begin
...
end
where if A::SArray
, then b = Ref(A)
, and sp
is the pointer to b
.
A pattern like:

sp = stridedpointer(A)
b = preserve_buffer(A)
GC.@preserve b begin
    ...
end

won't work, since we need the `Ref` we converted to a `Ptr`; neither would calling `stridedpointer(b)`, because if `A` was any sort of wrapper like a `view` or `PermutedDimsArray`, we want the `stridedpointer` of `A`.
I think this is fine, but something to keep in mind.
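A hedged sketch of what such a combined function could look like (my guess at the shape, not code from any of these packages):

using StaticArrays: SArray
using VectorizationBase: stridedpointer, preserve_buffer

# Generic arrays: the usual strided pointer plus the object that owns the memory.
@inline stridedpointer_and_preserve(A::AbstractArray) = (stridedpointer(A), preserve_buffer(A))

# SArrays: wrap in a Ref so there is addressable memory, and point at that Ref.
@inline function stridedpointer_and_preserve(A::SArray)
    r = Ref(A)
    p = convert(Ptr{eltype(A)}, Base.unsafe_convert(Ptr{Cvoid}, r))
    return (p, r)   # the caller must `GC.@preserve` the returned Ref while using `p`
end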
Maybe I should already add these combined functions (meaning both `stridedpointer_preserver` and `groupedstridedpointer_preserve` [as an aside, I have a branch of LoopVectorization that is now using `ArrayInterface.indices` to recognize when strides of arrays match]), and recommend those as the preferred approach, since people should already be calling both.
Besides getting it so I can load from `SArray`s in `@avx` loops, it'd be good if the broadcast code could use this as well, so that `@avx` on broadcasts could create `SArray`s.
> (I think I saw that particular one in Octavian.jl).
Yeah, I copy pasted the above from Octavian.jl. StrideArrays.jl itself just has:
using Octavian
using Octavian: MemoryBuffer
Octavian uses it raw to allocate memory on the stack, without actually wrapping it in an array.
We could probably look to `StaticArrays.MArray` for adding support for non-`isbits` types, and maybe the check should be `Base.allocatedinline` instead.
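For reference, a minimal sketch of that relaxed check, reusing the `MemoryBuffer` definition quoted above with only the assertion changed:

mutable struct MemoryBuffer{L,T} <: DenseVector{T}
    data::NTuple{L,T}
    @inline function MemoryBuffer{L,T}(::UndefInitializer) where {L,T}
        # `Base.allocatedinline` accepts any type Julia stores inline, not just isbits types.
        @assert Base.allocatedinline(T) "Memory buffers must hold inline-stored types, but `Base.allocatedinline($T) == false`."
        new{L,T}()
    end
end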
> I imagined the pointer produced by `preserve` would be a lot like `PtrArray`. I haven't thrown up any related PRs b/c I'm not sure if we want to directly create these buffers in ArrayInterface or not.
I'm not sure either. Octavian probably isn't the best place, because that'd be a heavy dependency to take on. LoopVectorization itself may also want to start using it eventually.
CheapThreads.jl also wants `PtrArray`, so the code for `PtrArray` is in StrideArraysCore.jl. So I do think it'd be nice to have some of the frequently used definitions in a fairly base package.
But I think we should also be cautious about turning ArrayInterface into a kitchen sink. We already have one issue about it being too heavy to take on as a dependency.
On non-`isbits` types, I agree that it's a low priority. We can just back them with regular `Vector`s.
I think we could make something agnostic to approach 1 vs. 2, so that when someone wants to go crazy with optimization they can dig in. I was thinking the first level of interaction for a lot of users would be calling `instantiate(f, A)` or something like you find with `ntuple` or `map`:

instantiate(A) do
    ...do stuff to a pointer/pseudo-pointer...
end

Internally, buffer allocation and pointer constructors would just have sensible defaults that can then be changed. I'm still thinking through how to make it both composable and flexible, though. I'll hack out some more ideas.
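For illustration, a default `instantiate` along those lines might look roughly like this (`allocate_memory`, `pseudo_pointer`, and `initialize` are hypothetical placeholders, not an existing API):

# Sketch of a possible default: allocate a matching buffer, root it, hand the
# do-block something pointer-like, then wrap the filled buffer in the final type.
function instantiate(f, A::AbstractArray)
    buffer = allocate_memory(A)        # hypothetical: buffer matching A's eltype and size
    GC.@preserve A buffer begin
        f(pseudo_pointer(buffer))      # "...do stuff to a pointer/pseudo-pointer..."
    end
    return initialize(buffer)          # hypothetical: wrap the buffer in the final array type
end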
It might be worth having something like CPUBuffers.jl for concrete implementations of buffers and then having some of the interface stick around here. I'm not sure if that means we'll just define a function head here (e.g., `function strided_pointer end`). Some of this is pretty difficult to discuss without concrete examples. For example, I had some code that approached the `SArray` thing by creating a buffer similar to `MemoryBuffer`, converting it to a strided pointer, and then using `initialize` to dereference the buffer and wrap it in the `SArray` struct.

For a first step to start experimenting with this and checking performance, it'd be nice to have a way to distinguish when we should take approach 1 vs. 2, which is what #131 is meant to indicate. Then I can go ahead and add support so it's easy to run benchmarks and look at codegen.
What is `instantiate` passing to `allocate_memory` here? And inside the anonymous function, what are `a` vs `A`? Should it only pass the temporary that eventually gets converted into the final type (e.g., mutable -> `SArray`)?
function rotr90(A)
    ind1, ind2 = axes(A)
    return instantiate(A, x -> allocate_memory(x, (ind2, ind1))) do a, b
        m = first(ind1) + last(ind1)
        for i = ind1, j = axes(A, 2)
            b[j, m - i] = a[i, j]
        end
    end
end
Here it looks like `a === A`, and that the function gets called once instead of `eachindex(A)` times like it would in `map`.

Being able to specify the actual looping pattern is important for something like `rotr90`, and harder to specify with `map`-like APIs.
> It might be worth having something like CPUBuffers.jl for concrete implementations of buffers and then having some of the interface stick around here.
I do think a combination of `MemoryBuffer` (for when we want to allocate a fixed amount of possibly stack-allocated memory) and `Base.Ref` (for when we want to convert an `isbits` struct into something we can get a pointer to) covers most use cases. But just making this a small library is fine.
We could probably add `strided_pointer` (to replace `VectorizationBase.stridedpointer`) and the current constructors there, leaving all the indexing code inside VectorizationBase since there are several thousand lines of it. Then people could define specialized methods for the types that need them. Maybe this is also related to something like adding support for Tropical numbers, which it would be good to specify a stable API for, to make that easier and not at risk of random breakage.
Should we ping anyone from StaticArrays to ask what they think about the memory allocation aspect, in terms of converting between mutable and static representations?
> What is `instantiate` passing to `allocate_memory` here? And inside the anonymous function, what are `a` vs `A`? Should it only pass the temporary that eventually gets converted into the final type (e.g., mutable -> `SArray`)?
TBH, `allocate_memory` was mostly just a lazy placeholder for a default memory-allocating function. I think the following actually makes more sense as the fully defined `instantiate` method:
function instantiate(f::Function, allocator::Function, preserver::Function, initializer::Function, args...)
    buffer = allocator()
    preserver(f, buffer, args...)
    return initializer(buffer)
end
So `allocator` could be `() -> Libc.malloc(1 << 30)`, `() -> Ref{NTuple{3,Int}}()`, etc.
The `preserver` function would be responsible for doing something like this:

buffer_ptr = maybe_pointer(buffer)
data_ptr = maybe_pointer(data)
GC.@preserve buffer data begin
    f(buffer_ptr, data_ptr)
end
And `initializer` would be responsible for dereferencing, plus any other final wrapping or processing that needs to happen before returning to the user.
BTW, I'm not married to any of the names.
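To make that concrete, here's a hedged sketch of one possible set of defaults plugged into the `instantiate` above to produce an `SArray` (`default_preserver` and `fillbuf` are made-up names for illustration):

using StaticArrays: SArray

# Hypothetical preserver: root the buffer and hand `f` a raw pointer to it.
function default_preserver(f, buffer::Ref, args...)
    GC.@preserve buffer begin
        f(Base.unsafe_convert(Ptr{Cvoid}, buffer), args...)
    end
end

# Fill a 2×2 SArray through `instantiate` as defined above.
fillbuf = ptr -> begin
    p = convert(Ptr{Float64}, ptr)
    for i in 1:4
        unsafe_store!(p, Float64(i), i)   # write the elements in column-major order
    end
end

A = instantiate(fillbuf,
                () -> Ref{NTuple{4,Float64}}(),                 # allocator: isbits scratch space
                default_preserver,                              # preserver: GC.@preserve + raw pointer
                buf -> SArray{Tuple{2,2},Float64,2,4}(buf[]))   # initializer: dereference into the SArray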
I still need to put more thought into how we get from `instantiate(f, data)` to the fully defined function. I assumed that the allocating function would create a buffer of the same memory size as the original `data`. I'm not sure how much room we need to leave for swapping out garbage collectors, but if Julia's garbage collector is flexible enough to be aware of different devices, then the default `preserver` could just be `preserve(f, args...) = @gc_preserve f(args...)`.
> Here it looks like `a === A`, and that the function gets called once instead of `eachindex(A)` times like it would in `map`.
It doesn't create a generator like `map`. My example with `rotr90` was an exact copy from Base, just to illustrate how this approach could be more ergonomic than requiring something like

rotr90(x) = instantiate(_rotr90, x)
_rotr90(x_ptr) = ...

I know it seems a bit silly to focus on avoiding that extra line of code with `_rotr90`, but people really hate writing an extra function (I think this is why SimpleTraits is so popular).
> I still need to put more thought into how we get from `instantiate(f, data)` to the fully defined function.
If we can get a size and eltype `T`, we could use that to pick `Vector{T}` if the size is dynamic, and `Ref{NTuple{N,T}}` if it's `StaticInt{N}`.
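A hedged sketch of that choice, dispatching on the length type (`allocate_memory` is the hypothetical helper from this thread, and `StaticInt` is assumed to come from ArrayInterface, where it lived at the time):

using ArrayInterface: StaticInt

# Static length: an isbits, stack-friendly buffer. Dynamic length: an ordinary heap Vector.
allocate_memory(::Type{T}, ::StaticInt{N}) where {T,N} = Ref{NTuple{N,T}}()
allocate_memory(::Type{T}, n::Int) where {T} = Vector{T}(undef, n)

# allocate_memory(Float64, StaticInt(4))  # -> Ref{NTuple{4,Float64}}
# allocate_memory(Float64, 4)             # -> 4-element Vector{Float64}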
> I know it seems a bit silly to focus on avoiding that extra line of code with `_rotr90`, but people really hate writing an extra function (I think this is why SimpleTraits is so popular).
I really would encourage people to write in-place functions though, and then wrap the in-place version to make the convenient-but-allocating version.
I asked @maleadt on Slack about memory allocation on GPUs and he pointed me to the code here:

Concerning garbage collection, he said "...we use Julia's GC and in kernels we currently don't really support dynamic allocations."
This makes me think that we could rely on the `@gc_preserve` macro you have (which I like a lot more than anything I've come up with). Then we could have the default be `gc_preserve(f, args...) = @gc_preserve f(args...)`, which leaves room for changing this based on the `f` call.
> If we can get a size and eltype `T`, we could use that to pick `Vector{T}` if the size is dynamic, and `Ref{NTuple{N,T}}` if it's `StaticInt{N}`.

That's exactly what I was thinking. I don't think it would be unreasonable to expect users to do something like `allocate_memory(x::AbstractArray, size::Integer)`.
I think we are in a good place to start working on array constructors (it would also make documenting examples a whole lot easier). This brings up stuff related to `similar`:

I'm not convinced there's a silver bullet for array constructors, but I think we could at least find a solution to what `similar` is often trying to do: allow generic method definitions without worrying about array-specific constructors. I think we actually need to break up what `similar` does into several pieces, though. I haven't worked out all the details, but here's some of the cleaner code I have so far that might support this.

The reason I think separating things out like this is helpful is that it turns this function from Base like this into this.
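As an illustration only (not the actual linked code), the split could look roughly like the following, with `allocate_memory` and `initialize` as the hypothetical pieces from earlier in the thread:

# Base-style: the constructor (`similar`) is baked into the algorithm.
function rotr90(A::AbstractMatrix)
    ind1, ind2 = axes(A)
    B = similar(A, (ind2, ind1))
    m = first(ind1) + last(ind1)
    for i in ind1, j in ind2
        B[j, m - i] = A[i, j]
    end
    return B
end

# Split version: allocation and final wrapping are pluggable; the loop itself is unchanged.
function rotr90_split(A::AbstractMatrix)
    ind1, ind2 = axes(A)
    buffer = allocate_memory(A, (ind2, ind1))   # hypothetical: pick the backing memory per array type
    m = first(ind1) + last(ind1)
    for i in ind1, j in ind2
        buffer[j, m - i] = A[i, j]
    end
    return initialize(buffer, A)                # hypothetical: wrap/convert into the final array type
end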
This means that new array types typically wouldn't need to change `rotr90`, but would just change their allocators and initializers. I'm not super familiar with how jagged arrays work, but we could have `allocate_memory` take in device and memory layout info for this.