JuliaArrays / ArrayInterface.jl

Designs for new Base array interface primitives, used widely through scientific machine learning (SciML) and other organizations
https://docs.sciml.ai/ArrayInterface/stable/
MIT License
133 stars 36 forks source link

Boilerplate metrics #235

Open cscherrer opened 2 years ago

cscherrer commented 2 years ago

As I understand, the main goal of this is to easily express algorithms that can use either standard dispatch or generated functions, depending on what's known statically. Writing very general high-performance code for a given situation is always possible; the problem is that it can sometimes require lots of boilerplate.

Would it be helpful to make this explicit? A given proposal could be weighed based on its effect on reducing boilerplate for high-performance code that works for both static and dynamic cases. Maybe there could even be a small collection of running examples. Complete code isn't necessary for this, what matters is dispatch patterns and efficiently (in terms of human time) passing the right information for generated functions.

Tokazama commented 2 years ago

I'll try to respond in more detail later today when I have time, but for now I'd say the biggest issue is that we are missing basic tools for memory management that would allow us to generically write even simple methods

Tokazama commented 2 years ago

I'm going to start with a simple method that should help illustrate some of the problems I've run into.


using BenchmarkTools

memory_reference(A) = _memory_reference(A, parent(A))
_memory_reference(::A, b::A) where {A} = b
_memory_reference(::A, b::B) where {A,B} = memory_reference(b)

@inline function fast_copy(src::AbstractArray{T,3}) where {T}
    buf = memory_reference(src)
    stride_1, stride_2, stride_3 = strides(src)
    size_1, size_2, size_3 = size(src)
    dest = Array{T}(undef, size(src))
    dest_index = 1
    i3 = 0
    @inbounds while i3 < (size_3 * stride_3)
        i2 = 0
        while i2 < (size_2 * stride_2)
            i1 = 0
            while i1 < (size_1 * stride_1)
                dest[dest_index] = buf[i1 + i2 + i3 + 1]
                dest_index += 1
                i1 += stride_1
            end
            i2 += stride_2
        end
        i3 += stride_3
    end
    return dest
end

x = PermutedDimsArray(rand(3, 3, 3), (3, 1, 2));

@btime copy($x);  # 99.267 ns (1 allocation: 272 bytes)
@btime fast_copy($x); # 48.823 ns (1 allocation: 272 bytes)

Although fast_copy looks better than Base.copy, we have a few problems:

We really need all of these components in place to provide concise solutions to most problems. That doesn't mean that we can't do some really cool stuff with what's here right now. It's not all that complicated to implement a version of fast_copy for deeply nested static array that returns another static array. It just requires a relatively verbose generated function.

Tokazama commented 2 years ago

I've started putting together a more formal testing grounds for some related ideas I've been toying with here in case anyone is interested (well, more formal than intangible ideas).