MasonProtter / Bumper.jl

Bring Your Own Stack
MIT License

Performance for very small arrays #12

Closed mdmaas closed 10 months ago

mdmaas commented 10 months ago

Hi!

I'm testing various custom CPU array implementations in Julia, and comparing them with stack-allocated arrays and heap-allocated arrays in C.

https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8

The test first allocates the inverse squares of integers from 1 to N, and then performs the sum.

This is how it looks for Bumper.jl:

@inline function sumArray_bumper(N)
    @no_escape begin
        smallarray = alloc(Float64, N) 
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        sum = 0.0
        @turbo for i ∈ 1:N
            sum += smallarray[i]
        end
    end
    return sum
end

I am focusing on values of N ranging from 3 to 100; for larger N, most implementations converge to similar timings (about 10% overhead with respect to C). The exception is regular Julia arrays, which are generally slower and thus require much larger values of N before the allocation overhead is overshadowed by the actual use of memory.

My favourite method would be to use Bumper, as I think the API is great, but it is the slowest of the alternatives to standard arrays I'm considering (manually pre-allocating a standard array, MallocArrays from StaticTools, and calling malloc in C). Standard arrays are of course slower than Bumper.

Am I doing something wrong? Do you think there could be a way to remove this overhead, and approach the performance of for example, pre-allocated regular arrays?

Best,

MasonProtter commented 10 months ago

Hi @mdmaas, sorry for the slow reply. The most important thing you can do to speed it up is to explicitly pass in the buffer you're using. Here's an example:

julia> @inline function sumArray_bumper_2(N, buf=default_buffer())
           @no_escape buf begin
               smallarray = alloc(Float64, buf, N) 
               @turbo for i ∈ 1:N
                   smallarray[i] = 1.0 / i^2
               end
               sum = 0.0
               @turbo for i ∈ 1:N
                   sum += smallarray[i]
               end
           end
           return sum
       end;

julia> let N = Ref(20)
           @btime sumArray_bumper($N[]) # Original version
           @btime sumArray_bumper_2($N[]) # Version where the buffer is acquired at runtime but then re-used
           @btime sumArray_bumper_2($N[], $(default_buffer())) # Version where the buffer is passed in ahead of time
       end;
  31.478 ns (0 allocations: 0 bytes)
  23.941 ns (0 allocations: 0 bytes)
  16.604 ns (0 allocations: 0 bytes)

The reason for this is that in order to safely acquire the buffer, we use task-local storage, which has some overhead.
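For hot loops, one way to avoid that task-local lookup entirely is to construct a dedicated AllocBuffer once and thread it through explicitly. A minimal sketch, assuming the same alloc/@no_escape API as the snippets above (the buffer size in bytes is an arbitrary illustrative choice, and the hypothetical sum_inverse_squares function is not from the original code):

```julia
using Bumper  # assumes Bumper.jl is installed

# Create a dedicated bump allocator once, outside the hot loop.
# 2^16 bytes is an arbitrary illustrative size.
const mybuf = AllocBuffer(2^16)

function sum_inverse_squares(N, buf)
    s = 0.0
    @no_escape buf begin
        xs = alloc(Float64, buf, N)  # same alloc(T, buf, dims...) form as above
        for i in 1:N
            xs[i] = 1.0 / i^2
        end
        for i in 1:N
            s += xs[i]
        end
    end
    return s
end

sum_inverse_squares(100, mybuf)
```

Because mybuf is created once and passed as an argument, each call skips the default_buffer() lookup, matching the fastest variant in the timings above.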