
`each_cache_line(A,R)` iteration for efficient multidimensional algorithms #11523

Open timholy opened 9 years ago

timholy commented 9 years ago

Working on AxisAlgorithms led me to wonder if we might benefit from yet more fancy iterators. Let's consider writing a matrix-multiplication algorithm that's conceptually equivalent to (A*B')', i.e., multiplication along rows of B. Obviously, A_mul_Bt does all of this except the final transpose, but take this as a simple prototype for a multidimensional generalization, e.g., multiplying along dimension 4 of a 6-dimensional array. In all of this, A is a matrix.

It seems that the way to write a tiled, high-performance algorithm would be something like this (here C is the output array):

for I2 in R2  # R2 is a CartesianRange covering the "trailing" dimensions of B, e.g., dimensions 5 and 6
    for tilek = 1:8:size(B, dim)   # dim is the multiplication dimension
        for tilei = 1:8:size(C, dim)
            for I1 in each_cache_line(B, R1)   # R1 is a CartesianRange covering the "leading" dimensions 1-3
                for k = tilek:min(tilek+7,size(B,dim)), i = tilei:min(tilei+7,size(C,dim))
                    ...

I haven't thought this through very deeply and the above could be flawed, but the main point is this: since the algorithm has to reuse the same values repeatedly, and those values are scattered over non-adjacent spots in memory, it makes sense to iterate over chunks that have the alignment and size of a cache line.
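
To make that concrete, here is a rough sketch (not a proposed implementation) of what a minimal `each_cache_line` could look like for a dense `Array`: it simply partitions the leading, memory-contiguous dimension of a region into runs whose length matches one cache line. The 64-byte line size, the use of the modern `CartesianIndices` where the discussion above says `CartesianRange`, and the fact that no real alignment to cache-line boundaries is attempted are all simplifications for illustration.

```julia
# Rough sketch only: split the leading (memory-contiguous) dimension of a
# region R of a dense Array into runs whose length matches one cache line,
# and yield one small block of CartesianIndex values per run.
# Assumes 64-byte cache lines and a bitstype element; aligning the runs to
# actual cache-line boundaries is skipped here.
function each_cache_line(A::Array{T,N}, R::CartesianIndices{N}) where {T,N}
    line = max(1, 64 ÷ sizeof(T))              # elements per (assumed) cache line
    lo, hi = Tuple(first(R)), Tuple(last(R))   # per-dimension bounds of the region
    rest = CartesianIndices(map(UnitRange, Base.tail(lo), Base.tail(hi)))
    return (CartesianIndices((i:min(i + line - 1, hi[1]),
                              map(j -> j:j, Tuple(J))...))
            for J in rest for i in lo[1]:line:hi[1])
end

# Usage: visit an array one cache-line-sized run at a time.
A = rand(Float32, 100, 8, 4)
for chunk in each_cache_line(A, CartesianIndices(A))
    s = 0.0f0
    for I in chunk        # every index in `chunk` lies in one run along dimension 1
        s += A[I]
    end
end
```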

There are numerous issues to tackle, for example how should this generalize to StridedArrays, etc. Tackling this issue would be an ambitious task, but one that might yield sizable performance benefits.

This might be fun for anyone who thinks we should implement BLAS operations in julia. Over in my unpublished KernelTools.jl package, I found that a well-tuned matmatmul, at least for Float32, could get within spitting distance of single-threaded BLAS computations (largely thanks to @simd). So it's not a crazy thing to think about.
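
To give a sense of the flavor (this is not the KernelTools.jl kernel), a bare-bones tiled matmul in plain Julia looks something like the sketch below: `C = A*B` with blocking over all three loops and `@simd`/`@inbounds` on the contiguous inner loop over rows. The tile size of 64 is an arbitrary placeholder.

```julia
# Bare-bones tiled matmul sketch: C = A*B for Float32 matrices, with tiling
# over all three loop dimensions and @simd on the contiguous inner loop
# (columns of A and C are contiguous in memory). Tile size `t` is a guess.
function tiled_matmul!(C::Matrix{Float32}, A::Matrix{Float32}, B::Matrix{Float32}; t::Int = 64)
    m, n = size(C); k = size(A, 2)
    @assert size(A, 1) == m && size(B, 1) == k && size(B, 2) == n
    fill!(C, 0)
    @inbounds for jt in 1:t:n, kt in 1:t:k, it in 1:t:m
        for j in jt:min(jt + t - 1, n), p in kt:min(kt + t - 1, k)
            b = B[p, j]
            @simd for i in it:min(it + t - 1, m)
                C[i, j] += A[i, p] * b
            end
        end
    end
    return C
end

# Quick sanity check against the built-in multiply.
A = rand(Float32, 300, 200); B = rand(Float32, 200, 250)
C = Matrix{Float32}(undef, 300, 250)
@assert tiled_matmul!(C, A, B) ≈ A * B
```

Getting from something like this to within a small factor of BLAS is then largely a matter of packing, register blocking, and tuning the tile sizes.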

CC @ArchRobison, @lindahua

timholy commented 9 years ago

Might also be useful for FFT, CC @stevengj

ScottPJones commented 9 years ago

I've noticed that the BLAS library is larger than all of Julia itself... some 55MB on my Mac. How fundamental is the BLAS library to core Julia operations? Watching all the compilation warnings while building BLAS (and realizing that it's yet another source language core to building Julia, one that I'd hoped to have completely left behind 40 years ago!), I think a pure-Julia BLAS would be the cat's pyjamas! (Not that I'd be able to contribute; people like you would need to do it.) BTW, why is your KernelTools.jl package unpublished? Keeping the good stuff to yourself? About this issue: I think cache-line-aware algorithms like you've proposed would be great for performance (at least, IMHO!)

timholy commented 9 years ago

> why is your KernelTools.jl package unpublished? Keeping the good stuff to yourself?

Well, it's here, so no. I just haven't published/registered it because it still needs a lot of work. And the killer application will be combining the code analysis for the @tile macro with multithreading, but (1) julia's multithreading is still unfinished, and (2) I'll need to heavily refactor @tile to have it create independently-callable functions. I could probably make something work with SharedArrays, but there's a big part of me that just wants to wait for multithreading.

I've advertised it as a JSoC project, but no takers so far (that I know of).

ScottPJones commented 9 years ago

I forgot/didn't know the "GitHub" meaning of published... thanks for pointing me to that... more interesting Julian code to peruse and learn from!

timholy commented 9 years ago

I'm probably using it in a "julia package sense" and maybe that's just my own private meaning :smile:.

Jutho commented 9 years ago

@timholy, I am certainly interested in how you were able to get a matrix multiplication in Julia that's close to single-threaded BLAS, as that would allow me to easily combine the matrix-multiplication microkernels with the tensor contraction routines in TensorOperations.jl. I am working on a newer version of the latter, but I won't have much time before the end of June.

I tried following this nice tutorial based on BLIS myself, but of course you don't get very far before needing something like SIMD types. I was also hoping that some automatic vectorization would kick in, but I was experimenting with Float64 and realized too late that this doesn't work (yet).
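
For reference, the heart of the BLIS design is a tiny register-blocked microkernel; below is a toy 2×2 version just to show the shape of it: a small block of C is kept in local variables while one strip of A and B streams through. A real kernel would use explicit SIMD vector types and packed panels of A and B, which is exactly where the missing pieces mentioned above come in; none of that is attempted here, and keeping i+1, j+1, and krange in bounds is the caller's responsibility.

```julia
# Toy 2x2 register-blocked microkernel in the BLIS spirit (illustrative only):
# accumulate a 2x2 block of C in local variables while streaming through one
# strip of A (rows i:i+1) and B (columns j:j+1) over the index range krange.
function microkernel_2x2!(C, A, B, i, j, krange)
    c11 = c12 = c21 = c22 = zero(eltype(C))
    @inbounds for p in krange
        a1, a2 = A[i, p], A[i+1, p]
        b1, b2 = B[p, j], B[p, j+1]
        c11 += a1 * b1; c12 += a1 * b2
        c21 += a2 * b1; c22 += a2 * b2
    end
    @inbounds begin
        C[i, j]   += c11; C[i, j+1]   += c12
        C[i+1, j] += c21; C[i+1, j+1] += c22
    end
    return C
end

# e.g. update the 2x2 block C[1:2, 3:4] using columns 1:6 of A and rows 1:6 of B:
# microkernel_2x2!(zeros(4, 4), rand(4, 6), rand(6, 4), 1, 3, 1:6)
```

Real microkernels are larger (8×4, 8×6, and so on, depending on the register file) and load the `a` values as SIMD vectors, but the structure is the same.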

timholy commented 9 years ago

It's been a long time since I played with it, but this section is probably relevant. I don't think those were the tiling settings, though; I seem to remember 256-by-8 or something. The parameters turn out to matter quite a lot.

timholy commented 9 years ago

(and by "spitting distance" I meant 2-3x or something, i.e., still not a trivial gap.)