Open · timholy opened this issue 9 years ago
Might also be useful for FFT, CC @stevengj
I've noticed that the BLAS library is larger than all of julia itself... some 55MB on my Mac. How fundamental is the BLAS library to core Julia operations? Watching all the compilation warnings while building BLAS (and realizing that it's yet another source language core to building Julia, one that I'd hoped to have completely left behind 40 years ago!), I think a pure-Julia BLAS would be the cat's pyjamas! (Not that I'd be able to contribute; people like you would need to do it.)

BTW, why is your KernelTools.jl package unpublished? Keeping the good stuff to yourself?

About this PR, I think cache-line-aware algorithms like you've proposed would be great for performance (at least, IMHO!)
> why is your KernelTools.jl package unpublished? Keeping the good stuff to yourself?
Well, it's here, so no. I just haven't published/registered it because it still needs a lot of work. And the killer application will be combining the code analysis for the `@tile` macro with multithreading, but (1) julia's multithreading is still unfinished, and (2) I'll need to heavily refactor `@tile` to have it create independently-callable functions. I could probably make something work with SharedArrays, but there's a big part of me that just wants to wait for multithreading.
I've advertised it as a JSoC project, but no takers so far (that I know of).
I forgot/didn't know the "GitHub" meaning of published... thanks for pointing me to that... more interesting Julian code to peruse and learn from!
I'm probably using it in a "julia package sense" and maybe that's just my own private meaning :smile:.
@timholy, I am certainly interested in how you were able to get a matrix multiplication in Julia that's close to single-threaded BLAS. I am interested in that as it would allow me to easily combine the microkernels of the matrix multiplication with the tensor contraction routines in TensorOperations.jl. I am working on a newer version of the latter but I won't have much time before the end of June.

I tried myself following this nice tutorial based on BLIS, but of course you don't get very far before needing something like SIMD types. I was also hoping that some automatic vectorization would kick in, but I was experimenting with `Float64` and realized too late that this doesn't work (yet).
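For concreteness, a BLIS-style register-blocked microkernel written with plain `@inbounds`/`@simd` loops might look roughly like the sketch below. The function name, the packed-panel layout, and the block shape are illustrative assumptions on my part, standing in for the explicit SIMD types the real microkernels use:

```julia
# Hypothetical BLIS-style microkernel: accumulate a small MR×NR block of C from
# packed panels of A (MR×K, contiguous down each column) and B (NR×K).
# Plain @inbounds/@simd loops stand in for explicit SIMD types and register blocking.
function microkernel!(Cblock, Apanel, Bpanel)
    MR, NR = size(Cblock)
    K = size(Apanel, 2)
    @inbounds for k in 1:K
        for j in 1:NR
            b = Bpanel[j, k]
            @simd for i in 1:MR              # unit stride down column k of Apanel
                Cblock[i, j] += Apanel[i, k] * b
            end
        end
    end
    return Cblock
end

# e.g. microkernel!(zeros(Float32, 8, 6), rand(Float32, 8, 256), rand(Float32, 6, 256))
```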
It's been a long time since I played with it, but this section is probably relevant. I don't think those were the tiling settings, though; I seem to remember 256-by-8 or something. The parameters turn out to matter quite a lot.
(and by "spitting distance" I meant 2-3x or something, i.e., still not a trivial gap.)
Working on AxisAlgorithms led me to wonder if we might benefit from yet more fancy iterators. Let's consider writing a matrix-multiplication algorithm that's conceptually equivalent to `(A*B')'`, i.e., multiplication along rows of `B`. Obviously, `A_mul_Bt` does all of this except the final transpose, but take this as a simple prototype for a multidimensional generalization, e.g., multiplying along dimension 4 of a 6-dimensional array. In all of this, `A` is a matrix.

It seems that the way to write a tiled, high-performance algorithm would be something like the sketch below (here `C` is the output array).

I haven't thought this through very deeply and the sketch could be flawed, but the main point is that since you have to repeatedly reuse the same values, which are scattered over various non-adjacent spots in memory, it makes sense to iterate over chunks that have the alignment and size of a cache line.
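A rough sketch of what such a tiled kernel might look like, computing `C[j,i] = sum_k A[i,k]*B[j,k]` (i.e. `(A*B')'`). The name `rowmul!` and the tile-size defaults here are illustrative guesses, not KernelTools code:

```julia
# Illustrative tiled kernel for C = (A*B')', i.e. C[j,i] = sum_k A[i,k]*B[j,k],
# multiplying along rows of B. The tile sizes are guesses; tilej would be chosen
# as a multiple of the cache-line length in elements.
function rowmul!(C, A, B; tilej=8, tilei=64, tilek=64)
    m, n = size(A)                 # A is m×n
    p = size(B, 1)                 # B is p×n, so C is p×m
    fill!(C, 0)
    @inbounds for kk in 1:tilek:n, ii in 1:tilei:m, jj in 1:tilej:p
        for i in ii:min(ii + tilei - 1, m), k in kk:min(kk + tilek - 1, n)
            a = A[i, k]
            @simd for j in jj:min(jj + tilej - 1, p)   # contiguous in both C and B
                C[j, i] += a * B[j, k]
            end
        end
    end
    return C
end
```

The `@simd` loop over `j` touches contiguous memory in both `C` and `B`, so choosing `tilej` as a multiple of the cache-line length keeps each chunk aligned and fully used.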
There are numerous issues to tackle, for example how this should generalize to StridedArrays, etc. Tackling this would be an ambitious task, but one that might yield sizable performance benefits.
This might be fun for anyone who thinks we should implement BLAS operations in julia. Over in my unpublished KernelTools.jl package, I found that a well-tuned matmatmul, at least for `Float32`, could get within spitting distance of single-threaded BLAS computations (largely thanks to `@simd`). So it's not a crazy thing to think about.

CC @ArchRobison, @lindahua
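For reference, one hypothetical way to set up that kind of single-threaded comparison, written for current Julia and reusing the `rowmul!` sketch above (not the original KernelTools.jl benchmark):

```julia
# Hypothetical single-threaded comparison (current Julia; rowmul! is the sketch
# above, not the original KernelTools.jl code).
using LinearAlgebra: BLAS
BLAS.set_num_threads(1)

A = rand(Float32, 1024, 1024)
B = rand(Float32, 1024, 1024)
C = zeros(Float32, size(B, 1), size(A, 1))

rowmul!(C, A, B)
@assert C ≈ (A * B')'              # correctness check against BLAS

@time rowmul!(C, A, B)             # pure-Julia tiled kernel
@time (A * B')'                    # BLAS gemm; the outer adjoint is a lazy wrapper
```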