JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

An API for Cache Size(s) #6649

Open andrewcooke opened 10 years ago

andrewcooke commented 10 years ago

From #6517: "Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k." - @Jutho

"There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types." - @ViralBShah
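
For concreteness, a minimal sketch of what centralising this information could look like; the module name, constant, and blockdim helper below are hypothetical, not existing Base code:

# Hypothetical sketch (names made up): one constant plus one block-size helper
# that transpose!, generic_matmatmul and permutedims! could all share, instead
# of each assuming a 32 KiB L1 cache on its own.
module CacheParams

const L1_CACHE_BYTES = 1 << 15   # 32 KiB, the value currently hardwired in Base

# Side length of a square block of `elsz`-byte elements such that `nbuffers`
# such blocks fit into the assumed L1 cache.
blockdim(elsz::Integer, nbuffers::Integer = 2) =
    floor(Int, sqrt(L1_CACHE_BYTES / (nbuffers * elsz)))

end # module

A blocked kernel would then call something like CacheParams.blockdim(sizeof(Float64)) (which gives 45 under these assumptions) instead of hardwiring its own constant, and the constant itself could later be replaced by a build-time or run-time query.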

stevengj commented 10 years ago

I'd rather we just use cache-oblivious algorithms where possible. These generally lead to cleaner code than blocked algorithms, exploit multiple levels of cache, and often have comparable performance to straightforward blocking implementations (assuming they are implemented properly, i.e. you have a large enough base case).
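
For illustration, a minimal sketch of the idea for out-of-place transposition (not the code used in Base or FFTW; the base-case threshold of 32 is an arbitrary placeholder):

# Cache-oblivious transpose sketch: recursively split the larger dimension and
# copy directly once a block is small; the threshold only needs to be large
# enough to amortise the recursion overhead, not tied to any cache size.
function co_transpose!(B::AbstractMatrix, A::AbstractMatrix,
                       ai::Int = 1, aj::Int = 1,
                       m::Int = size(A, 1), n::Int = size(A, 2))
    if m <= 32 && n <= 32                       # base case: copy the block
        for j in aj:aj+n-1, i in ai:ai+m-1
            @inbounds B[j, i] = A[i, j]
        end
    elseif m >= n                               # split the longer dimension
        h = m >> 1
        co_transpose!(B, A, ai, aj, h, n)
        co_transpose!(B, A, ai + h, aj, m - h, n)
    else
        h = n >> 1
        co_transpose!(B, A, ai, aj, m, h)
        co_transpose!(B, A, ai, aj + h, m, n - h)
    end
    return B
end

Because the recursion keeps halving the larger dimension, the blocks eventually fit in every cache level without the algorithm ever knowing the cache sizes; e.g. co_transpose!(similar(A, size(A, 2), size(A, 1)), A) agrees with permutedims(A, (2, 1)).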

StefanKarpinski commented 10 years ago

Yes, I second that. Tuning code to cache sizes tends to be pretty brittle and only beats cache-oblivious algorithms if you truly go to town and optimize for a very particular architecture. It may be worth doing this in a BLAS, for example, but I think that if we can get close without the brittleness, it's a win.

Jutho commented 10 years ago

I actually first tried to implement the operations in TensorOperations.jl using cache-oblivious algorithms. While I was already able to get some speedup for my version of e.g. permutedims, there certainly was some overhead.

I used recursive algorithms because I did not see an easy way of implementing this iteratively. I also had to tweak things a little to get the recursive approach to work in combination with the Cartesian macros. So the term "properly implemented" probably did not apply to my code. I can give this a new try.

How would you choose the size of the base case? Using the size of the level-1 cache? :smile:

I also had some further questions regarding how the cache-friendly algorithms are currently approached in Julia Base. In e.g. Base.transpose! I see the following:

First, the value sqrthalfcache = 1<<7 is defined, which assumes a level-1 cache size of 32k (1<<15). The blocksize is then defined as ifloor(sqrthalfcache/elsz/1.4), meaning that the actual memory size of one block will be elsz * blocksize^2 = sqrthalfcache^2 / elsz / 2 = cachesize / elsz / 4 (assuming 1.4 stands for sqrt(2)). The total cache taken by this algorithm (if I need to count both the source block and the destination block) is 2 * elsz * blocksize^2 = cachesize / elsz / 2.

So there is definitely one division by elsz too many in this approach, and there seems to be an overall safety factor of 2, which seems like a lot (definitely much more than in generic_matmatmul). In the end, a non-blocking strategy for transpose! is used if the matrices are smaller than 4*blocksize^2, in which case the total data size handled by the algorithm (counting source and destination) is 2*cachesize/elsz, which would be double the size of the cache if it were not for the erroneous extra division by elsz.
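
For concreteness, plugging in elsz = 8 (Float64) reproduces these numbers; this is just a back-of-the-envelope check, with ifloor written as floor(Int, ...):

sqrthalfcache = 1 << 7              # hardwired in transpose!
cachesize = 2 * sqrthalfcache^2     # = 1 << 15, the assumed 32k L1 size
elsz = sizeof(Float64)              # 8 bytes

blocksize  = floor(Int, sqrthalfcache / elsz / 1.4)
blockbytes = elsz * blocksize^2     # memory touched by one block, per array

(blocksize, blockbytes, 2blockbytes / cachesize)
# (11, 968, 0.05908203125): source and destination blocks together touch only
# about 6% of the assumed cache, i.e. roughly cachesize / (2 * elsz) rather
# than the presumably intended cachesize / 2.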

I guess my main question is: in something like transpose or permutedims, do I only need to count the data that is read (the source array) as occupying the cache, or also the data that is simply written (the dest array)?

andrewcooke commented 10 years ago

i think there's a big difference between providing this information (which would be for external use as well as for julia) and mandating that all julia algorithms use it!

provide the best solution you can. if that doesn't use cache size, cool. but saying that this information "should not" be provided so that people are forced to use cache-oblivious algorithms is needlessly authoritarian. if cache-oblivious is best, it should be used whether this is here or not. this api can remain for other cases.

in other words - whether or not you use cache oblivious algorithms is completely orthogonal to this api.

StefanKarpinski commented 10 years ago

I guess that's a fair point. Sometimes cache-oblivious algorithms are not known or one can't be bothered to come up with one and just wants to know the cache size. If people are going to want to know that, we really ought to provide one good standard way of finding out, rather than forcing everyone to invent their own half-assed way of figuring out cache sizes.

ViralBShah commented 10 years ago

Yes, these are two different issues. We should certainly use cache-oblivious algorithms where possible, but also provide a standard way to figure out cache sizes.
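
As a rough illustration of what such a standard query could look like, here is a Linux-only sketch that reads the L1 data cache size from sysfs; the function name and fallback are made up, and other OSes would need their own code paths:

# Read the L1 data cache size from sysfs (Linux), falling back to the 32 KiB
# value that Base currently assumes when the information is unavailable.
function l1_dcache_size(default::Int = 1 << 15)
    base = "/sys/devices/system/cpu/cpu0/cache"
    isdir(base) || return default
    for entry in readdir(base)
        startswith(entry, "index") || continue
        dir = joinpath(base, entry)
        if readchomp(joinpath(dir, "level")) == "1" &&
           readchomp(joinpath(dir, "type")) == "Data"
            sz = readchomp(joinpath(dir, "size"))        # e.g. "32K"
            return endswith(sz, "K") ? 1024 * parse(Int, chop(sz)) : parse(Int, sz)
        end
    end
    return default
end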

stevengj commented 10 years ago

@Jutho, the size of the base case is chosen to be roughly the smallest size that is big enough to amortize the recursion overhead. This is usually much smaller than any realistic L1 cache.

(See, for example, how cache-oblivious transpositions are implemented in FFTW.)

eschnett commented 10 years ago

The hwloc library (http://www.open-mpi.org/projects/hwloc/) provides an API to identify cores, caches, and memory sizes, and is very portable. (Open MPI uses it.)

hwloc generates a tree that describes the hardware, and there is a C API to walk/query it. In Julia, this could be translated directly into a corresponding tree of Julia objects, which may be easier to examine from Julia code.
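
For example, walking such a translated tree could look roughly like this; the type_ field appears in the snippet further down in this thread, while the children field is an assumption about the wrapper's object layout:

# Recursively print the hardware tree, one node per line, indented by depth.
function print_topology(obj, indent::Int = 0)
    println(" "^indent, obj.type_)    # e.g. :Machine, :Cache, :Core, :PU
    for child in obj.children
        print_topology(child, indent + 2)
    end
end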

As an added bonus, hwloc also identifies PCI devices, accelerators, etc., depending on how it is configured.

StefanKarpinski commented 10 years ago

That would be really handy. It might make sense to just include it in Base Julia if we use it; the dynamic library it builds is only 172K on my system, which is pretty small compared to the rest of Julia.

tkelman commented 10 years ago

+1, never heard of hwloc before, but after looking at it for a few minutes it appears very useful, well-behaved, and straightforward to build. They are even nice enough to provide Win32 and Win64 MinGW-compiled binaries (they have a nice nickname for them: winballs), which would make packaging via BinDeps really easy. Wrapping their C API in Julia code would be a great project.

StefanKarpinski commented 10 years ago

I really like the term winballs :-)

eschnett commented 9 years ago

As mentioned above, hwloc is a library that finds out the local cache sizes. See https://github.com/eschnett/hwloc.jl, although the API for accessing cache sizes is still very low-level.

eschnett commented 9 years ago

This would be how to obtain the L1 data cache size via hwloc:

import hwloc

# load the hardware topology and pick out the attributes of the first
# depth-1 cache object
topology = hwloc.topology_load()
l1cache = first(filter(t -> t.type_ == :Cache && t.attr.depth == 1, topology)).attr

println("L1 cache information: $l1cache")

which on my machine prints:

L1 cache information: Cache{size=32768,depth=1,linesize=64,associativity=0,type=Data}
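
From there, assuming the attribute object exposes a size field (in bytes) matching the printed output, one could feed the detected value back into a blocked kernel instead of a hardwired 32 KiB, e.g.:

l1bytes   = l1cache.size                                       # 32768 here
blocksize = floor(Int, sqrt(l1bytes / (2 * sizeof(Float64))))  # -> 45 for two Float64 blocks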