[QST] `64x64` Layout

jeromeku commented 10 months ago

Thanks for the thorough investigation of high-performance GEMM kernels using Cutlass.

I can't quite grasp how layouts map from nested shapes and strides to an actual thread-matrix element assignment. In Listing 1 of your paper, you have:

using CLayout_64x64 = Layout<Shape <Shape <  _4,_8, _4>,Shape < _2,_2,  _8>>,
                             Stride<Stride<_128,_1,_16>,Stride<_64,_8,_512>>>;

I was hoping you could walk through how the stride sizes were derived, and moreover, how these shapes and strides can then be used to map to the layout as visualized in Figure 2?

Most helpful would be a simple calculation that shows how the (T, V) thread / value coordinate space is transformed into the (M, N) matrix coordinate space as parameterized by the shapes and strides.

Many thanks!

jayhshah commented 10 months ago

Sure, let's go through it. The Figure 2 from the paper on the structure of the WGMMA accumulator is taken directly from the PTX documentation (Figure 118 in section 9.7.14 there), and the layout is then defined so as to reproduce that figure in terms of its associated layout function. So let's try evaluating the layout function associated to the layout with

shape  = ((4,8,4),(2,2,8))
stride = ((128,1,16),(64,8,512))

In the paper we wrote about how a layout function can itself be defined as a composition of two functions: a 'column-major' traversal isomorphism determined by the shape tuple, and then a multilinear function determined by the stride tuple. In our case though, instead of a layout function from [0,128x32) to N, we should interpret it as from [0,128)x[0,32) to [0,64)x[0,64) as I will explain.

Firstly, the shape tuple transforms the two-dimensional (T,V) coordinate in [0,128)x[0,32) into a 6-dimensional coordinate according to the following formula:

(T,V) -> ((T%4, floor(T/4)%8, floor(T/32)%4), (V%2, floor(V/2)%2, floor(V/4)%8))

Here, you can view the inner tuples of length 3 in the shape tuple as defining the two functions transforming T and V respectively. (You could also start with a 1-dimensional coordinate in [0,128x32) and associate to it a 6-dimensional coordinate according to the flattened shape tuple (4,8,4,2,2,8).)
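To make the shape step concrete, here is a minimal Python sketch of the decomposition. It is plain arithmetic, not the CuTe API, and the helper names `decompose` and `tv_to_coord` are made up for illustration.

```python
def decompose(index, shape):
    """Split a 1-D index into mixed-radix digits, leftmost mode fastest."""
    digits = []
    for extent in shape:
        digits.append(index % extent)
        index //= extent
    return tuple(digits)


def tv_to_coord(t, v):
    """Map (T, V) in [0,128) x [0,32) to the nested coordinate ((a,b,c),(x,y,z))."""
    return decompose(t, (4, 8, 4)), decompose(v, (2, 2, 8))


print(tv_to_coord(1, 0))   # ((1, 0, 0), (0, 0, 0))
print(tv_to_coord(37, 7))  # ((1, 1, 1), (1, 1, 1))
```

The loop is just the `%` / `floor` formula written once for an arbitrary shape tuple: each mode takes the remainder by its extent and passes the quotient on to the next mode.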

Secondly, the stride tuple then defines the multilinear mapping

((a,b,c),(x,y,z)) -> 128a+b+16c+64x+8y+512z
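As a quick check of the stride step, the sketch below takes its coefficients directly from the stride tuple ((128,1,16),(64,8,512)), so the third thread coordinate c is weighted by 16. The names `T_STRIDE`, `V_STRIDE` and `apply_stride` are illustrative only and do not come from CuTe.

```python
T_STRIDE = (128, 1, 16)  # strides for the thread mode (a, b, c)
V_STRIDE = (64, 8, 512)  # strides for the value mode (x, y, z)


def apply_stride(coord):
    """Inner product of a nested coordinate ((a,b,c),(x,y,z)) with the strides."""
    t_coord, v_coord = coord
    t_part = sum(c * s for c, s in zip(t_coord, T_STRIDE))
    v_part = sum(c * s for c, s in zip(v_coord, V_STRIDE))
    return t_part + v_part


print(apply_stride(((1, 0, 0), (0, 0, 0))))  # 128
print(apply_stride(((0, 0, 1), (0, 0, 0))))  # 16
print(apply_stride(((1, 1, 1), (1, 1, 1))))  # 729
```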

Finally, the one-dimensional codomain coordinate becomes a 2-dimensional coordinate on the 64x64 tile by the formula

k -> (k%64,floor(k/64)%64)
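In other words, k is read as a column-major index into the 64x64 tile, k = M + 64*N. A small Python sketch of this step and its inverse (function names are hypothetical):

```python
def offset_to_mn(k):
    """Interpret k in [0, 4096) as a column-major index into a 64x64 tile."""
    n, m = divmod(k, 64)
    return m, n


def mn_to_offset(m, n):
    """Inverse direction: column-major offset of matrix element (m, n)."""
    return m + 64 * n


print(offset_to_mn(128))  # (0, 2)
print(offset_to_mn(16))   # (16, 0)
```

For k below 64*64 the extra `% 64` on the second component is a no-op, which is why `divmod` alone is enough here.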

In total, this defines the (T,V) -> (M,N) mapping. For example, we have

T1D0  = (1,0)  -> ((1,0,0),(0,0,0)) -> 128 -> (0,2)
T4D0  = (4,0)  -> ((0,1,0),(0,0,0)) -> 1   -> (1,0)
T32D0 = (32,0) -> ((0,0,1),(0,0,0)) -> 16  -> (16,0)
T0D1  = (0,1)  -> ((0,0,0),(1,0,0)) -> 64  -> (0,1)
T0D2  = (0,2)  -> ((0,0,0),(0,1,0)) -> 8   -> (8,0)
T0D4  = (0,4)  -> ((0,0,0),(0,0,1)) -> 512 -> (0,8)

and evaluation at all other coordinates can then be determined using linearity. Moreover, reversing the line of reasoning is how you arrive at the correct definition of the layout.
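If it helps, the whole (T,V) -> (M,N) chain can be put together in a few lines of plain Python. This is only a sketch of the arithmetic described above, not how CuTe evaluates layouts internally, and `mode_offset` and `tv_to_mn` are names invented for this example. It reprints the six basis cases, checks one combined coordinate via linearity, and confirms that the 128x32 (T,V) pairs cover every cell of the 64x64 tile exactly once.

```python
SHAPE = ((4, 8, 4), (2, 2, 8))
STRIDE = ((128, 1, 16), (64, 8, 512))


def mode_offset(index, shape, stride):
    """Decompose index by shape (leftmost fastest) and dot the digits with stride."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def tv_to_mn(t, v):
    """Map thread index t in [0,128) and value index v in [0,32) to (M, N)."""
    k = mode_offset(t, SHAPE[0], STRIDE[0]) + mode_offset(v, SHAPE[1], STRIDE[1])
    return k % 64, k // 64


# The six basis cases from the walk-through.
for t, v in [(1, 0), (4, 0), (32, 0), (0, 1), (0, 2), (0, 4)]:
    print(f"T{t}D{v} -> {tv_to_mn(t, v)}")

# Linearity: T5D3 combines T1, T4, D1 and D2, so
# (0,2) + (1,0) + (0,1) + (8,0) = (9,3).
assert tv_to_mn(5, 3) == (9, 3)

# Every (M, N) cell of the 64x64 tile is hit exactly once.
cells = {tv_to_mn(t, v) for t in range(128) for v in range(32)}
assert len(cells) == 64 * 64
```

Collecting terms gives M = floor(T/4)%8 + 8*(floor(V/2)%2) + 16*floor(T/32) and N = V%2 + 2*(T%4) + 8*floor(V/4), which can be compared against the accumulator figure row by row.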

I also suggest going through the CuTe documentation "0t_mma_atom.md" for more worked examples.

jeromeku commented 10 months ago

@jayhshah

Beautiful explanation. I read 0t_mma_atom.md but your walk-through really elucidated the matter.

Many thanks for the wonderful paper and follow-up.

ColfaxResearch / cutlass-kernels

[QST] `64x64` Layout #2