Registers remapping in matrix multiplication

AndrewBoWen666 commented 3 years ago

Hi, I'm studying the paper "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking". I am following the first part of the paper, which covers register remapping. In the paper, the indices of the registers that hold the tiles of matrices A and B are interleaved, as shown in the following picture. According to the authors, that register mapping was generated by NVCC 9.0.

When I use CuAssembler with NVCC 10.2, the generated register mapping is as follows. All the registers used for tiles A and B have even indices, which means all the registers used to store tile C should have odd indices (to avoid register bank conflicts in the FFMA instructions). But, as detailed in the second picture, the registers for C are a mix of odd and even indices, which indicates there are not enough odd registers to prevent bank conflicts. I tried two approaches to solve this. First, I tried using some odd registers for tile A and B storage, instead of changing tile C; however, this results in a CUDA illegal instruction error (CUDA error at matrixMul.cu:144 code=715(cudaErrorIllegalInstruction)) when I run the app. Then I tried changing some even register indices to odd ones, for instance R42, R40, R38 to R95, R97, R99. Unfortunately, this modification gives me the same error message. So I have three questions:

1. Is it possible to change the registers' attributes once the compiler has defined them? For example, from:

   [----:B------:R-:W-:-:S01] /0170/ CS2R R42, SRZ ;
   [----:B------:R-:W-:-:S02] /02f0/ MOV R93, RZ ;
   [----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
   [----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4];
   [----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;

   to:

   [----:B------:R-:W-:-:S01] /0170/ CS2R R42, SRZ ;
   [----:B------:R-:W-:-:S02] /02f0/ MOV R48, RZ ;
   [----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R93, [R2] ;
   [----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4];
   [----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R93, R50, R42;
2. Is it possible to use more registers than the number declared by the compiler? For instance, the maximum register index in my case is R93, but I would like to use R95.
3. If 1 and 2 are not possible, is there any other method that can be used to reduce bank conflicts in such a case?

Many thanks!

cloudcores commented 3 years ago

1. You can change the number of registers used in a kernel by modifying the regnum line in the section header, such as: .sectioninfo @"SHI_REGISTERS=32"

   NOTE: for architectures from Volta onward, 2 extra registers are occupied for an unknown purpose. For example, if the largest GPR index used in your kernel is R41, then you have used 42 GPRs, but you need to set SHI_REGISTERS to 42+2=44.
2. No. GPR resources are allocated statically; accessing a GPR beyond the allocated range gives "cudaErrorIllegalInstruction".
3. Register remapping is one way to avoid bank conflicts statically; another important way is to use the reuse cache. See the maxas wiki page (NOTE: the bank conflict pattern has changed for Volta/Turing/Ampere vs Maxwell/Pascal, but reuse still works):

https://github.com/NervanaSystems/maxas/wiki/SGEMM
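
To check a remapping by hand before reassembling, something like the following Python sketch may help. It is only an illustration under assumptions that are not stated in this thread: it assumes the Volta bank model described in the microbenchmarking paper (2 banks, bank = register index mod 2, conflict only when all three FFMA source GPRs share a bank), and the helper names `ffma_has_bank_conflict` and `required_shi_registers` are made up here, not part of CuAssembler.

```python
import re

# Assumptions (not from this thread): 2 register banks, bank = index % 2,
# and an FFMA conflicts only when all three source GPRs are in one bank.
NUM_BANKS = 2
EXTRA_REGISTERS = 2  # the 2 extra registers noted above for Volta and later

FFMA = re.compile(r"\bFFMA\S*\s+([^;]+)")
GPR = re.compile(r"\bR(\d+)\b")  # matches R42; skips RZ and constant operands


def ffma_sources(line):
    """Return the GPR index of each FFMA source operand (None if not a GPR)."""
    match = FFMA.search(line)
    if match is None:
        raise ValueError("no FFMA instruction in: " + line)
    operands = [op.strip() for op in match.group(1).split(",")]
    sources = []
    for op in operands[1:4]:  # operands[0] is the destination register
        found = GPR.search(op)
        sources.append(int(found.group(1)) if found else None)
    return sources


def ffma_has_bank_conflict(line):
    """True if all three sources are GPRs that map to the same bank."""
    sources = ffma_sources(line)
    if len(sources) != 3 or None in sources:
        return False
    return len({idx % NUM_BANKS for idx in sources}) == 1


def required_shi_registers(highest_gpr_index):
    """SHI_REGISTERS value: GPR count (highest index + 1) plus the extra 2."""
    return highest_gpr_index + 1 + EXTRA_REGISTERS


before = "[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;"
after = "[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R93, R50, R42;"
print(ffma_has_bank_conflict(before))  # True: R48, R50, R42 are all even
print(ffma_has_bank_conflict(after))   # False: R93 is odd
print(required_shi_registers(41))      # 44, the example from the note above
print(required_shi_registers(95))      # 98
```

Note that the highest index must account for registers that are written implicitly: 64-bit and 128-bit operands touch registers above the one named in the instruction.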

BTW: interleaved GPRs for A/B/C are generally not a good idea for matmul, because you cannot use the vector load feature, which needs contiguous GPRs to work (a 64-bit vector load needs 2 contiguous GPRs, a 128-bit load needs 4). That's why a pattern of at least 4 contiguous GPRs is usually found in MatMul implementations.

AndrewBoWen666 commented 3 years ago

@cloudcores Thanks for your reply. Regarding 'interleaved GPR for A/B/C', did you mean that, let's say, FFMA R41,R39,R40,R41 is better than FFMA R40, R38, R40, R43? Is the pattern of 4 contiguous GPRs something like FFMA R38,R39,R40,R41?

cloudcores commented 3 years ago

For most GEMM kernels, A/B/C elements are loaded from global or shared memory. If they are stored in interleaved GPRs, you cannot use vector load instructions such as:

LDG.E.128.SYS R4, [R4]; // 128bit global load, this will write to R4, R5, R6, R7
LDG.E.64.SYS R8, [R2] ; // 64bit global load, write to R8, R9

LDS.U.128 R12, [R2]; // 128bit shared load, write to R12, R13, R14, R15
LDS.U.64 R4, [R0] ;   // 64bit shared load, write to R4, R5

Vector loads need contiguous GPRs to work (and aligned ones: a 128-bit GPR group should start at index 0, 4, 8, ...). This means that with interleaved destination GPRs, a 128-bit load requires 4 separate 32-bit load instructions, and a 64-bit load requires 2. This will probably slow down the program due to more instructions and memory requests.
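
To make the instruction-count difference concrete, here is a rough Python sketch that greedily picks the widest load usable for a set of destination GPRs. It is only an illustration: `plan_loads` is a made-up helper, it assumes the listed registers receive consecutive, suitably aligned words in memory, and it assumes a 64-bit load needs an even starting index (the thread only states the 128-bit alignment rule).

```python
def plan_loads(dest_regs):
    """Greedily cover destination GPR indices with the widest aligned loads.

    Returns a list of (width_in_bits, first_register) tuples.
    Assumes the registers hold consecutive memory words in index order.
    """
    regs = sorted(set(dest_regs))
    present = set(regs)
    plan = []
    i = 0
    while i < len(regs):
        first = regs[i]
        # Try a 128-bit load (4 GPRs), then 64-bit (2 GPRs), then 32-bit (1 GPR).
        for count in (4, 2, 1):
            aligned = first % count == 0
            contiguous = all(first + k in present for k in range(count))
            if aligned and contiguous:
                plan.append((count * 32, first))
                i += count  # the next `count` sorted entries were just covered
                break
    return plan


print(plan_loads([4, 5, 6, 7]))    # [(128, 4)]
print(plan_loads([4, 6, 8, 10]))   # [(32, 4), (32, 6), (32, 8), (32, 10)]
print(plan_loads([8, 9]))          # [(64, 8)]
```

With the contiguous group R4 to R7, one 128-bit load is enough, while the interleaved destinations R4, R6, R8, R10 fall back to four 32-bit loads.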

A/B GPRs are usually grouped. For C, you sometimes need to shuffle before writing back for better coalescing, so the C GPRs may follow a different pattern.

AndrewBoWen666 commented 3 years ago

@cloudcores A great explanation! Many thanks.

cloudcores / CuAssembler

Registers remapping in matrix multiplication #3