Back-of-the-envelope estimate for performance (roofline plot).
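The roofline estimate above can be sketched in a few lines. The peak FLOP rate and memory bandwidth below are illustrative placeholders, not measurements of any specific GPU:

```python
# Back-of-the-envelope roofline model.
PEAK_GFLOPS = 10_000.0  # assumed peak compute throughput (GFLOP/s)
PEAK_BW_GBS = 1_000.0   # assumed peak memory bandwidth (GB/s)

def attainable_gflops(arithmetic_intensity):
    """Roofline: performance is bounded by compute or by bandwidth."""
    return min(PEAK_GFLOPS, PEAK_BW_GBS * arithmetic_intensity)

# Ridge point: arithmetic intensity (FLOP/byte) where the roof flattens.
ridge_point = PEAK_GFLOPS / PEAK_BW_GBS

print(ridge_point)             # 10.0 FLOP/byte
print(attainable_gflops(1.0))  # bandwidth-bound: 1000.0 GFLOP/s
print(attainable_gflops(50.0)) # compute-bound: 10000.0 GFLOP/s
```

Kernels whose arithmetic intensity sits left of the ridge point are bandwidth-bound, which is the motivation for the optimizations listed below.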
Decrease the amount of shared memory used per kernel (note that some amount of shared memory is reserved by the CUDA runtime). Useful when shared-memory usage limits occupancy, or when the kernel is limited by the number of operations.
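A minimal sketch of how per-block shared memory caps the number of resident blocks per SM; the hardware numbers here are illustrative assumptions, not values queried from a device:

```python
# Sketch: shared memory per block limiting occupancy (assumed hardware limits).
SMEM_PER_SM = 96 * 1024   # assumed shared memory available per SM (bytes)
MAX_BLOCKS_PER_SM = 32    # assumed hardware limit on resident blocks per SM

def blocks_limited_by_smem(smem_per_block):
    """Resident blocks per SM as limited by shared-memory usage alone."""
    if smem_per_block == 0:
        return MAX_BLOCKS_PER_SM
    return min(MAX_BLOCKS_PER_SM, SMEM_PER_SM // smem_per_block)

print(blocks_limited_by_smem(48 * 1024))  # 2 blocks: heavy shared-memory use
print(blocks_limited_by_smem(8 * 1024))   # 12 blocks: lighter use
```

In practice registers and thread counts impose their own limits too; this only models the shared-memory term.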
Increase the arithmetic intensity by reducing the number of concurrent loads and stores. This can be done by computing more than one C element per thread.
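A simplified model of why computing more C elements per thread raises arithmetic intensity: a thread computing a TxT tile of C does T^2 times the FLOPs while loading only T rows of A and T columns of B. This counts only global loads of fp32 A and B operands and ignores caching:

```python
# Sketch: arithmetic intensity (FLOP/byte) of computing a TxT tile of C
# per thread, for C = A @ B with inner dimension K (fp32, loads only).
def arithmetic_intensity(K, T):
    flops = 2 * K * T * T           # one FMA = 2 FLOPs per output element
    bytes_loaded = 4 * (2 * T * K)  # T rows of A and T columns of B, 4 B each
    return flops / bytes_loaded

print(arithmetic_intensity(1024, 1))  # 0.25 FLOP/byte (one C element/thread)
print(arithmetic_intensity(1024, 4))  # 1.0 FLOP/byte (4x4 C elements/thread)
```

The intensity grows linearly in T, independent of K, which is the reuse the note is pointing at.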
The GPU supports vectorized memory access. Instead of 32-bit load/store instructions (e.g. LDG.E.32) one should aim for 128-bit instructions (LDG.E.128) when loading/storing data. Verifying this may require inspecting the SASS code.
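The payoff is purely an instruction-count and issue-bandwidth effect; a trivial sketch of the count (the instruction names above are the only CUDA specifics, this is just arithmetic):

```python
# Sketch: a 32-bit load moves one float, a 128-bit load moves four,
# so vectorizing cuts the number of load instructions by 4x.
n_floats = 1 << 20              # illustrative transfer size
loads_32bit = n_floats          # one 32-bit load per float
loads_128bit = n_floats // 4    # one 128-bit load per float4

print(loads_32bit // loads_128bit)  # 4x fewer load instructions
```

Vectorized accesses require the pointers to be suitably aligned (16 bytes for 128-bit accesses).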
Scope: create an exercise based on matrix-matrix multiplication.
Optimization guide for CUDA:
List of optimizations: