ggerganov / llama.cpp

LLM inference in C/C++

CUDA: fix misaligned shared memory read #8123

Closed JohannesGaessler closed 3 days ago

JohannesGaessler commented 3 days ago

Fixes https://github.com/ggerganov/llama.cpp/issues/8117 . The problem is with the ldmatrix shared memory reads: half the threads hold a garbage address (because I copy-pasted the code and did not adapt it correctly). Those addresses are never supposed to be used. However, the documentation reads:

For .target sm_75 or below, all threads must contain valid addresses. Otherwise, the behavior is undefined. For .num = .x1 and .num = .x2, addresses contained in lower threads can be copied to higher threads to achieve the expected behavior.

So in a sense it's lucky that the bad addresses cause a crash on Turing instead of being silently ignored, because otherwise I might never have noticed this.
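For illustration only, here is a minimal sketch (not the actual llama.cpp code; the wrapper name and signature are made up) of the behavior the quoted documentation describes for an .x2 ldmatrix load: only the lower lanes supply meaningful row addresses, but on sm_75 every lane must still pass a valid shared memory address, which can be achieved by copying an address from a lower lane. The actual fix in this PR may differ; this only shows why garbage addresses in the upper lanes are a problem on Turing.

```cuda
// Minimal sketch, not the actual llama.cpp code. Requires sm_75 or newer.
// For .x2, only lanes 0-15 provide meaningful row addresses, but on sm_75
// every lane must still hold a *valid* shared memory address, so the upper
// lanes copy theirs from a lower lane instead of passing garbage.
#include <cuda_fp16.h>

static __device__ __forceinline__ void ldmatrix_x2(
        unsigned int & d0, unsigned int & d1, const half * row_ptr) {
    const int lane = threadIdx.x % 32;

    // Convert the generic pointer into a 32-bit shared memory address.
    unsigned int addr = (unsigned int) __cvta_generic_to_shared(row_ptr);

    // Lanes 16-31 reuse the (valid) address of lanes 0-15, as the PTX
    // documentation explicitly allows for .num = .x1 and .x2.
    addr = __shfl_sync(0xFFFFFFFF, addr, lane % 16);

    asm volatile("ldmatrix.sync.aligned.m8n8.x2.shared.b16 {%0, %1}, [%2];"
                 : "=r"(d0), "=r"(d1)
                 : "r"(addr));
}
```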