Fixes https://github.com/ggerganov/llama.cpp/issues/8117 . The problem is in the `ldmatrix` shared memory reads: half of the threads have a garbage address (due to me copy-pasting the code and not adapting it correctly). Those addresses are never supposed to be used. However, the documentation reads:
> For .target sm_75 or below, all threads must contain valid addresses. Otherwise, the behavior is undefined. For .num = .x1 and .num = .x2, addresses contained in lower threads can be copied to higher threads to achieve the expected behavior.
So in a sense it's lucky that on Turing the bad addresses cause a crash instead of being silently ignored, because otherwise I might never have noticed this.
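To illustrate the "copy lower-thread addresses to higher threads" advice from the quote, here is a minimal host-side sketch in plain C++ that models which row each lane of a warp could point at. It is not the code from this PR: it assumes the `.x2` variant (where only lanes 0 to 15 supply addresses that are actually read), 8x8 tiles of 16-bit elements (16 bytes per row), and the helper names `ldmatrix_x2_row`, `ldmatrix_x2_offset` and `all_lanes_in_bounds` are hypothetical.

```cpp
#include <cstdint>

// Model of a 32-lane warp issuing ldmatrix ... .x2 on an assumed 16-row tile.
constexpr int kWarpSize    = 32;
constexpr int kRowsX2      = 16; // .x2 consumes the addresses of lanes 0..15
constexpr uint32_t kRowBytes = 16; // one 8x8 b16 matrix row = 8 * 2 bytes

// Hypothetical helper: the tile row whose address a given lane supplies.
// Lanes 16..31 simply repeat the rows of lanes 0..15, so every lane holds
// a valid address even though only the lower half is used for the load.
constexpr int ldmatrix_x2_row(int lane) {
    return lane % kRowsX2;
}

// Hypothetical helper: byte offset of that row inside the shared-memory tile,
// given the distance in bytes between consecutive rows.
constexpr uint32_t ldmatrix_x2_offset(int lane, uint32_t row_stride_bytes) {
    return static_cast<uint32_t>(ldmatrix_x2_row(lane)) * row_stride_bytes;
}

// Check that all 32 lanes, not just the 16 that matter on newer targets,
// point at a full row inside a tile of tile_bytes bytes.
constexpr bool all_lanes_in_bounds(uint32_t row_stride_bytes, uint32_t tile_bytes) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (ldmatrix_x2_offset(lane, row_stride_bytes) + kRowBytes > tile_bytes) {
            return false;
        }
    }
    return true;
}
```

In a real kernel the same `lane % 16` style index would feed the shared-memory address passed to `ldmatrix`; the point of the sketch is only that the upper lanes get a harmless duplicate instead of whatever an unadapted index expression happens to produce.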