ggerganov / llama.cpp

LLM inference in C/C++
MIT License

CUDA: fix MMQ writeback for int8 tensor cores #8100

Closed · JohannesGaessler closed this 4 days ago

JohannesGaessler commented 4 days ago

The logic that I implemented in https://github.com/ggerganov/llama.cpp/pull/8062 was not quite correct: I added an offset to a pointer but forgot that the out-of-bounds checks relative to that pointer then also need to be adjusted by the same offset. I assume this PR fixes https://github.com/ggerganov/llama.cpp/issues/8096 (needs confirmation).
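To illustrate the bug class (a minimal sketch with hypothetical names and shapes, not the actual MMQ writeback code), consider a kernel that writes through an offset pointer. If the pointer is advanced but the bound is not, writes can run past the end of the buffer by up to `offset` elements:

```cuda
// Sketch only: `dst` has `ne` valid elements; the kernel writes through
// a pointer offset into `dst`. Names here are illustrative, not llama.cpp code.
__global__ void writeback(float * dst, const float * src, const int ne, const int offset) {
    // Advance the destination pointer by an offset, as described in the PR.
    float * dst_off = dst + offset;

    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Buggy: the check still uses the bound of the *unshifted* pointer,
    // so dst_off[i] can touch dst[ne] .. dst[ne + offset - 1], out of bounds.
    // if (i < ne) { dst_off[i] = src[i]; }

    // Fixed: adjust the bound by the same offset applied to the pointer.
    if (i < ne - offset) {
        dst_off[i] = src[i];
    }
}
```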