ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Add native int8 pure CUDA core acceleration for Pascal series graphics cards (e.g. Tesla P40, Tesla P4) #9578

Open SakuraRK opened 1 week ago

SakuraRK commented 1 week ago

Feature Description

I have a Tesla P40 bought from a Chinese second-hand webstore for LLM inference, but I found that llama.cpp doesn't use CUDA int8 for dequantization/inference. The inference speed is the same as F16.

Motivation

Provide quantization acceleration for devices that natively support CUDA int8, i.e. inference acceleration for devices that do not have high-performance FP16/FP32 compute capability.

Possible Implementation

No response

JohannesGaessler commented 1 week ago

I don't know how you went about determining this, but the corresponding CUDA code in `ggml/src/ggml-cuda/mmq.cuh` and `ggml/src/ggml-cuda/mmvq.cu` absolutely does use the `__dp4a` instruction to take advantage of int8 arithmetic. The only circumstances in which this code would not be used is if you were to compile with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`.
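
For context, here is a minimal standalone sketch (not llama.cpp code) of the `__dp4a` intrinsic those kernels build on: on compute capability 6.1 and newer (Pascal cards such as the P40/P4) it computes a dot product of four packed int8 values and accumulates the result into a 32-bit integer in a single instruction. The kernel and variable names below are illustrative; compile with e.g. `nvcc -arch=sm_61 dp4a_demo.cu`.

```cuda
// Minimal sketch of the __dp4a int8 dot-product intrinsic (requires sm_61+).
// Names here are illustrative; this is not code from llama.cpp.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_demo(const int *a, const int *b, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] and b[i] each hold four signed 8-bit values packed into one int32.
        // __dp4a multiplies them pairwise and adds the sum to the third argument.
        out[i] = __dp4a(a[i], b[i], 0);
    }
}

int main() {
    // Pack the int8 vectors {1,2,3,4} and {5,6,7,8}; expected dot product = 70.
    int ha = (4 << 24) | (3 << 16) | (2 << 8) | 1;
    int hb = (8 << 24) | (7 << 16) | (6 << 8) | 5;
    int *da, *db, *dout, hout = 0;
    cudaMalloc(&da, sizeof(int));
    cudaMalloc(&db, sizeof(int));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);
    dp4a_demo<<<1, 1>>>(da, db, dout, 1);
    cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a result: %d\n", hout); // expect 70
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

If the binary you are running was built with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`, these int8 paths are skipped even though the hardware supports them.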

SakuraRK commented 6 days ago

> I don't know how you went about determining this, but the corresponding CUDA code in `ggml/src/ggml-cuda/mmq.cuh` and `ggml/src/ggml-cuda/mmvq.cu` absolutely does use the `__dp4a` instruction to take advantage of int8 arithmetic. The only circumstances in which this code would not be used is if you were to compile with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`.

Got it, thanks. I now suspect that the author of the all-in-one pack didn't compile with that code path enabled, or that something else went wrong. I've been confused for a long time by another problem that arose during installation, which is why I've been using someone else's all-in-one pack.