ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Add native int8 pure CUDA core acceleration for Pascal series graphics cards (e.g. Tesla P40, Tesla P4) #9578

Open SakuraRK opened 1 week ago

SakuraRK commented 1 week ago

Feature Description

I have a Tesla P40 bought from a Chinese second-hand webstore for LLM inference, but I found that llama.cpp doesn't use CUDA int8 for dequantization/inference. The inference speed is the same as F16.

Motivation

Provide quantization acceleration for devices that natively support CUDA int8, i.e. inference acceleration for devices that do not have high-performance FP16/FP32 compute capability.

Possible Implementation

No response

JohannesGaessler commented 1 week ago

I don't know how you went about determining this, but the corresponding CUDA code in `ggml/src/ggml-cuda/mmq.cuh` and `ggml/src/ggml-cuda/mmvq.cu` absolutely does use the `__dp4a` instruction to take advantage of int8 arithmetic. The only circumstances in which this code would not be used is if you were to compile with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`.
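
For context, here is a minimal standalone sketch (not llama.cpp code) of the `__dp4a` intrinsic those kernels build on: on compute capability 6.1 and newer (Pascal cards such as the P40/P4) it computes a dot product of four packed int8 values and accumulates the result into a 32-bit integer in a single instruction. The kernel and variable names below are illustrative; compile with e.g. `nvcc -arch=sm_61 dp4a_demo.cu`.

```cuda
// Minimal sketch of the __dp4a int8 dot-product intrinsic (requires sm_61+).
// Names here are illustrative; this is not code from llama.cpp.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_demo(const int *a, const int *b, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] and b[i] each hold four signed 8-bit values packed into one int32.
        // __dp4a multiplies them pairwise and adds the sum to the third argument.
        out[i] = __dp4a(a[i], b[i], 0);
    }
}

int main() {
    // Pack the int8 vectors {1,2,3,4} and {5,6,7,8}; expected dot product = 70.
    int ha = (4 << 24) | (3 << 16) | (2 << 8) | 1;
    int hb = (8 << 24) | (7 << 16) | (6 << 8) | 5;
    int *da, *db, *dout, hout = 0;
    cudaMalloc(&da, sizeof(int));
    cudaMalloc(&db, sizeof(int));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);
    dp4a_demo<<<1, 1>>>(da, db, dout, 1);
    cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a result: %d\n", hout); // expect 70
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

If the binary you are running was built with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`, these int8 paths are skipped even though the hardware supports them.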

SakuraRK commented 6 days ago

> I don't know how you went about determining this, but the corresponding CUDA code in `ggml/src/ggml-cuda/mmq.cuh` and `ggml/src/ggml-cuda/mmvq.cu` absolutely does use the `__dp4a` instruction to take advantage of int8 arithmetic. The only circumstances in which this code would not be used is if you were to compile with `GGML_CUDA_FORCE_DMMV` or `GGML_CUDA_FORCE_CUBLAS`.

Got it, thanks. I now suspect that the author of the all-in-one pack didn't compile with that code path enabled, or that something else went wrong. I've been confused for a long time by another problem that arose during installation, which is why I've been using someone else's all-in-one pack.