PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Usage]: GGUFed models on AMD GPUs #632

Open TuzelKO opened 1 month ago

TuzelKO commented 1 month ago

Hello! Having studied the documentation, I still could not figure out whether GGUF-quantized models are supported on AMD GPUs. I would like to use a Q8 or even Q4 quant of a Mistral NeMo 12B-based model in my project, trading a little quality for generation speed. We are planning to build a server with 4-6 Radeon 7900 XTX graphics cards.


AMD's solutions look more attractive than Nvidia's in terms of performance per cost and performance per watt, especially for small startups.

I would also like to know whether it is possible to run one small model (for example, Mistral NeMo 12B) in parallel on several graphics cards. I don't mean splitting the model across several cards, but running the same model fully loaded into the VRAM of each card. Or will I need to run a separate container for each graphics card?

In our project we are considering the Magnum v2 12B model (https://huggingface.co/anthracite-org/magnum-v2-12b-gguf). We are currently running it through llama.cpp, but it does not seem to be well suited to handling parallel requests from multiple users.

AlpinDale commented 1 month ago

Hi. GGUF kernels should theoretically work on AMD, but they're untested as I don't have regular access to AMD compute.

Multi-GPU should work fine on AMD. Tensor parallelism will split the model tensors evenly across GPUs; you simply need to launch the model with `--tensor-parallel-size X`, where X is the number of GPUs. I don't really recommend GGUF for this, because it doesn't seem to scale well at the moment. For AMD, you may want to use either GPTQ or FP8 W8A8 (through llm-compressor).
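For illustration, here is a minimal launch sketch. Only `--tensor-parallel-size` is confirmed in this thread; the OpenAI-compatible server entrypoint and the `--quantization` flag are assumptions based on the vLLM-style CLI, and the model name is a hypothetical GPTQ repo (not the GGUF repo linked above), so check them against the current Aphrodite docs:

```bash
# Sketch: serve one model split across 4 GPUs with tensor parallelism.
# Entrypoint, --quantization, and the model name are assumptions; only
# --tensor-parallel-size is confirmed in this thread.
python -m aphrodite.endpoints.openai.api_server \
  --model your-org/magnum-v2-12b-gptq \
  --quantization gptq \
  --tensor-parallel-size 4
```

With 4-6 cards, setting `--tensor-parallel-size` to the number of visible GPUs spreads the weights evenly, so each card only needs to hold its share of the model and KV cache rather than a full copy.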