TuzelKO opened this issue 1 month ago
Hi. GGUF kernels should theoretically work on AMD, but they're untested as I don't have regular access to AMD compute.
Multi-GPU should work fine on AMD. Tensor parallelism will split the model tensors evenly across GPUs; you simply need to launch the model with `--tensor-parallel-size X`, where X is the number of GPUs. I don't really recommend GGUF for this, because it doesn't seem to scale well at the moment. For AMD, you may want to use either GPTQ or FP8 W8A8 (through llm-compressor).
Hello! Having studied the documentation, I still could not work out whether GGUF quantized models are supported on AMD GPUs. I would like to use a Q8 or even Q4 quantization of Mistral NeMo 12B in my project, trading a little quality for generation speed. We are planning to build a server with 4-6 Radeon 7900 XTX graphics cards.
AMD's solutions look more attractive than Nvidia's in terms of performance per cost and performance per watt, especially for small startups.
I would also like to know whether it is possible to run one small model (for example, Mistral NeMo 12B) in parallel on several graphics cards. I do not mean splitting the model across cards, but running a full copy of the same model in the VRAM of each card. Or will I need to run a separate container per graphics card? A rough sketch of what I mean is below.
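Something like one independent vLLM process per card, each holding a full copy of the model (just a sketch; the device-binding variable, ports, and non-GGUF repo id are assumptions on my part):

```python
# Rough sketch: launch one vLLM server per GPU, each serving a full copy of the model.
# Device pinning via CUDA_VISIBLE_DEVICES is assumed to work the same way on the ROCm
# build; a load balancer in front would then spread user requests across the replicas.
import os
import subprocess

MODEL = "anthracite-org/magnum-v2-12b"  # assumed non-GGUF repo id
NUM_GPUS = 4

procs = []
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # pin this replica to a single card
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + gpu)],
        env=env,
    ))

for p in procs:
    p.wait()
```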
In our project we are considering the Magnum v2 12B model (https://huggingface.co/anthracite-org/magnum-v2-12b-gguf). We are currently running it through llama.cpp, but it does not seem to be well suited to handling parallel requests from multiple users.