PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Usage]: GGUFed models on AMD GPUs #632

Open TuzelKO opened 1 month ago

TuzelKO commented 1 month ago

Hello! Having studied the documentation, I still could not figure out whether GGUF-quantized models are supported on AMD GPUs. I would like to use a Q8 or even Q4 quant of a Mistral NeMo 12B-based model in my project, trading a little quality for generation speed. We are planning to build a server with 4-6 Radeon 7900 XTX graphics cards.


AMD's solutions look more attractive than Nvidia's in terms of performance per cost and performance per watt, especially for small startups.

I would also like to know whether it is possible to run one small model (for example, Mistral NeMo 12B) in parallel on several graphics cards. I don't mean splitting the model across several cards, but running the same model fully loaded into the VRAM of each card. Or will I need to run a separate container for each graphics card?

In our project we are considering the Magnum v2 12B model (https://huggingface.co/anthracite-org/magnum-v2-12b-gguf). We are currently running it through llama.cpp, but it does not seem to be well suited to handling parallel requests from multiple users.

AlpinDale commented 1 month ago

Hi. GGUF kernels should theoretically work on AMD, but they're untested as I don't have regular access to AMD compute.

Multi-GPU should work fine on AMD. Tensor parallelism will split the model tensors evenly across GPUs; you simply need to launch the model with `--tensor-parallel-size X`, where X is the number of GPUs. I don't really recommend GGUF for this, because it doesn't seem to scale well at the moment. For AMD, you may want to use either GPTQ or FP8 W8A8 (through llm-compressor).
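For illustration, here is a minimal launch sketch. Only `--tensor-parallel-size` is confirmed in this thread; the OpenAI-compatible server entrypoint and the `--quantization` flag are assumptions based on the vLLM-style CLI, and the model name is a hypothetical GPTQ repo (not the GGUF repo linked above), so check them against the current Aphrodite docs:

```bash
# Sketch: serve one model split across 4 GPUs with tensor parallelism.
# Entrypoint, --quantization, and the model name are assumptions; only
# --tensor-parallel-size is confirmed in this thread.
python -m aphrodite.endpoints.openai.api_server \
  --model your-org/magnum-v2-12b-gptq \
  --quantization gptq \
  --tensor-parallel-size 4
```

With 4-6 cards, setting `--tensor-parallel-size` to the number of visible GPUs spreads the weights evenly, so each card only needs to hold its share of the model and KV cache rather than a full copy.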