ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: IQ3_M is significantly slower than IQ4_XS on AMD, is it expected? #9644

Open Nekotekina opened 2 months ago

Nekotekina commented 2 months ago

What happened?

Model: https://huggingface.co/bartowski/gemma-2-27b-it-GGUF

AMD GPU: RX 7600 XT + RX 7600 (full offload)

With IQ3_M I get about 10 t/s, while IQ4_XS gets nearly 15 t/s. I expected the smaller model to run faster because it needs less memory bandwidth, and both are IQ quants.
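To make that expectation concrete, here is a rough sketch of the reasoning. It assumes token generation is purely memory-bandwidth bound, so speed scales inversely with the bytes read per token. The bits-per-weight figures are approximate nominal values for these quant types and are an assumption, not measured from the GGUF files above; the function names are only illustrative.

```python
def bandwidth_bound_ratio(bpw_small: float, bpw_large: float) -> float:
    """Expected t/s ratio (small quant / large quant) if decoding were
    limited only by how many bytes of weights are read per token."""
    return bpw_large / bpw_small


def observed_ratio(tps_small: float, tps_large: float) -> float:
    """Measured t/s ratio (small quant / large quant)."""
    return tps_small / tps_large


# Assumed nominal bits per weight (approximate, not taken from this report).
BPW_IQ3_M = 3.66
BPW_IQ4_XS = 4.25

# Throughput reported above for the two AMD GPUs (tokens per second).
TPS_IQ3_M_AMD = 10.0
TPS_IQ4_XS_AMD = 15.0

expected = bandwidth_bound_ratio(BPW_IQ3_M, BPW_IQ4_XS)
observed = observed_ratio(TPS_IQ3_M_AMD, TPS_IQ4_XS_AMD)

print(f"expected IQ3_M / IQ4_XS speed ratio if bandwidth bound: {expected:.2f}")
print(f"observed IQ3_M / IQ4_XS speed ratio on AMD:             {observed:.2f}")
```

Under these assumptions the expected ratio is above 1 (IQ3_M faster), while the observed ratio is below 1, which suggests something other than memory bandwidth is limiting IQ3_M on this setup.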

Name and Version

version: 3827 (7691654c) built with Ubuntu clang version 14.0.6-1~kisak1~j for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Nekotekina commented 2 months ago

Oops, I'll retest on the master branch.

Nekotekina commented 2 months ago

Retested with the latest version; same result.

BrickBee commented 2 months ago

Potentially related to issue 8760, which also mentions the difference between (IQ1, IQ2, IQ3) and (IQ4 / K).

Nekotekina commented 1 month ago

On NVIDIA (3090), IQ3_M is faster than IQ4_XS (~40 t/s versus ~35 t/s).

grapevine-AI commented 2 weeks ago

However, on 1x NVIDIA 3090 (with DDR4 offload), IQ3_S and IQ3_M are slower than IQ4_XS (about 0.5x the speed). It seems that only NVIDIA GPUs can handle IQ3 at high speed.