ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: b3188 breaks row split mode for multiple GPUs #8801

Closed. m-arbaro closed this issue 1 week ago.

m-arbaro commented 1 month ago

What happened?

Since commit b3188, llama-cli produces incoherent output on a multi-GPU system with CUDA and row tensor splitting. Layer tensor splitting works correctly but is almost twice as slow. The GPUs are 3x Nvidia Tesla plus a 3090. All subsequent commits appear to be affected.
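For readers reproducing this, a minimal invocation sketch (the model path and prompt are placeholders; -sm selects the split mode, with layer being the default and row the mode that breaks):

    # row tensor split: incoherent output since b3188
    ./llama-cli -m ./model-Q8_0.gguf -p "Hello" -n 64 -ngl 99 -sm row
    # layer tensor split: correct output, but roughly half the speed here
    ./llama-cli -m ./model-Q8_0.gguf -p "Hello" -n 64 -ngl 99 -sm layer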

Name and Version

llama-cli version b3188 built on Debian 12.

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

JohannesGaessler commented 1 month ago

Which model and GPUs were you using? Do you get correct results with -b 512 -ub 512? Do you get correct results when compiling with GGML_CUDA_FORCE_CUBLAS?
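In concrete terms, the first check amounts to rerunning the failing command with the logical and physical batch sizes pinned (a sketch; the model path is a placeholder):

    ./llama-cli -m ./model-Q8_0.gguf -p "Hello" -n 64 -ngl 99 -sm row -b 512 -ub 512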

m-arbaro commented 1 month ago

Hello Johannes, thank you for your guidance. I use Tesla P40s.

Do you get correct results with -b 512 -ub 512? No.
Do you get correct results when compiling with GGML_CUDA_FORCE_CUBLAS? Yes, it works fine with this option. Thank you.
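For reference, a sketch of the workaround build, assuming a recent tree where the GGML_-prefixed CMake options apply:

    cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON
    cmake --build build --config Release -j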

JohannesGaessler commented 1 month ago

Which model are you using?

m-arbaro commented 1 month ago

Both Llama 3 Instruct and 3.1 Instruct (on the latest builds, which support it), Q8_0 quantization.

ewandel commented 1 month ago

I can confirm this observation. Meta-Llama-3.1-70B-Instruct-IQ2_M works fine without "row_split", but with "row_split" it only produces gibberish (in my case the output is just a repeating string of "////////////,////,///" and so on).

Model source: https://huggingface.co/lmstudio-community/Meta-Llama-3.1-70B-Instruct-GGUF/tree/main

System: Dual RTX 3090 setup, Windows, https://github.com/oobabooga/text-generation-webui, v.1.13

Settings screenshot below. [screenshot omitted]

JohannesGaessler commented 1 month ago

Does this issue still occur on the latest master commit?

gpez-git commented 1 week ago

Does this issue still occur on the latest master commit?

Yes. I have 3x P40 and 1x 4060 Ti. Output using row split is a single repeating word. Removing row split works fine (albeit slower). For smaller models that fit entirely on the P40s, row split works fine. Given that, and the lack of additional complaints/bug reports, I'm curious whether something broke with row splitting across non-homogeneous Nvidia architectures.
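One way to probe that hypothesis (a sketch; the device indices are an assumption about this particular machine) is to hide the 4060 Ti via CUDA_VISIBLE_DEVICES and retry row split on the P40s alone:

    # assuming devices 0-2 are the P40s and device 3 is the 4060 Ti
    CUDA_VISIBLE_DEVICES=0,1,2 ./llama-cli -m ./model.gguf -p "Hello" -n 64 -ngl 99 -sm row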

luzamm commented 1 week ago

Same issue: 3x RTX 2080 Ti, Mistral-Large-Instruct-2407.i1-Q2_K.gguf. Build b3678 with GGML_CUDA_FORCE_CUBLAS=false breaks -sm row; building with GGML_CUDA_FORCE_CUBLAS=true fixes it.

JohannesGaessler commented 1 week ago

Please confirm whether or not this fix works: https://github.com/ggerganov/llama.cpp/pull/9413