ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: -DCMAKE_CUDA_ARCHITECTURES=52 on GTX 1660 Ti or RTX 3060 results in incorrect output #9019

Open cebtenzzre opened 1 month ago

cebtenzzre commented 1 month ago

What happened?

The v3.2.0 release of GPT4All was effectively built with -DCMAKE_CUDA_ARCHITECTURES=52 due to a mistake in the build scripts. I would expect this to still work on a GTX 1660 Ti or RTX 3060, albeit with reduced performance, since PTX should be fully forward-compatible. It might even be acceptable if it failed an assertion due to a known incompatibility with the newer GPUs. However, both of these GPUs produced nonsense generation instead, which I did not expect.

Building with -DCMAKE_CUDA_ARCHITECTURES="52;61;70;75" fixed the nonsense generation for the RTX 3060 user.
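For reference, a configure invocation along these lines reproduces the fix. This is only a sketch: the GGML_CUDA option name is an assumption and differs on older trees (e.g. LLAMA_CUDA or LLAMA_CUBLAS).

```sh
# Sketch of a multi-architecture build; GGML_CUDA is assumed here and may be
# named LLAMA_CUDA/LLAMA_CUBLAS depending on the llama.cpp revision.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="52;61;70;75"
cmake --build build --config Release -j
```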

cc @slaren @JohannesGaessler

--

Llama 3.1 8B Instruct 128k on a GTX 1660 Ti:

Video: https://github.com/user-attachments/assets/d513c1fe-c430-4b40-8b60-6631068280e9

--

Phi-3 Mini Instruct on an RTX 3060:

![image](https://github.com/user-attachments/assets/e1f35ce9-7b3b-4818-b3a8-b72a7c23cfd0)

--

Llama 3.1 8B Instruct 128k on an RTX 3060:

![image](https://github.com/user-attachments/assets/b0303cad-e9f5-4c7a-bb86-c3ba58a4b9aa)

Name and Version

This has only been reproduced on this fork (commit https://github.com/nomic-ai/llama.cpp/commit/443665aec4721ecf57df8162e7e093a0cd674a76) so far (based on 87e397d00), and I do not have any newer NVIDIA GPUs so I cannot easily confirm whether it is present on the latest master here myself. This issue does not seem to be present on compute 6.1 GPUs (Pascal).

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

JohannesGaessler commented 1 month ago

The llama.cpp host code does not depend on which compute capabilities the code was compiled for: it always selects the same kernels regardless. Some llama.cpp CUDA kernels use features that require a specific minimum compute capability and simply cannot be compiled for a lower one. In that case the kernel is supposed to call NO_DEVICE_CODE, which causes a crash instead of incorrect results. NO_DEVICE_CODE was broken for some time on master, but I don't think that is the issue here.
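As a rough sketch of that guard pattern (not the actual llama.cpp macro; MIN_CC_FAST_PATH and fast_kernel are made-up names), a kernel that needs a newer architecture compiles down to a stub that traps when the binary was built only for older compute capabilities:

```cpp
// Illustrative only: llama.cpp's real NO_DEVICE_CODE macro differs in detail.
#include <cstdio>

#define MIN_CC_FAST_PATH 700  // hypothetical minimum compute capability (7.0)

__global__ void fast_kernel(float * dst, const float * src, int n) {
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= MIN_CC_FAST_PATH  // host pass or new enough arch
    // real kernel body, using features only available on cc >= 7.0
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
#else
    // compiled for an architecture that cannot run the fast path:
    // fail loudly instead of silently producing garbage
    (void) dst; (void) src; (void) n;
    printf("no device code for this compute capability\n");
    __trap();
#endif
}
```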

When I manually set a lower compute capability I instead get the following runtime error:

CUDA error: the provided PTX was compiled with an unsupported toolchain.

It is known and expected that the code will not work correctly if compiled for the wrong compute capabilities. The bug is that the program didn't crash.

cebtenzzre commented 1 month ago

> CUDA error: the provided PTX was compiled with an unsupported toolchain.

When I ran into this, I found that it was because I had built against CUDA 12.5 while nvidia-smi showed only CUDA 12.4, since NVIDIA's Linux driver releases lag behind the toolkit. I think that error is meant to be interpreted literally: there is an incompatibility between the driver and your toolchain, so you cannot use the built PTX at all, only CUBINs. We started building GPT4All against CUDA 11.8 specifically to avoid this problem with GPUs newer than compute 7.5 on older drivers, without having to build for more architectures.

You will likely see the same thing if you set e.g. -DCMAKE_CUDA_ARCHITECTURES=61-virtual, assuming you are testing on compute 6.1.
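For context, the -real/-virtual suffixes are standard CMake CUDA_ARCHITECTURES behavior rather than anything llama.cpp specific; roughly (to the best of my understanding):

```sh
# CMake CUDA_ARCHITECTURES suffix semantics:
#   -DCMAKE_CUDA_ARCHITECTURES=61-real     # SASS (cubin) for sm_61 only, no PTX
#   -DCMAKE_CUDA_ARCHITECTURES=61-virtual  # PTX for compute_61 only, JIT-compiled by the driver
#   -DCMAKE_CUDA_ARCHITECTURES=61          # both sm_61 SASS and compute_61 PTX
```

With a -virtual-only build everything goes through the driver's PTX JIT, which is exactly where a driver/toolkit mismatch like the error above shows up.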