ggerganov / llama.cpp


Bug: GGML_HIP_UMA causes consistency errors #8496

Closed jeroen-mostert closed 5 days ago

jeroen-mostert commented 1 month ago

What happened?

GGML_HIP_UMA=1 is a build flag meant to speed things up for AMD iGPUs: it uses hipMallocManaged to get managed memory and then hipMemAdvise to mark that memory as coarse-grained. This last step, however, causes the output to be nondeterministic. To most people LLM text generation is not an exact science, so I'm not sure this should be called a bug, but it is at least something to be aware of, and ideally something to fix without losing the speedup, though I don't know what that would involve.
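
For illustration, this is roughly the allocation pattern involved (a simplified sketch, not the exact llama.cpp code; the real implementation lives in the HIP backend and has proper error handling):

#include <hip/hip_runtime.h>

// Hypothetical helper sketching the UMA path: allocate managed memory, then
// advise the runtime to treat it as coarse-grained. The advise step is the
// one that gives up CPU/GPU coherence and appears to cause the divergence.
static void * uma_alloc(size_t size, int device) {
    void * ptr = nullptr;
    if (hipMallocManaged(&ptr, size) != hipSuccess) {
        return nullptr;
    }
    // Dropping this call keeps the memory fine-grained (coherent) and makes
    // the output deterministic again, at some cost in speed.
    hipMemAdvise(ptr, size, hipMemAdviseSetCoarseGrain, device);
    return ptr;
}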

Repro'ing this needs the following ingredients: a Linux build with GGML_USE_HIPBLAS=1 GGML_HIP_UMA=1, an AMD iGPU that works (possibly by setting HSA_OVERRIDE_GFX_VERSION), and a model (any model) invoked with parameters that should make it produce the same output every time. I'm using Phi 3 mini as an example since it's reasonably zippy even on very weak GPUs.

./llama-cli -m ~/models/Phi-3-mini-4K-instruct-q4.gguf --seed 42 --sampling-seq k --top-k 1 -sm none -ngl 0 -f diagnoser.txt

Prompt attached, though the contents don't matter so much as the size. You just need a reasonably large prompt to repro the issue. Note that we're not offloading any layers here (-ngl 0); only the prompt processing uses the GPU.

The output may be more or less insightful depending on your model (Phi 3 mini isn't bad, but not particularly good either); the point is to repeat this command. With a fixed seed and --top-k 1 (effectively greedy sampling) there should be no variance as long as we're calculating things correctly, but if you run this on an iGPU you should observe different outputs from run to run (it may take a few tokens, but they will diverge).
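
To see it, run the same command twice and diff the results, something like this (file names are arbitrary):

./llama-cli -m ~/models/Phi-3-mini-4K-instruct-q4.gguf --seed 42 --sampling-seq k --top-k 1 -sm none -ngl 0 -f diagnoser.txt > run1.txt
./llama-cli -m ~/models/Phi-3-mini-4K-instruct-q4.gguf --seed 42 --sampling-seq k --top-k 1 -sm none -ngl 0 -f diagnoser.txt > run2.txt
diff run1.txt run2.txt

On an affected iGPU build the diff should come back non-empty after enough tokens.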

This will not happen with a dGPU; using managed memory there is just much slower with no benefit. You specifically need an iGPU where the CPU and GPU can trample on each other's feetsies. It also does not happen if the code is modified to remove the hipMemAdvise call, which shows that switching to coarse-grained memory is the issue, likely because the synchronization that should prevent concurrent CPU/GPU access to the same memory isn't actually happening. You do take a perf hit in that case -- a pretty minor one on my system, but since I'm testing with the iGPU of a Ryzen 5 7600 (which barely scratches up to the CPU in perf) that can't be considered representative.

Name and Version

version: 3400 (97bdd26e) built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu

ROCm 6.1.2, in case it matters.

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Haus1 commented 1 month ago

Does it still happen with --no-kv-offload? This may be part of a wider issue with AMD's LLVM implementation of BF16 / FP16 being incorrect. Edit: ref https://github.com/llvm/llvm-project/pull/71470/commits/6e0ed50893c698cd94db1cad0a5274495007630c

jeroen-mostert commented 1 month ago

Yes, it still happens with -nkvo.

If it's a wider issue, it's curious that it does not repro on my dGPUs. Furthermore, if it were due to some miscompiled kernel somewhere, that would not explain the nondeterminism (correct me if I'm wrong). You could (and in fact do) get different results depending on how you offload, but as long as you've eliminated randomness and your hardware isn't itself broken, repeating the same scenario should yield the same result, even if it's a wrong result.

Haus1 commented 1 month ago

If you can include the output of rocminfo and hipconfig it may help with pointing you in the right direction. Setting AMD_LOG_LEVEL=4 can also be helpful.
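
For example, something along these lines to capture the runtime trace to a file (the model and prompt are whatever you're already testing with; as far as I know the trace goes to stderr):

AMD_LOG_LEVEL=4 ./llama-cli -m ~/models/Phi-3-mini-4K-instruct-q4.gguf -f diagnoser.txt -ngl 0 2> hip-trace.log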

I'm guessing it's a pre-Vega device from the ROCm version you're using, but there are so many factors involved with these things that all I can do is offer workarounds that are known to correct it on other setups and see if any of them work for you too.

As for upstream driver bugs, Chromium is the best source I'm aware of for finding out about these things. The easiest way is to enable hardware acceleration and pull up chrome://gpu, where it's all organized and written out for you. https://chromium.googlesource.com/chromium/src/+/refs/tags/127.0.6533.59/gpu/config/gpu_driver_bug_list.json

Determinism and non-determinism are somewhat ambiguous terms in this context; what you're really after is mathematical correctness.

jeroen-mostert commented 1 month ago

I'm guessing it's a pre-Vega device from the ROCm version you're using

How do you figure that? Does ROCm 6 have some special support for them that earlier versions didn't? I'm using https://github.com/lamikr/rocm_sdk_builder specifically so we have the latest and greatest. But before anyone jumps on that: the problem also repros with the vanilla ROCm 6.0.2 that's available for Manjaro; it's not something specific to this build.

Edit: I suspect you're being misled by the fact that rocminfo can output things like "ROCk module version 6.7.0 is loaded". This is unrelated to the ROCm SDK version. It is taken from /sys/module/amdgpu/version, and that file is apparently only filled in if you have installed AMD's proprietary drivers (the driver that ships with ROCm 6.1.2 indeed carries a 6.7 version number, for example). Not an option if you're not on RHEL, Ubuntu or SUSE, apparently.

The GPU embedded in the Ryzen 5 7600 is Raphael, aka gfx1036, an RDNA 2 unit with 2 CUs. I would be interested to hear whether other users can repro the issue with other iGPUs, as I suspect it's not specific to any particular model, especially since the issue can be made to disappear by changing the memory allocation from coarse- to fine-grained. It might be tied to driver version, depending on how synchronization is implemented there, but I think we're actually just getting what we asked for: unsynchronized access to memory, a privilege that we're abusing.

To be clear: the output is not unusable or clearly incorrect, it's just not the same from run to run. I suspect this is something that may well have gone unnoticed until now because most users will be happy enough just getting things to run on their iGPU without caring if it's strictly speaking doing what it should be doing.

Thank you for the AMD_LOG_LEVEL hint, that looks useful. The amount of logging it spits out is insane and nearly undiffable thanks to the inclusion of things like pointer values, but that's nothing a little elbow grease can't fix.
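
For instance, scrubbing the pointer values before diffing two traces makes them comparable (the regex is approximate):

sed -E 's/0x[0-9a-fA-F]+/0xPTR/g' trace1.log > trace1.norm
sed -E 's/0x[0-9a-fA-F]+/0xPTR/g' trace2.log > trace2.norm
diff trace1.norm trace2.norm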

jeroen-mostert commented 1 month ago

OK, I remembered something that might be very relevant: I've been testing with Linux 6.10, which has an important change in how memory allocation is handled for iGPUs. Specifically, allocations now go directly to GTT, avoiding a bunch of copy overhead present in older kernels. That's great, but it's also exactly the sort of thing that could expose bugs like this. I will retest with kernel 6.9 when I have the chance.

Haus1 commented 1 month ago

Huh, somehow I misread 6.0 as 5.6.

Ironically, if it's a gfx1036, that may be the solution for you: https://github.com/ROCm/ROCm/discussions/2867

XNACK support typically seems to be the first thing to go, but if they don't even want to bother supporting it on an integrated platform, I'm starting to think they're planning an exit from the GPU side of things.

Have you checked to see if there is any debug code in place to help with dumping tensors? They may be getting truncated or using uninitialized memory.

github-actions[bot] commented 5 days ago

This issue was closed because it has been inactive for 14 days since being marked as stale.