Open phiharri opened 1 year ago
Reproducible on c10704d01e21e3dbe4d6ca1026ebff85349dd239
server -m /storage/models/mythomax-l2-13b.ggmlv3.q5_K_M.gguf --port 7860 -c 4096 -b 512 -t 6 -ngl 43 -ts 80,101 --mlock
for x in `seq 5`; do curl -s -X POST --url 'http://localhost:7860/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" business, communication, community, education, health"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
Even on CPU inference I can't get any determinism from the example server at all:
$ for x in `seq 5`; do curl -s -X POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Cells interlinked, within cells interlinked, within a single stem.", "n_predict": 16, "seed":1337}' | jq '.content' ; done
"\nA network of life, the most basic of forms, intertwined."
"\nA single stem, from which all others come.\nThis is not about"
" The stem is the source of life, the starting point for all. Leaves"
" The stem is the root of all life,\nFrom which the branches extend,"
" The stem is the root of all life,\nFrom which roots grow in multiple"
I'll probably try using the library directly, but I'm working in Rust, and two of the popular Rust wrappers already had issues where they segfault or act oddly. That's strange, because the C++ code itself hasn't segfaulted on me yet.
We seem to be ignoring the provided seed:
Probably a regression after the parallel decoding support to server was merged
I could check it out as long as I'm here. I found the directions to run the CI locally, is there anything else I should do to test a patch?
Edit: I just need to free up some space on my disk; I didn't realize the CI would need ~35 GB to run, lol. If someone ninja-patches it before I get back to it, please document that in the CI's readme?
> Probably a regression after the parallel decoding support to server was merged
I don't think there's currently any sane way to handle this in server (or any case where you're doing parallel generation and want different sequences to use different seeds).
The problem is that the RNG is stored in the context, and you can't save/restore its state; you can only set the seed. I think some backend stuff would need to change before this could actually be fixed, e.g. moving the RNG into the sampler state so that each sequence could have its own RNG initialized with a different seed.
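The per-sequence RNG idea above can be sketched in a few lines. This is an illustrative Python model, not llama.cpp's actual API: `SamplerState` is a hypothetical stand-in for a per-slot sampling context that owns its own generator instead of sharing the context-wide one.

```python
import random

class SamplerState:
    """Hypothetical per-sequence sampler state: each sequence owns its
    own RNG, so parallel slots no longer interleave draws from one
    shared context-wide generator."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # independent stream per sequence

    def sample(self, probs):
        # Multinomial draw: pick a token index by cumulative probability.
        r = self.rng.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

probs = [0.1, 0.2, 0.3, 0.4]
# Two sequences seeded identically draw identical token streams,
# regardless of how their requests interleave.
a = SamplerState(1337)
b = SamplerState(1337)
draws_a = [a.sample(probs) for _ in range(8)]
draws_b = [b.sample(probs) for _ in range(8)]
assert draws_a == draws_b
```

With a shared generator, the order in which parallel slots happen to call it changes what each slot receives, which is exactly why a fixed seed stops being reproducible.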
We can split this off if my issue on CPU is different, but on bbecf3f4 it seems to work with cache_prompt false? Whatever commit I was using before didn't work, I think.
Is this intended behavior for cache_prompt? If not, is there a place I should add a regression test? I have a use case where I just need the LLM to be a classifier and I don't want it to be creative at all.
Hmm, we might be talking about slightly different things. It looks like you're using temperature 0 so RNG stuff shouldn't affect you at all. The problem I was talking about would apply in the case when the RNG would have an effect (so temperature > 0) and also parallel sequences that should each use a different seed.
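The temperature-0 point is easy to see in a toy sampler (an illustrative sketch, not llama.cpp's implementation): at temperature 0 the sampler reduces to a greedy argmax and never consults the RNG, so the seed cannot matter.

```python
import math
import random

def sample(logits, temperature, rng):
    """Toy sampler: temperature 0 means greedy argmax (RNG never used);
    otherwise softmax-with-temperature followed by a random draw."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1

logits = [1.0, 3.0, 2.0]
# Different seeds, same result at temperature 0: the seed is irrelevant.
greedy_a = sample(logits, 0.0, random.Random(1))
greedy_b = sample(logits, 0.0, random.Random(9999))
assert greedy_a == greedy_b == 1
```

So if identical temperature-0 requests still produce different outputs, the divergence has to come from somewhere other than sampling, e.g. what's in the KV cache.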
So basically I think you can pretty much disregard both my comment and GG's about the RNG seed thing. It should not make a difference in your particular case.
Okay. I just happened to find my CPU inference issue by searching the issue tracker for "determinism" so I'm not sure if I should start a new issue or what, mine isn't CUDA-related.
I can confirm there's something wrong with cache_prompt: true. I just found the exact same problem as @ReactorScram while using the llama.cpp server API.
Okay, cool. I don't know exactly how it's supposed to work, but I assume "cache" means it should yield the same output, just faster, right? I did notice the second inference for the same prompt is usually quicker. But it doesn't feed in previous prompts; it just caches the state in case two prompts share a prefix?
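The prefix-reuse idea can be sketched like this (an illustrative model of prompt caching, not the server's actual code): only the leading tokens shared with the cached sequence can be reused, and evaluation restarts at the first position where the new prompt diverges.

```python
def common_prefix_len(cached_tokens, prompt_tokens):
    """Number of leading tokens the cached sequence shares with the
    new prompt. Only these positions can be reused from the KV cache;
    everything after must be re-evaluated."""
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]
prompt = [1, 2, 3, 9, 9, 9]
reuse = common_prefix_len(cached, prompt)
assert reuse == 3  # positions 0..2 reused; evaluation restarts at index 3
```

Correctness requires that a token evaluated from the cache produce bit-identical state to one evaluated from scratch; the reports in this thread suggest that invariant is being violated somewhere.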
@ggerganov I tried un-commenting that line, but it doesn't seem to compile because the seed can only be set on the llama_context, which is server-wide, and the requests come to a specific slot, and the slot only has a sampling context, for which I cannot set the seed.
I noticed that even if I disable caching, the seed isn't used: at temperature 0.0, the output never changes even if the seed changes; at 0.7, the output always changes, even if the seed doesn't.
main works as expected, so I'll just invoke main for my use case for now. It's fast enough if the model is mmapped, even without the caching feature.
I set up Doxygen in my fork so anything I learn as a new reader / contributor will have a place to be added. I'll make a PR for that if it's fine. Here's my local git log:
1650511 (HEAD -> master, me/master) server : extract assign_work_to_slot to reduce max indentation and function length
3fbfdec common : set up Doxygen comments in a few places
06de72f server : fix uninitialized members
a46349a docs : add Doxygen config
5d231da ci : document free disk space needed
b691e61 gitignore : build-info.h
> I tried un-commenting that line, but it doesn't seem to compile
Yes, it got disabled when the sampling context stuff got added. The reason presumably was because it wasn't a simple fix.
But like I mentioned, that shouldn't really have anything to do with your issue because you're using temperature 0. If the issue only occurs when you have prompt caching turned on, then if you want to try to fix the issue you should probably look at the differences between when prompt caching is disabled vs enabled.
I can confirm this behavior too. When disabling the KV cache, the output is deterministic. When keeping it enabled, the first output is different from the following ones (second, third, etc.), even with deterministic parameters (fixed seed, temperature 0, fixed context length, etc.). It seems the cache is not restored to the same state it was initialized with.
When layers are offloaded with CUDA, sending identical requests to the examples/server completion API returns a different response the "first time":
This seems cache related as responses then remain the same until a different prompt is processed, after which the differing first response occurs again:
Expected Behaviour
Output should remain the same when parameters and seed are constant.
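The expected invariant can be stated as a tiny check (a hypothetical helper, shown here in Python): given identical parameters and seed, every response to the same request should be byte-identical, including the first one.

```python
def responses_deterministic(responses):
    """True if every response to an identical request matches the first.
    The bug in this thread fails exactly this check: the first response
    after a new prompt differs from the rest."""
    return all(r == responses[0] for r in responses)

# The observed failure mode: first response diverges, later ones agree.
observed = [" greek, hellenistic ...", " business, communication ...",
            " business, communication ..."]
assert not responses_deterministic(observed)
assert responses_deterministic(["same"] * 5)
```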
Other Observations
The issue reproduces with -ngl 3 but not with -ngl 2.

Environment
LLAMA_CUBLAS=1 make -j
Thanks for reading! 😎