Open phiharri opened 1 year ago
Reproducible on c10704d01e21e3dbe4d6ca1026ebff85349dd239
server -m /storage/models/mythomax-l2-13b.ggmlv3.q5_K_M.gguf --port 7860 -c 4096 -b 512 -t 6 -ngl 43 -ts 80,101 --mlock
for x in `seq 5`; do curl -s -X POST --url 'http://localhost:7860/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" business, communication, community, education, health"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
" greek, hellenistic, language, papyrus, text"
Even on CPU inference I can't get any determinism from the example server at all:
$ for x in `seq 5`; do curl -s -X POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "Cells interlinked, within cells interlinked, within a single stem.", "n_predict": 16, "seed":1337}' | jq '.content' ; done
"\nA network of life, the most basic of forms, intertwined."
"\nA single stem, from which all others come.\nThis is not about"
" The stem is the source of life, the starting point for all. Leaves"
" The stem is the root of all life,\nFrom which the branches extend,"
" The stem is the root of all life,\nFrom which roots grow in multiple"
I'll probably try using the library directly, but I'm working in Rust, and two of the popular Rust wrappers already had issues where they segfault or act oddly. That's strange, because the C++ code itself hasn't segfaulted on me yet.
We seem to be ignoring the provided seed:
Probably a regression after the parallel decoding support to server was merged
I could check it out as long as I'm here. I found the directions to run the CI locally, is there anything else I should do to test a patch?
Edit: I just need to free up some space on my disk; I didn't realize the CI would need ~35 GB to run, lol. If someone ninja-patches it before I get back to it, please document that in the CI's readme?
> Probably a regression after the parallel decoding support to server was merged
I don't think there's currently any sane way to handle this in server (or any case where you're doing parallel generation and want different sequences to use different seeds).
The problem is that the RNG is stored in the context, and you can't save/restore its state; you can only set the seed. I think some backend stuff would need to change before this could actually be fixed, e.g. moving the RNG into the sampler state so that each sequence could have its own RNG initialized with a different seed.
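The per-sequence RNG idea above can be sketched in a few lines. This is an illustrative Python model, not llama.cpp's actual API: `SamplerState` is a hypothetical stand-in for a per-slot sampling context that owns its own generator instead of sharing the context-wide one.

```python
import random

class SamplerState:
    """Hypothetical per-sequence sampler state: each sequence owns its
    own RNG, so parallel slots no longer interleave draws from one
    shared context-wide generator."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # independent stream per sequence

    def sample(self, probs):
        # Multinomial draw: pick a token index by cumulative probability.
        r = self.rng.random()
        acc = 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

probs = [0.1, 0.2, 0.3, 0.4]
# Two sequences seeded identically draw identical token streams,
# regardless of how their requests interleave.
a = SamplerState(1337)
b = SamplerState(1337)
draws_a = [a.sample(probs) for _ in range(8)]
draws_b = [b.sample(probs) for _ in range(8)]
assert draws_a == draws_b
```

With a shared generator, the order in which parallel slots happen to call it changes what each slot receives, which is exactly why a fixed seed stops being reproducible.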
We can split this off if my issue on CPU is different, but on bbecf3f4 it seems to work with cache_prompt false? Whatever commit I was using before didn't work, I think.
Is this intended behavior for cache_prompt? If not, is there a place I should add a regression test? I have a use case where I just need the LLM to be a classifier and I don't want it to be creative at all.
Hmm, we might be talking about slightly different things. It looks like you're using temperature 0 so RNG stuff shouldn't affect you at all. The problem I was talking about would apply in the case when the RNG would have an effect (so temperature > 0) and also parallel sequences that should each use a different seed.
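The temperature-0 point is easy to see in a toy sampler (an illustrative sketch, not llama.cpp's implementation): at temperature 0 the sampler reduces to a greedy argmax and never consults the RNG, so the seed cannot matter.

```python
import math
import random

def sample(logits, temperature, rng):
    """Toy sampler: temperature 0 means greedy argmax (RNG never used);
    otherwise softmax-with-temperature followed by a random draw."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1

logits = [1.0, 3.0, 2.0]
# Different seeds, same result at temperature 0: the seed is irrelevant.
greedy_a = sample(logits, 0.0, random.Random(1))
greedy_b = sample(logits, 0.0, random.Random(9999))
assert greedy_a == greedy_b == 1
```

So if identical temperature-0 requests still produce different outputs, the divergence has to come from somewhere other than sampling, e.g. what's in the KV cache.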
So basically I think you can pretty much disregard both my comment and GG's about the RNG seed thing. It should not make a difference in your particular case.
Okay. I just happened to find my CPU inference issue by searching the issue tracker for "determinism" so I'm not sure if I should start a new issue or what, mine isn't CUDA-related.
I can confirm there's something wrong with cache_prompt: true. I just found the exact same problem as @ReactorScram while using the llama.cpp server API.
Okay, cool. I don't know exactly how it's supposed to work, but I assume "cache" means it should yield the same output, just faster, right? I did notice the second inference for the same prompt is usually quicker. But it doesn't feed in previous prompts; it just caches the state in case two prompts share a prefix?
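The prefix-reuse idea can be sketched like this (an illustrative model of prompt caching, not the server's actual code): only the leading tokens shared with the cached sequence can be reused, and evaluation restarts at the first position where the new prompt diverges.

```python
def common_prefix_len(cached_tokens, prompt_tokens):
    """Number of leading tokens the cached sequence shares with the
    new prompt. Only these positions can be reused from the KV cache;
    everything after must be re-evaluated."""
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]
prompt = [1, 2, 3, 9, 9, 9]
reuse = common_prefix_len(cached, prompt)
assert reuse == 3  # positions 0..2 reused; evaluation restarts at index 3
```

Correctness requires that a token evaluated from the cache produce bit-identical state to one evaluated from scratch; the reports in this thread suggest that invariant is being violated somewhere.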
@ggerganov I tried un-commenting that line, but it doesn't seem to compile because the seed can only be set on the llama_context, which is server-wide, and the requests come to a specific slot, and the slot only has a sampling context, for which I cannot set the seed.
I noticed that even if I disable caching, the seed isn't used: at temperature 0.0, the output never changes even if the seed changes; at 0.7, the output always changes, even if the seed doesn't.
main works as expected, so I'll just invoke main for my use case for now. It's fast enough if the model is mmapped, even without the caching feature.
I set up Doxygen in my fork so anything I learn as a new reader / contributor will have a place to be added. I'll make a PR for that if it's fine. Here's my local git log:
1650511 (HEAD -> master, me/master) server : extract assign_work_to_slot to reduce max indentation and function length
3fbfdec common : set up Doxygen comments in a few places
06de72f server : fix uninitialized members
a46349a docs : add Doxygen config
5d231da ci : document free disk space needed
b691e61 gitignore : build-info.h
> I tried un-commenting that line, but it doesn't seem to compile
Yes, it got disabled when the sampling context stuff got added. The reason presumably was because it wasn't a simple fix.
But like I mentioned, that shouldn't really have anything to do with your issue because you're using temperature 0. If the issue only occurs when you have prompt caching turned on, then if you want to try to fix the issue you should probably look at the differences between when prompt caching is disabled vs enabled.
I can confirm this behavior too. When disabling the KV cache, the output is deterministic. When keeping it enabled, the first output is different from the following ones (second, third, etc.), even with deterministic parameters (fixed seed, temperature 0, fixed context length, etc.). It seems the cache is not restored to the same state it was initialized with.
When layers are offloaded with CUDA, sending identical requests to the examples/server completion API returns a different response the "first time":
This seems cache related as responses then remain the same until a different prompt is processed, after which the differing first response occurs again:
Expected Behaviour
Output should remain the same when parameters and seed are constant.
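The expected invariant can be stated as a tiny check (a hypothetical helper, shown here in Python): given identical parameters and seed, every response to the same request should be byte-identical, including the first one.

```python
def responses_deterministic(responses):
    """True if every response to an identical request matches the first.
    The bug in this thread fails exactly this check: the first response
    after a new prompt differs from the rest."""
    return all(r == responses[0] for r in responses)

# The observed failure mode: first response diverges, later ones agree.
observed = [" greek, hellenistic ...", " business, communication ...",
            " business, communication ..."]
assert not responses_deterministic(observed)
assert responses_deterministic(["same"] * 5)
```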
Other Observations
The issue reproduces with -ngl 3 but not with -ngl 2.

Environment
LLAMA_CUBLAS=1 make -j
Thanks for reading! 😎