Closed: JohannesGaessler closed this issue 1 year ago.
I accidentally opened this issue prematurely by pressing CTRL+Enter. I am not yet done with ensuring that everything is correct.
Everything should be in order now; sorry for the inconvenience.
Have you noticed if this also happens with smaller models (7B, 13B)?
The bug also occurs with 13b:
Dammit, I can't believe I read that whole thing
Dammit, I can't even get angry at this stupid
Dammit, I can't even get angry at this stupid
Dammit, I am so tired of having to deal with people
I think there are two possible ways to address this: adding a cudaDeviceSynchronize() at the end of the inner loop of the ggml_cuda_mul_mat_<f32/f16/q_f32> functions, forcing each mat mul to be performed sequentially, may fix it at the cost of performance; using a different cuBLAS handle per stream may also work, but can also affect performance.
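For illustration, a minimal sketch of those two mitigations (this is not the actual ggml-cuda.cu code; the column chunking, function name, and N_STREAMS are simplified assumptions):

```cpp
// Sketch of the two mitigations discussed above:
// (1) a cudaDeviceSynchronize() at the end of each loop iteration, and
// (2) one cuBLAS handle per stream instead of a single shared handle.
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define N_STREAMS 8 // analogous to GGML_CUDA_MAX_STREAMS

// C = A * B, column-major: A is m x k, B is k x n, C is m x n.
// For simplicity, n is assumed to be divisible by n_chunks.
void mul_mat_chunked(cublasHandle_t handles[N_STREAMS],
                     cudaStream_t   streams[N_STREAMS],
                     const float * A, const float * B, float * C,
                     int m, int n, int k, int n_chunks) {
    const float alpha = 1.0f, beta = 0.0f;
    const int n_chunk = n / n_chunks;
    for (int i = 0; i < n_chunks; ++i) {
        const int s = i % N_STREAMS;
        // Mitigation 2: each stream gets its own handle, bound to that stream.
        cublasSetStream(handles[s], streams[s]);
        cublasSgemm(handles[s], CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n_chunk, k, &alpha,
                    A, m,
                    B + (size_t) i * k * n_chunk, k,
                    &beta,
                    C + (size_t) i * m * n_chunk, m);
        // Mitigation 1: force each mat mul to complete before issuing the next,
        // which serializes the streams at a performance cost.
        cudaDeviceSynchronize();
    }
}
```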
Unless there is a bug in the multi-stream synchronization, or this affects the generation quality, I am not sure we should do anything about it. Note that generation quality needs to be evaluated in an objective way, such as the perplexity.
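For reference, perplexity here is the standard exponential of the average negative log-likelihood over the evaluated tokens; a minimal sketch of that computation (not the actual perplexity example in llama.cpp):

```cpp
// Standard perplexity from per-token log-probabilities (natural log).
#include <cmath>
#include <vector>

double perplexity(const std::vector<double> & token_logprobs) {
    double nll = 0.0; // accumulated negative log-likelihood
    for (const double lp : token_logprobs) {
        nll -= lp;
    }
    return std::exp(nll / (double) token_logprobs.size());
}
```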
I can confirm that the bug also occurs with 7b:
Labels: Al-Quaeda, Armed Forces, Gor
"Dead" as in "deceased"? Wow
Labels: Angry, Dumbasses, Funny,
Labels: Angry, Dumbasses, Funny,
I did not do any objective measurement of generation quality. Subjectively I was not able to tell a difference. In any case, if cuBLAS does not guarantee reproducibility anyway, then that is probably the reason. I was simply confused because this behavior made me question whether I had accidentally introduced race conditions in https://github.com/ggerganov/llama.cpp/pull/1341. Perhaps a warning should be printed when the user specifies a seed in combination with cuBLAS? Either way, I agree that this would probably not be worth sacrificing performance for.
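A minimal sketch of what such a warning could look like, assuming the GGML_USE_CUBLAS define used by the LLAMA_CUBLAS=1 build and a "seed < 0 means random" convention (both are assumptions about the surrounding code):

```cpp
#include <cstdio>

// Hypothetical helper: warn that a fixed seed is not enough for reproducible
// output in a cuBLAS build.
static void warn_if_seed_with_cublas(int seed) {
#ifdef GGML_USE_CUBLAS
    if (seed >= 0) {
        fprintf(stderr, "warning: cuBLAS builds do not guarantee bit-identical "
                        "output for a fixed seed\n");
    }
#else
    (void) seed; // avoid unused-parameter warning in non-cuBLAS builds
#endif
}
```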
Adding cudaDeviceSynchronize() in the loop does not make a difference. When I set GGML_CUDA_MAX_STREAMS to 1, the outputs become deterministic; in turn, prompt processing seems to become ~1-2% slower. I think it's sufficient to document this behavior somewhere.
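For illustration, the usual reason a single stream restores determinism is that float addition is not associative, so any run-to-run variation in the order in which partial results are accumulated can change the logits and therefore the sampled tokens. A tiny standalone example of the non-associativity:

```cpp
// Float addition is not associative: a different accumulation order
// across streams can produce slightly different results.
#include <cstdio>

int main() {
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c); // prints 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c)); // prints 0.0 (b + c rounds to b)
    return 0;
}
```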
I just ran perplexity tests with 8 CUDA streams vs. 1 stream. The perplexity of 7b q4_0 was 6.2838 in both configurations. 8 streams was about 6% faster than 1 stream (8.66 ms/token vs. 9.20 ms/token).
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
When I set a seed and repeat a generation with the exact same parameters I expect to get the exact same text again.
Current Behavior
When I re-run a generation with the same seed and parameters, the generated text is not always identical between runs: sometimes it matches, sometimes it does not.
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Failure Information (for bugs)
I suspect that there is a race condition somewhere that affects the generated text, and depending on how the race resolves, one of several outputs is produced. I only get the bug when compiling with LLAMA_CUBLAS=1. I only get the bug with a prompt that is sufficiently long (navy seals copypasta, 399 tokens) but not with a short prompt ("People die when they are killed.", 8 tokens). The number of threads does not matter. The quantization scheme does not matter.
Steps to Reproduce
make clean && LLAMA_CUBLAS=1 make
./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt
with the file navy_seals_copypasta.txt containing the navy seals copypasta as a prompt (399 tokens).
Failure Logs
Below is a log of my console when repeatedly running with the same seed and parameters. Outputs are in order:
Labels: 4chan, epic win, fail, fun
Labels: 4chan, epic win, fail, fun
(thing) by Kalkin Tue Jul 10
You think this is abuse? This is how I treat people who