Priestru opened this issue 1 year ago
Yeah, I am having the same issue.
I tricked it into working by increasing n_ctx to 2400
Output generated in 59.23 seconds (0.20 tokens/s, 12 tokens, context 2049, seed 461475505)
Also 2049 wtf.
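For reference, the workaround boils down to something like this with llama-cpp-python (a minimal sketch; the model path is a placeholder, not my actual setup):

```python
from llama_cpp import Llama

# Over-allocate the context window slightly so generation near the
# 2048-token mark isn't cut off (model path is hypothetical).
llm = Llama(model_path="./models/model.bin", n_ctx=2400)
```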
Happily running with an 8196 context size and can fit a 13B model onto my 11GB GTX 1080Ti:
llama.cpp: loading model from /usr/src/llama-cpp-telegram_bot/models/model.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 8196
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 2282.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
....................................................................................................
llama_init_from_file: kv self size = 6403.12 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 576.09 ms
llama_print_timings: sample time = 573.46 ms / 200 runs ( 2.87 ms per token)
llama_print_timings: prompt eval time = 576.01 ms / 88 tokens ( 6.55 ms per token)
llama_print_timings: eval time = 13484.66 ms / 199 runs ( 67.76 ms per token)
llama_print_timings: total time = 22735.88 ms
Do context sizes beyond 2048 make any sense for llama-based models that have only been trained up to a context size of 2048?
I couldn't get the llama.cpp perplexity benchmark working for context sizes larger than 2048. 8196 was the default that I inherited via llama-cpp-telegram_bot, and there doesn't seem to be much of a performance hit :man_shrugging:
Don't get me wrong. I'm not trying to go beyond 2048; I'm trying to force the model to run within a 2048 context size. Its current self-imposed limit is around 1650 due to some bug. The workaround only tricks it into working "as intended".
Happily running with an 8196 context size and can fit a 13B model onto my 11GB GTX 1080Ti: [full log quoted above]
Since your prompt processing was just 88 tokens, I'm not sure I'm getting your point here. This has nothing to do with the discussion.
Try sending a large first prompt (around 1800 tokens, but below 2048, with n_ctx=2048). Then it will generate 0 tokens. Judging by your data, you were just sending small prompts to the model, which was not our point at all. Our point is that when you send a large prompt, even below 2048 ctx, the AI will not generate anything. And when chatting with the model, the max ctx is around 1600 instead of 2048.
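For illustration, the reported behavior would correspond to something like this minimal llama-cpp-python sketch (model path and prompt are placeholders, not the reporter's actual setup):

```python
from llama_cpp import Llama

# Load with the default-sized context window (path is hypothetical).
llm = Llama(model_path="./models/model.bin", n_ctx=2048)

# Build a prompt that lands near (but below) the context limit.
long_prompt = "word " * 1800  # roughly 1800 tokens

# Under the reported bug this returns zero generated tokens.
out = llm(long_prompt, max_tokens=200)
print(repr(out["choices"][0]["text"]))
```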
Can you send a reproducible example? This has not been my experience using long prompts and a context size of 8192. It is possible of course that you're hitting an edge case.
I use ooba here (actually it's SillyTavern, and ooba works as the API), but ooba itself isn't responsible for anything.
llama_print_timings: load time = 15186.81 ms
llama_print_timings: sample time = 15.96 ms / 104 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 34742.91 ms / 1531 tokens ( 22.69 ms per token)
llama_print_timings: eval time = 70428.48 ms / 103 runs ( 683.77 ms per token)
llama_print_timings: total time = 106595.29 ms
Output generated in 106.89 seconds (0.96 tokens/s, 103 tokens, context 1978, seed 23583742)
Llama.generate: prefix-match hit
llama_print_timings: load time = 15186.81 ms
llama_print_timings: sample time = 7.74 ms / 50 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 33804.06 ms / 50 runs ( 676.08 ms per token)
llama_print_timings: total time = 34525.68 ms
Output generated in 34.82 seconds (1.41 tokens/s, 49 tokens, context 1978, seed 1224656424)
Llama.generate: prefix-match hit
llama_print_timings: load time = 15186.81 ms
llama_print_timings: sample time = 13.67 ms / 85 runs ( 0.16 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 59195.54 ms / 85 runs ( 696.42 ms per token)
llama_print_timings: total time = 60237.05 ms
Output generated in 60.55 seconds (1.39 tokens/s, 84 tokens, context 1978, seed 507786728)
These are the results with the workaround, where I load the model with n_ctx = 2400. As you can see, it generates smoothly at a context close to 2k.
Now I'll change nothing, literally the same prompt, but I'll reload the model with n_ctx = 2048.
INFO:Loading Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 2.25 seconds.
Here we go:
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1483144509)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 457531329)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1258266880)
Output generated in 0.29 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 173099458)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1011850065)
Fails successfully, as expected. Now I will go back to the larger n_ctx:
INFO:Loading Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2400
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size = 3656.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 2.33 seconds.
and now it works again as expected:
llama_print_timings: load time = 17918.28 ms
llama_print_timings: sample time = 36.63 ms / 236 runs ( 0.16 ms per token)
llama_print_timings: prompt eval time = 53098.17 ms / 1978 tokens ( 26.84 ms per token)
llama_print_timings: eval time = 160671.30 ms / 235 runs ( 683.71 ms per token)
llama_print_timings: total time = 215008.68 ms
Output generated in 215.29 seconds (1.09 tokens/s, 235 tokens, context 1978, seed 598884865)
Windows 11. CuBLAS, latest version of everything: https://github.com/abetlen/llama-cpp-python https://github.com/oobabooga/text-generation-webui https://github.com/SillyTavern/SillyTavern
Admittedly the perplexity isn't at all good, but as per @jmtatsch that's likely due to llama's designed context of 2048:
$ pip list | grep llama
llama-cpp-python 0.1.57
$ python ./high_level_api_inference.py
llama.cpp: loading model from /data/llama/7B/ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 8192
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 12353 MB
...................................................................................................
llama_init_from_file: kv self size = 4096.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 1845.23 ms
llama_print_timings: sample time = 1255.23 ms / 2048 runs ( 0.61 ms per token)
llama_print_timings: prompt eval time = 10921.96 ms / 2254 tokens ( 4.85 ms per token)
llama_print_timings: eval time = 301437.04 ms / 2047 runs ( 147.26 ms per token)
llama_print_timings: total time = 366120.17 ms
{
"id": "cmpl-70bf6f1a-1aa8-401d-ab2f-e9302aa61cd8",
"object": "text_completion",
"created": 1686154862,
"model": "/data/llama/7B/ggml-model-f16.bin",
"choices": [
{
"index": 0,
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 2254,
"completion_tokens": 2048,
"total_tokens": 4302
}
}
And god saw it wasn't good so he gave us https://huggingface.co/epfml/landmark-attention-llama7b-wdiff
Now let's pray to the llama gods of Apache 2.0 that they also attend to our pleas for large contexts.
Can you send a reproducible example? This has not been my experience using long prompts and a context size of 8192. It is possible of course that you're hitting an edge case.
Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python.
I've loaded a ggml 5_1 13B model in Ooba with a max context of 2048 and max new tokens 200 (this is the default and important; when trying to reproduce the issue, please do not use an 8K context).
Then I send this prompt to the model, which is just a bit over 1900 tokens:
Please complete the following text:
The generation immediately stops (screenshot of the WebUI omitted). In the command line: "Output generated in 0.31 seconds (0.00 tokens/s, 0 tokens, context 1933, seed 1692940365)"
Please refer to #307; this is the exact same issue. We don't want a longer context than 2048 (at least right now); we want to send long prompts within the 2048-token window without the generation stopping entirely.
Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python.
Again, I don't see the problem with llama-cpp-python. I even went to the effort to try and reproduce it from your description, but couldn't.
Which OS are you running? I've noticed Priestru and I are using the same OS (Windows 11).
BTW, just because you can't reproduce it doesn't mean the issue is invalid.
I encountered this issue on Ubuntu 22.04 (GeForce 1080 ti, if that matters).
Can you post your code, please?
This happened with text-generation-webui. Sorry for not mentioning that.
No problem. I'm sure any text-generation-webui developer reading this issue will jump in and fix it immediately.
Sarcasm aside, text-generation-webui uses this library for text generation for llama based models, which is why OP opened this issue in the first place.
I get short responses that are cut off when I use stream completions in server mode, is this related?
Are you sure you're just not hitting the generation limit? That's usually the case when that happens to me.
I get short responses that are cut off when I use stream completions in server mode, is this related?
It could well be. Do you have a curl request to easily reproduce the problem?
I spent an hour trying to reproduce the problem from the OP's limited description, but without the specifics of exactly how llama-cpp-python is being called, the issue is likely not going to get identified and fixed.
I get very long answers to the same query with curl when I don't stream. Can I even stream with curl though? I'm streaming with the openai Python api.
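For what it's worth, the server's OpenAI-compatible endpoint can be streamed without the openai client by reading the SSE lines directly. A hedged Python sketch (the endpoint path and default port 8000 are assumptions):

```python
import json
import requests

# Stream a completion from the llama-cpp-python server as
# server-sent events (host and port are assumptions).
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Hello", "max_tokens": 50, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    # SSE payload lines look like: data: {...}; the stream ends
    # with a literal "data: [DONE]" sentinel.
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data: "):])
        print(chunk["choices"][0]["text"], end="", flush=True)
```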
Knowing that it is an issue with the streaming API helps, thanks. It explains why I couldn't reproduce it with the high level API example.
When this happens, the finish_reason is "length", by the way. It happens with both stream and no-stream Python clients, so maybe it's just me. With curl the response is nice and finish_reason is "stop".
Earlier today I asked @abetlen to look into this more.
Can I please confirm the versions of text-generation-webui and llama-cpp-python people are using?
I used text-generation-webui v1.3.1, and llama-cpp-python v0.1.57.
llama-cpp-python version 0.1.59 here.
Using 0.1.59 as well. But I don't know how to check the version for Textgen.
0.1.59 for llama-cpp-python, but the bug was present in the previous version too. About ooba, I can only say that I use the latest one.
Also, in ooba there is another issue of a somewhat similar kind that seems like it's going to be dismissed as a llama-cpp-python problem; I feel somewhat hesitant to create new issues at this point.
https://github.com/oobabooga/text-generation-webui/issues/2576#issuecomment-1583339709
I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening. This is what I got on the console:
Traceback (most recent call last):
File "/home/alex/ai/text-generation-webui/modules/text_generation.py", line 301, in generate_reply_custom
reply = shared.model.generate(context=question, **generate_params)
File "/home/alex/ai/text-generation-webui/modules/llamacpp_model.py", line 78, in generate
for completion_chunk in completion_chunks:
File "/home/alex/ai/text-generation-webui/venv310/lib/python3.10/site-packages/llama_cpp/llama.py", line 725, in _create_completion
raise ValueError(f"Requested tokens exceed context window of {self._n_ctx}")
ValueError: Requested tokens exceed context window of 2048
Valuable info. Thx.
Also, in ooba there is another issue of a somewhat similar kind that seems like it's going to be dismissed as a llama-cpp-python problem; I feel somewhat hesitant to create new issues at this point.
In the issue you linked to there is a stack trace that directly points to a llama-cpp-python issue. Please log an issue, or I can do so for you.
I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening. This is what I got on the console:
I think this narrows it down: create_completion would throw an error when len(prompt_tokens) + max_tokens > n_ctx.
I've changed this to just truncate max_tokens. I'll publish an updated version.
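In sketch form, the change amounts to clamping rather than raising (this is just the idea, not the actual library diff):

```python
def clamp_max_tokens(n_prompt_tokens: int, max_tokens: int, n_ctx: int) -> int:
    """Truncate max_tokens instead of raising when a request would
    overflow the context window (sketch of the fix described above)."""
    if n_prompt_tokens + max_tokens > n_ctx:
        return max(0, n_ctx - n_prompt_tokens)
    return max_tokens
```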
@Priestru this is related to #183 actually, but thanks for reporting, I'll try to implement a fix that works outside of the server too.
The issue is that ooba is likely using a single Llama object in memory; when you click regenerate, the previous request is still running while a new one comes in, and this causes inconsistencies in the underlying library. The best workaround at the moment is what the llama-cpp-python server does, wrapping it in a lock, but this is not a good solution because it doesn't allow for easy generation interruption.
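The lock workaround would look roughly like this (a sketch, not the server's actual code; safe_generate is a hypothetical helper):

```python
import threading

# Serialize access to a single shared Llama object so that overlapping
# requests (e.g. a regenerate racing an in-flight generation) can't
# corrupt its internal state.
llama_lock = threading.Lock()

def safe_generate(llm, prompt, **kwargs):
    with llama_lock:  # blocks until any in-flight generation finishes
        return llm(prompt, **kwargs)
```

The trade-off mentioned above: while one request holds the lock, there is no clean way to interrupt it to serve the newer one.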
@Priestru I've created an oobabooga label to track issues that are being reported indirectly via text-generation-webui. Ideally, they should include a stack trace such as the one @agronholm kindly provided, which in turn enabled @abetlen to understand the issue.
Zero-token generation for larger prompts has been fixed in the newest update, but now we have a new bug.
n_ctx is default (2048)
llama_print_timings: load time = 17090.64 ms
llama_print_timings: sample time = 2.11 ms / 14 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 48184.79 ms / 1989 tokens ( 24.23 ms per token)
llama_print_timings: eval time = 8830.68 ms / 13 runs ( 679.28 ms per token)
llama_print_timings: total time = 57064.15 ms
Output generated in 57.34 seconds (0.24 tokens/s, 14 tokens, context 2034, seed 740796631)
Llama.generate: prefix-match hit
127.0.0.1 - - [10/Jun/2023 12:37:13] "GET /api/v1/model HTTP/1.1" 200 -
llama_print_timings: load time = 17090.64 ms
llama_print_timings: sample time = 2.05 ms / 14 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 9368.79 ms / 14 runs ( 669.20 ms per token)
llama_print_timings: total time = 9412.53 ms
Output generated in 9.68 seconds (1.45 tokens/s, 14 tokens, context 2034, seed 516672141)
Llama.generate: prefix-match hit
llama_print_timings: load time = 17090.64 ms
llama_print_timings: sample time = 2.03 ms / 14 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 9426.21 ms / 14 runs ( 673.30 ms per token)
llama_print_timings: total time = 9475.47 ms
Output generated in 9.75 seconds (1.44 tokens/s, 14 tokens, context 2034, seed 1241799934)
It only generates until it hits a total of 2048 tokens: the sum of the initial prompt + output.
The previously discovered workaround saves the day once again, because it allows the model to generate normally.
I set n_ctx to 2500. It results in:
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2500
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
.
llama_init_from_file: kv self size = 3808.59 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 9.01 seconds.
127.0.0.1 - - [10/Jun/2023 12:41:46] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Jun/2023 12:43:17] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Jun/2023 12:44:47] "GET /api/v1/model HTTP/1.1" 200 -
llama_print_timings: load time = 17040.96 ms
llama_print_timings: sample time = 42.47 ms / 289 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 58792.80 ms / 2034 tokens ( 28.91 ms per token)
llama_print_timings: eval time = 197917.31 ms / 288 runs ( 687.21 ms per token)
llama_print_timings: total time = 258369.00 ms
Output generated in 258.64 seconds (1.11 tokens/s, 288 tokens, context 2034, seed 802723436)
Version 0.1.62 or so doesn't fix it:
llama_print_timings: load time = 15599.80 ms
llama_print_timings: sample time = 5.32 ms / 35 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 23978.73 ms / 35 runs ( 685.11 ms per token)
llama_print_timings: total time = 24616.61 ms
Output generated in 24.91 seconds (1.40 tokens/s, 35 tokens, context 2013, seed 971044292)
Should I make a new issue to add visibility?
The llama 7B model is giving me very short responses; input and output as below:
Endpoint: http://localhost:PORT/v1/chat/completions
Request body:
{
"messages": [
{
"role": "system",
"content": "You are a helpful AI Assistant named MR AI"
},
{
"role": "user",
"content": "Write a big poem for me"
}
]
}
Response body:
{
"id": "chatcmpl-<id>",
"object": "chat.completion",
"created": <timestamp>,
"model": "llama.cpp/models/7B/ggml-model-q4_0.bin",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Are you sure? "
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 33,
"completion_tokens": 8,
"total_tokens": 41
}
}
[Update] The issue below is fixed, with a new bug emerging from the fix. See https://github.com/abetlen/llama-cpp-python/issues/331#issuecomment-1585536186