kit1858644 opened 4 months ago
Can confirm this happens with llama.cpp as well.
Config file:

{
  "title": "Codellama-34b Instruct (llama.cpp)",
  "provider": "llama.cpp",
  "model": "models/quants/CodeLlama-34b-Instruct-q4_k.gguf",
  "apiBase": "http://localhost:8080",
  "contextLength": 4096, // <-- this should get passed as n_ctx into llama.cpp but doesn't for some reason
  "completionOptions": {
    "temperature": 0.2,
    "mirostat": 2
  }
},
Output from ./server <with options>:
{"tid":"0x1f9303ac0","timestamp":1716850340,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850351,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":321,"msg":"prompt eval time = 421.05 ms / 82 tokens ( 5.13 ms per token, 194.75 tokens per second)","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"n_prompt_tokens_processed":82,"t_token":5.134719512195122,"n_tokens_second":194.75260481608942}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":337,"msg":"generation eval time = 45600.84 ms / 1024 runs ( 44.53 ms per token, 22.46 tokens per second)","id_slot":0,"id_task":1026,"t_token_generation":45600.843,"n_decoded":1024,"t_token":44.5320732421875,"n_tokens_second":22.455725215430775}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":347,"msg":" total time = 46021.89 ms","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"t_token_generation":45600.843,"t_total":46021.89}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1794,"msg":"slot released","id_slot":0,"id_task":1026,"n_ctx":512,"n_past":340,"n_system_tokens":0,"n_cache_tokens":0,"truncated":true}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
I just ran a bunch of tests outside of Continue by running curl commands such as this:
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128, "n_ctx": 2048}'
The llama.cpp server output always showed the context length the model was loaded with: the default is 512, but I also loaded mine with 16384. The verbose output from the server terminal always reported n_ctx as whatever was set at model load. I looked through the documentation here, and it does not look like you can change the context length after the model is loaded; the n_ctx field in the returned JSON is only there for logging purposes.
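If that's the case, the practical workaround is to set the context size when launching the server rather than expecting a per-request value to take effect. A minimal sketch, assuming the standard server flags and reusing the model path from the config above:

./server -m models/quants/CodeLlama-34b-Instruct-q4_k.gguf -c 4096 --port 8080

Continue's contextLength would then just need to match whatever value was passed to -c / --ctx-size at startup.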
Before submitting your bug report
Relevant environment info
Description
I set the contextLength to 4096 in the config (as shown below), but the POST request sent by Continue specifies 1024:
Continue config:
Continue POST request:
To reproduce
No response
Log output
No response