continuedev / continue


The contextLength setting is not effective #1364

Open kit1858644 opened 4 months ago

kit1858644 commented 4 months ago

Relevant environment info

- OS: macOS 12.7.4
- Continue: 0.8.31
- IDE: VS Code
- Model: gpt-4o


I set contextLength to 4096 in the config (shown below), but the POST request sent by Continue has max_tokens set to 1024:

Continue config:

{
  "title": "GPT-4o",
  "model": "gpt-4o",
  "provider": "openai",
  "contextLength": 4096,
  "apiKey": "sk-xxx",
  "apiBase": "",
  "systemMessage": "You are an expert software developer. You give helpful and concise responses."
}

Continue POST request:

{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert software developer. You give helpful and concise responses."
    },
    {
      "role": "user",
      "content": "hello"
    }
  ],
  "model": "gpt-4o",
  "max_tokens": 1024,
  "stream": true
}
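One possible reading of the config and request above: contextLength describes the model's total context window, while max_tokens in the request caps only the generated reply, so the two numbers need not match. The sketch below works through that arithmetic with the values from this report. The helper name `prompt_budget` and the idea that the prompt budget equals the window minus the reply cap are assumptions for illustration, not confirmed Continue behavior.

```python
def prompt_budget(context_length: int, max_tokens: int) -> int:
    """Tokens left for the prompt once the reply cap is reserved.

    Hypothetical helper: assumes the context window is shared between
    the prompt and the generated reply.
    """
    if max_tokens >= context_length:
        raise ValueError("max_tokens must be smaller than context_length")
    return context_length - max_tokens


# Values taken from the config (contextLength) and the request (max_tokens).
budget = prompt_budget(context_length=4096, max_tokens=1024)
print(budget)
```

If the goal is a longer reply rather than a larger window, the reply cap may be controlled by a separate setting such as maxTokens under completionOptions; that is an assumption and is not confirmed anywhere in this thread.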

To reproduce

No response

Log output

No response

sealad886 commented 4 months ago

Can confirm this happens with llama.cpp as well.

config file:

      "title": "Codellama-34b Instruct (llama.cpp)",
      "provider": "llama.cpp",
      "model": "models/quants/CodeLlama-34b-Instruct-q4_k.gguf",
      "apiBase": "http://localhost:8080",
      "contextLength": 4096,  # <-- this should get passed as n_ctx into llama.cpp but doesn't for some reason
      "completionOptions": {
        "temperature": 0.2,
        "mirostat": 2

Output from ./server <with options>:

{"tid":"0x1f9303ac0","timestamp":1716850340,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850351,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":321,"msg":"prompt eval time     =     421.05 ms /    82 tokens (    5.13 ms per token,   194.75 tokens per second)","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"n_prompt_tokens_processed":82,"t_token":5.134719512195122,"n_tokens_second":194.75260481608942}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":337,"msg":"generation eval time =   45600.84 ms /  1024 runs   (   44.53 ms per token,    22.46 tokens per second)","id_slot":0,"id_task":1026,"t_token_generation":45600.843,"n_decoded":1024,"t_token":44.5320732421875,"n_tokens_second":22.455725215430775}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":347,"msg":"          total time =   46021.89 ms","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"t_token_generation":45600.843,"t_total":46021.89}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1794,"msg":"slot released","id_slot":0,"id_task":1026,"n_ctx":512,"n_past":340,"n_system_tokens":0,"n_cache_tokens":0,"truncated":true}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
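The log lines above are one JSON object per line, so the context size the server actually used can be read off mechanically. This is a minimal sketch, assuming that format holds; `summarize_slots` is a hypothetical helper, and the sample lines are abridged copies of the output above (fields such as tid and timestamp are dropped).

```python
import json


def summarize_slots(log_text: str) -> dict:
    """Collect n_ctx values, context-shift count, and truncation from JSON log lines."""
    n_ctx_values = set()
    context_shifts = 0
    truncated = False
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partially written lines
        if "n_ctx" in entry:
            n_ctx_values.add(entry["n_ctx"])
        if entry.get("msg") == "slot context shift":
            context_shifts += 1
        if entry.get("truncated"):
            truncated = True
    return {
        "n_ctx": sorted(n_ctx_values),
        "context_shifts": context_shifts,
        "truncated": truncated,
    }


# Abridged sample based on the server output in this comment.
SAMPLE = """
{"level":"INFO","function":"update_slots","msg":"slot context shift","n_ctx":512,"n_past":511}
{"level":"INFO","function":"update_slots","msg":"slot context shift","n_ctx":512,"n_past":511}
{"level":"INFO","function":"update_slots","msg":"slot released","n_ctx":512,"n_past":340,"truncated":true}
{"level":"INFO","function":"update_slots","msg":"all slots are idle"}
"""

summary = summarize_slots(SAMPLE)
print(summary)
```

On this sample the only n_ctx value seen is 512, not the 4096 from the config, which matches the behavior described in this comment.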

CambridgeComputing commented 3 months ago

I just ran a bunch of tests outside of Continue using curl commands such as this one:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128, "n_ctx": 2048}'

The llama.cpp server output always showed the context length the model was loaded with: the default is 512, and I also tried loading mine with 16384. The verbose server output in the terminal always reported n_ctx as whatever was set at model load. I looked through the documentation, and it does not look like the context length can be changed after the model is loaded; n_ctx appears in the returned JSON only as logging information.
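If the context length is fixed at load time, the practical split is to pass it when starting the server and leave it out of each request. A rough sketch of that split, assuming the server binary is `./server` and accepts `-m`, `-c` (context size), and `--port`; the helper names and the model path are placeholders made up for illustration.

```python
import json
import shlex


def server_command(model_path: str, ctx_size: int, port: int = 8080) -> list:
    """Build the launch command; the context size is set here, at model load."""
    return ["./server", "-m", model_path, "-c", str(ctx_size), "--port", str(port)]


def completion_payload(prompt: str, n_predict: int) -> str:
    """Build a /completion request body; n_ctx is deliberately not included."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict})


# Placeholder model path; substitute a real .gguf file.
cmd = server_command("models/example.gguf", ctx_size=4096)
print(shlex.join(cmd))
print(completion_payload("Building a website can be done in 10 simple steps:", 128))
```

The command list can be handed to `subprocess.run` as-is; the payload matches the shape of the curl request above minus the n_ctx field, which the tests suggest has no effect per request.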