continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains
https://docs.continue.dev/
Apache License 2.0

The contextLength setting is not effective #1364

Open · kit1858644 opened this issue 4 months ago

kit1858644 commented 4 months ago

Relevant environment info

- OS: macOS 12.7.4
- Continue: 0.8.31
- IDE: VS Code
- Model: gpt-4o

Description

I set contextLength to 4096 in the config (as shown below), but the POST request sent by Continue has max_tokens set to 1024:

Continue config:

    {
      "title": "GPT-4o",
      "model": "gpt-4o",
      "provider": "openai",
      "contextLength": 4096,
      "apiKey": "sk-xxx",
      "apiBase": "https://api.xxx.com/v1",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses."
    }

Continue POST request:

    {
      "messages": [
        {
          "role": "system",
          "content": "You are an expert software developer. You give helpful and concise responses."
        },
        {
          "role": "user",
          "content": "hello"
        }
      ],
      "model": "gpt-4o",
      "max_tokens": 1024,
      "stream": true
    }

To reproduce

No response

Log output

No response

sealad886 commented 4 months ago

Can confirm this happens with llama.cpp as well.

config file:

    {
      "title": "Codellama-34b Instruct (llama.cpp)",
      "provider": "llama.cpp",
      "model": "models/quants/CodeLlama-34b-Instruct-q4_k.gguf",
      "apiBase": "http://localhost:8080",
      "contextLength": 4096,  # <-- this should get passed as n_ctx into llama.cpp but doesn't for some reason
      "completionOptions": {
        "temperature": 0.2,
        "mirostat": 2
      }
    },

Output from ./server <with options>:

{"tid":"0x1f9303ac0","timestamp":1716850340,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850351,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":1026,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":0}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":321,"msg":"prompt eval time     =     421.05 ms /    82 tokens (    5.13 ms per token,   194.75 tokens per second)","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"n_prompt_tokens_processed":82,"t_token":5.134719512195122,"n_tokens_second":194.75260481608942}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":337,"msg":"generation eval time =   45600.84 ms /  1024 runs   (   44.53 ms per token,    22.46 tokens per second)","id_slot":0,"id_task":1026,"t_token_generation":45600.843,"n_decoded":1024,"t_token":44.5320732421875,"n_tokens_second":22.455725215430775}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"print_timings","line":347,"msg":"          total time =   46021.89 ms","id_slot":0,"id_task":1026,"t_prompt_processing":421.047,"t_token_generation":45600.843,"t_total":46021.89}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1794,"msg":"slot released","id_slot":0,"id_task":1026,"n_ctx":512,"n_past":340,"n_system_tokens":0,"n_cache_tokens":0,"truncated":true}
{"tid":"0x1f9303ac0","timestamp":1716850355,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
CambridgeComputing commented 3 months ago

I just ran a bunch of tests outside of Continue using curl commands such as this one:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128, "n_ctx": 2048}'

The llama.cpp server output always showed the context length the model was loaded with (the default is 512; I also loaded mine with 16384). The verbose output from the server terminal always reported n_ctx as whatever was set at model load, regardless of the n_ctx passed in the request. I looked through the documentation and it does not look like you can change the context length after the model is loaded; it is only echoed back in the returned JSON for logging purposes.
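
If it helps, one way to confirm what context size a running server actually loaded (instead of scraping the logs) might be its /props endpoint, assuming the build is recent enough to expose it; the response includes the slot settings, where n_ctx should show up:

    # ask the running server for its loaded settings; n_ctx should appear
    # in the response (e.g. under default_generation_settings)
    curl --request GET --url http://localhost:8080/props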