getappmap / navie-benchmark

Navie benchmarks
MIT License

LLM token overflow #38

Closed kgilpin closed 1 month ago

kgilpin commented 1 month ago

https://github.com/getappmap/navie-benchmark/actions/runs/10902413861/job/30254261579#step:7:855

https://github.com/getappmap/navie-benchmark/actions/runs/10902413861/job/30254259689#step:7:643

https://github.com/getappmap/navie-benchmark/actions/runs/10900852238/job/30249741000#step:7:784

https://github.com/getappmap/navie-benchmark/actions/runs/10903822668/job/30258786867#step:7:817

https://github.com/getappmap/navie-benchmark/actions/runs/10903822668/job/30258788765#step:7:797

https://github.com/getappmap/navie-benchmark/actions/runs/10928536987/job/30337535905

BadRequestError: 400 This model's maximum context length is 128000 tokens. However, your messages resulted in 145087 tokens. Please reduce the length of the messages.
    at APIError.generate (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/error.js:45:20)
    at OpenAI.makeStatusError (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/core.js:275:33)
    at OpenAI.makeRequest (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/core.js:318:30)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/@langchain/openai/dist/chat_models.cjs:1306:29
    at async RetryOperation._fn (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/p-retry/index.js:50:12) {
  status: 400,
  headers: {
    'access-control-expose-headers': 'X-Request-ID',
    'alt-svc': 'h3=":443"; ma=86400',
    'cf-cache-status': 'DYNAMIC',
    'cf-ray': '8c48f0627ab107f1-IAD',
    connection: 'keep-alive',
    'content-length': '284',
    'content-type': 'application/json',
    date: 'Tue, 17 Sep 2024 12:01:00 GMT',
    'openai-organization': 'appland',
    'openai-processing-ms': '570',
    'openai-version': '2020-10-01',
    server: 'cloudflare',
    'set-cookie': '__cf_bm=hERDxd_ZPdwFt4WO_hUwED7E.M1NQB_xMCVQai3wLek-1726574460-1.0.1.1-3_Je1haI3kIsERBYVX4ecGo81KxbY4iND_kQkzpBf5lVy2YuhvjBWKyuh_krbWi2ByQuSxW6Wf4C.MVvuzHSyw; path=/; expires=Tue, 17-Sep-24 12:31:00 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None, _cfuvid=RgjuWKj2K0S_nQzpczywciQK95YBfxImMBYOmFAujHw-1726574460198-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None',
    'strict-transport-security': 'max-age=15552000; includeSubDomains; preload',
    'x-content-type-options': 'nosniff',
    'x-ratelimit-limit-requests': '10000',
    'x-ratelimit-limit-tokens': '30000000',
    'x-ratelimit-remaining-requests': '9999',
    'x-ratelimit-remaining-tokens': '29852333',
    'x-ratelimit-reset-requests': '6ms',
    'x-ratelimit-reset-tokens': '295ms',
    'x-request-id': 'req_0a07dd02f068a3cd2430a88ff62e6b15'
  },
  request_id: 'req_0a07dd02f068a3cd2430a88ff62e6b15',
  error: {
    message: "This model's maximum context length is 128000 tokens. However, your messages resulted in 145087 tokens. Please reduce the length of the messages.",
    type: 'invalid_request_error',
    param: 'messages',
    code: 'context_length_exceeded'
  },
  code: 'context_length_exceeded',
  param: 'messages',
  type: 'invalid_request_error',
  attemptNumber: 1,
  retriesLeft: 6
}
Handling exception: BadRequestError: 400 This model's maximum context length is 128000 tokens. However, your messages resulted in 145087 tokens. Please reduce the length of the messages.
Stack trace: Error: 400 This model's maximum context length is 128000 tokens. However, your messages resulted in 145087 tokens. Please reduce the length of the messages.
    at APIError.generate (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/error.js:45:20)
    at OpenAI.makeStatusError (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/core.js:275:33)
    at OpenAI.makeRequest (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/openai/core.js:318:30)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/@langchain/openai/dist/chat_models.cjs:1306:29
    at async RetryOperation._fn (/home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/node_modules/p-retry/index.js:50:12)
  instance_id: django__django-14915
  edit_test_file: tests/model_forms/test_modelchoicefield.py
  code_patch: True
  test_patch: True
  test_inverted_patch: True
  num_sent_chars: 1131750
  num_received_chars: 24489
  elapsed_time: 261.22370290756226
  lint_repair_count: 0
  test_generation_attempts: 2
  code_generation_attempts: 9
  pass_to_pass: True
  pass_to_fail: True
  fail_to_pass: False
  code_patch_score: 2
  appmap_data_test_status: 
  appmap_data_file_count: 
  appmap_data_context_size:
  File "/home/runner/work/navie-benchmark/navie-benchmark/solver/workflow/generate_test.py", line 137, in generate
    ).test(
      ^^^^^
  File "/home/runner/work/navie-benchmark/navie-benchmark/submodules/navie-editor/navie/editor.py", line 472, in test
    with_cache(
  File "/home/runner/work/navie-benchmark/navie-benchmark/submodules/navie-editor/navie/with_cache.py", line 25, in with_cache
    result = implementation_func()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/navie-benchmark/navie-benchmark/submodules/navie-editor/navie/editor.py", line 459, in _test
    self._build_client(work_dir).test(
  File "/home/runner/work/navie-benchmark/navie-benchmark/submodules/navie-editor/navie/client.py", line 304, in test
    self._execute(command, log_file)
  File "/home/runner/work/navie-benchmark/navie-benchmark/submodules/navie-editor/navie/client.py", line 378, in _execute
    raise RuntimeError(
RuntimeError: Failed to execute command APPMAP_NAVIE_TEMPERATURE=0.0 node /home/runner/work/navie-benchmark/navie-benchmark/submodules/appmap-js/packages/cli/built/cli.js navie --log-navie -i /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/generate-test/attempt-1_from-test_dataset.py/test-1/code-1/test/test.txt -p /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/generate-test/attempt-1_from-test_dataset.py/test-1/code-1/test/test.prompt.md --trajectory-file /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/trajectory.jsonl -o /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/generate-test/attempt-1_from-test_dataset.py/test-1/code-1/test/test.md > /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/generate-test/attempt-1_from-test_dataset.py/test-1/code-1/test/test.log 2>&1. See /home/runner/work/navie-benchmark/navie-benchmark/solve/pydata__xarray-6992/navie/generate-test/attempt-1_from-test_dataset.py/test-1/code-1/test/test.log for details.
github-actions[bot] commented 1 month ago

Title:

Improve Token Management to Prevent LLM Token Overflow

Problem:

The model is encountering a BadRequestError due to exceeding the maximum allowable context length of 128,000 tokens. This results in an invalid request error and prevents the completion of the task.

Analysis:

The root cause of the issue is that the input messages provided to the model exceed its maximum context length, resulting in a context_length_exceeded error. To effectively handle this, the input messages must be truncated or summarized before being sent to the model. Additionally, proper error handling and logging should be implemented to manage such occurrences gracefully.

To resolve this, the solution involves the following steps:

  1. Introduce pre-processing to truncate or summarize the input so it fits within the token limits (a token-estimation sketch follows this list).
  2. Implement efficient error handling to log the incident and possibly retry with adjusted inputs.
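
As a rough illustration of step 1, the token count of a chat request can be estimated before it is sent, so truncation only happens when the budget would actually be exceeded. This is a minimal sketch, assuming the tiktoken package; the cl100k_base encoding and the 4-token per-message overhead are approximations, not the exact accounting the API uses.

    import tiktoken

    def estimate_chat_tokens(messages, encoding_name="cl100k_base"):
        """Approximate the token count of a chat completion request."""
        encoding = tiktoken.get_encoding(encoding_name)
        total = 0
        for message in messages:
            total += 4  # rough allowance for per-message framing
            total += len(encoding.encode(message["content"]))
        return total

    messages = [
        {"role": "system", "content": "You are a code-generation assistant."},
        {"role": "user", "content": "...large context assembled by the solver..."},
    ]
    if estimate_chat_tokens(messages) > 128_000:
        pass  # truncate or summarize the user content before calling the model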

Proposed Changes:

  1. File: swebench/inference/run_api.py

    • Add a function to truncate or summarize the input messages, ensuring they fit within the token limit.
    • Modify the existing API call functions to utilize this truncation function before sending the input to the model.
    • Enhance error handling for BadRequestError to provide a more informative log and to attempt retrying with suitable adjustments.
    <!-- file: /home/runner/work/navie-benchmark/navie-benchmark/swebench/inference/run_api.py -->
    # These imports already exist in run_api.py; they are shown here so the snippet stands alone.
    import logging

    import openai

    logger = logging.getLogger(__name__)

    def truncate_input(input_str, max_tokens, encoding):
        """Truncate an input string to fit within the max token limit."""
        tokens = encoding.encode(input_str)
        if len(tokens) > max_tokens:
            truncated_tokens = tokens[:max_tokens]
            return encoding.decode(truncated_tokens)
        return input_str

    def generate_completions(inputs, model_name_or_path, temperature, top_p, model_args, encoding):
        """Generate completions, truncating the input to the model's context window first."""
        # 128,000 is the full context window; in practice a smaller budget should be
        # used so that room remains for the completion tokens.
        truncated_inputs = truncate_input(inputs, 128000, encoding)
        # The first line carries the system prompt; the remainder is the user message.
        parts = truncated_inputs.split("\n", 1)
        system_messages = parts[0]
        user_message = parts[1] if len(parts) > 1 else ""
        try:
            response = openai.chat.completions.create(
                model=model_name_or_path,
                messages=[
                    {"role": "system", "content": system_messages},
                    {"role": "user", "content": user_message},
                ],
                temperature=temperature,
                top_p=top_p,
                **model_args,
            )
            input_tokens = response.usage.prompt_tokens
            output_tokens = response.usage.completion_tokens
            cost = calc_cost(response.model, input_tokens, output_tokens)
            return response, cost
        except openai.BadRequestError as e:
            logger.error(f"BadRequestError: {e}")
            if "context_length_exceeded" in str(e):
                logger.error("Token length exceeded the limit. Consider truncating the inputs further.")
                return None  # caller must handle the missing response
            raise
  2. File: swebench/inference/run_llama.py

    • Add corresponding truncation where inputs are assembled, following the same approach to prevent token overflow.
    <!-- file: /home/runner/work/navie-benchmark/navie-benchmark/swebench/inference/run_llama.py -->
    def truncate_input(input_str, max_tokens, tokenizer):
        """Truncate input string to fit within the max token limit."""
        tokens = tokenizer.encode(input_str)
        if len(tokens) > max_tokens:
            truncated_tokens = tokens[:max_tokens]
            return tokenizer.decode(truncated_tokens)
        return input_str
    
    def generate_llama(inputs, model_name_or_path, tokenizer, temperature, top_p, model_args):
        """Generate outputs with truncation."""
        truncated_inputs = truncate_input(inputs, 128000, tokenizer)
        # Continue to use 'truncated_inputs' for generation logic here
  3. File: swebench/inference/run_llama.py — The modifications to the sequence-generation function are documented as placeholders for the detailed implementation.
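
For reference, the truncate_input helper proposed in item 1 could be exercised as follows. This is a hypothetical usage sketch, assuming tiktoken for the encoding; the 120,000-token budget (rather than the full 128,000) is an assumption that leaves headroom for the completion.

    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")
    assembled_context = "..."  # stand-in for the prompt text the solver builds up
    safe_prompt = truncate_input(assembled_context, 120_000, encoding)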

Summary:

  1. Add a truncation function: To process inputs to fit within token limits.
  2. Enhance error handling: To catch context_length_exceeded errors gracefully.
  3. Update existing functions: To include pre-processing steps to manage token limits effectively.

These changes ensure that the inputs passed to the model will comply with its token limit, thereby preventing context_length_exceeded errors and allowing the task to be completed successfully.
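
The retry behavior mentioned above could look roughly like the following. This is a sketch under assumptions, not the repository's implementation: it assumes the truncate_input helper from the proposed changes, the openai Python client's BadRequestError, and a hypothetical call_fn that issues the actual request.

    import openai

    def complete_with_backoff(call_fn, prompt, encoding, budget=120_000, floor=8_000):
        """Shrink the token budget and retry when the context window is exceeded."""
        while budget >= floor:
            try:
                return call_fn(truncate_input(prompt, budget, encoding))
            except openai.BadRequestError as e:
                if "context_length_exceeded" not in str(e):
                    raise
                budget //= 2  # halve the budget and retry with a shorter prompt
        raise RuntimeError("prompt could not be reduced to fit the model's context window")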

kgilpin commented 1 month ago

Fixed in the backend, and we also limit the size of the test errors.
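
A minimal sketch of what limiting the size of test errors might look like, assuming a simple character budget applied before the error text is added to a prompt; the names below are hypothetical, not the actual backend change.

    MAX_ERROR_CHARS = 20_000  # hypothetical budget for test error output

    def clamp_test_errors(error_text: str, limit: int = MAX_ERROR_CHARS) -> str:
        """Keep the head and tail of oversized test error output."""
        if len(error_text) <= limit:
            return error_text
        half = limit // 2
        return error_text[:half] + "\n... [truncated] ...\n" + error_text[-half:]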