Improve Token Management to Prevent LLM Token Overflow
Closed by kgilpin 1 month ago
The model is encountering a BadRequestError because the input exceeds its maximum allowable context length of 128,000 tokens. This results in an invalid request error and prevents the task from completing.
The root cause is that the input messages provided to the model exceed its maximum context length, resulting in a context_length_exceeded error. To handle this effectively, the input messages must be truncated or summarized before they are sent to the model. In addition, proper error handling and logging should be implemented so that such occurrences are managed gracefully.
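As a point of reference, the token count of a prompt can be checked before a request is sent. The sketch below is illustrative rather than part of the proposed patch: it assumes the tiktoken package is available and uses the cl100k_base encoding as a stand-in for whatever encoding the target model actually uses.

    import tiktoken

    # Minimal sketch: measure a prompt against the 128,000-token budget before sending it.
    # cl100k_base is an assumed encoding; use the one that matches the actual model.
    MAX_TOKENS = 128000
    encoding = tiktoken.get_encoding("cl100k_base")

    def fits_in_context(prompt: str, max_tokens: int = MAX_TOKENS) -> bool:
        return len(encoding.encode(prompt)) <= max_tokens

    if not fits_in_context("system prompt\n" + "a very long user message"):
        # Truncate or summarize before calling the model (see the steps below).
        pass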
To resolve this, the solution involves the following steps:
File: swebench/inference/run_api.py
Handle BadRequestError to provide a more informative log and to attempt retrying with suitable adjustments (a retry sketch follows the code below).
import logging

import openai

logger = logging.getLogger(__name__)


def truncate_input(input_str, max_tokens, encoding):
    """Truncate an input string so that it fits within the max token limit."""
    tokens = encoding.encode(input_str)
    if len(tokens) > max_tokens:
        truncated_tokens = tokens[:max_tokens]
        return encoding.decode(truncated_tokens)
    return input_str


def generate_completions(inputs, model_name_or_path, temperature, top_p, model_args, encoding):
    """Generate completions, truncating the input to the model's context window first."""
    truncated_inputs = truncate_input(inputs, 128000, encoding)
    # The first line of the prompt is treated as the system message, the remainder as the user message.
    system_message, user_message = truncated_inputs.split("\n", 1)
    try:
        response = openai.chat.completions.create(
            model=model_name_or_path,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            temperature=temperature,
            top_p=top_p,
            **model_args,
        )
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = calc_cost(response.model, input_tokens, output_tokens)  # calc_cost is defined elsewhere in run_api.py
        return response, cost
    except openai.BadRequestError as e:
        logger.error(f"BadRequestError: {e}")
        if "context_length_exceeded" in str(e):
            logger.error("Token length exceeded the limit. Consider truncating the inputs.")
            return None
        raise
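The step above also calls for retrying with suitable adjustments, which the patch itself does not do. The sketch below shows one hypothetical way to layer that on top of generate_completions: it re-truncates to a smaller budget whenever the call still fails with context_length_exceeded. The wrapper name, the halving policy, and the two-attempt limit are illustrative assumptions, not part of the change.

    def generate_with_retry(inputs, model_name_or_path, temperature, top_p, model_args, encoding,
                            max_tokens=128000, attempts=2):
        """Hypothetical retry wrapper: shrink the token budget and retry when the model
        still reports a context-length error (generate_completions returns None in that case)."""
        budget = max_tokens
        for _ in range(attempts):
            truncated = truncate_input(inputs, budget, encoding)
            result = generate_completions(truncated, model_name_or_path, temperature, top_p,
                                          model_args, encoding)
            if result is not None:
                return result  # (response, cost) tuple from generate_completions
            budget //= 2  # context length was still exceeded; try again with half the budget
        return None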
File: swebench/inference/run_llama.py
def truncate_input(input_str, max_tokens, tokenizer):
    """Truncate an input string so that it fits within the max token limit."""
    tokens = tokenizer.encode(input_str)
    if len(tokens) > max_tokens:
        truncated_tokens = tokens[:max_tokens]
        return tokenizer.decode(truncated_tokens)
    return input_str


def generate_llama(inputs, model_name_or_path, tokenizer, temperature, top_p, model_args):
    """Generate outputs with truncation."""
    truncated_inputs = truncate_input(inputs, 128000, tokenizer)
    # Continue to use 'truncated_inputs' for generation logic here
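To make that placeholder concrete, the truncated prompt could be passed to a standard transformers causal LM along the lines below. This is a sketch, not the actual run_llama.py logic: it assumes a model and tokenizer already loaded via the transformers library, and the max_new_tokens value is an arbitrary choice.

    import torch

    def generate_from_truncated(truncated_inputs, model, tokenizer, temperature, top_p, max_new_tokens=512):
        """Sketch of the generation step that would consume 'truncated_inputs'."""
        encoded = tokenizer(truncated_inputs, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(
                **encoded,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                max_new_tokens=max_new_tokens,
            )
        # Decode only the newly generated tokens, skipping the prompt.
        new_tokens = output_ids[0, encoded["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)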
File: swebench/inference/run_llama.py
Documented modifications to the sequence function as placeholders for the detailed implementation, so that context_length_exceeded errors can be handled gracefully.
These changes ensure that the inputs passed to the model comply with its token limit, preventing context_length_exceeded errors and allowing the task to be completed successfully.
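As an illustrative sanity check (not part of the change itself), the truncation helper can be exercised against the limit with tiktoken; the repeated filler text is just a way to build an oversized input, and the encoding choice is an assumption.

    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding; match it to the actual model
    long_text = "lorem ipsum dolor sit amet " * 100000
    truncated = truncate_input(long_text, 128000, encoding)
    print(len(encoding.encode(truncated)))  # expected to come out at or under 128,000 tokens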
Fixed in the backend, and we also limit the size of the test errors.
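For the second part of that fix, limiting the size of the test errors could look roughly like the helper below; the function name and the 10,000-character cap are illustrative assumptions, not the actual backend change.

    def limit_test_error_output(error_text: str, max_chars: int = 10000) -> str:
        """Keep the head and tail of an oversized test error log and drop the middle."""
        if len(error_text) <= max_chars:
            return error_text
        half = max_chars // 2
        return error_text[:half] + "\n... [truncated] ...\n" + error_text[-half:]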
https://github.com/getappmap/navie-benchmark/actions/runs/10902413861/job/30254261579#step:7:855
https://github.com/getappmap/navie-benchmark/actions/runs/10902413861/job/30254259689#step:7:643
https://github.com/getappmap/navie-benchmark/actions/runs/10900852238/job/30249741000#step:7:784
https://github.com/getappmap/navie-benchmark/actions/runs/10903822668/job/30258786867#step:7:817
https://github.com/getappmap/navie-benchmark/actions/runs/10903822668/job/30258788765#step:7:797
https://github.com/getappmap/navie-benchmark/actions/runs/10928536987/job/30337535905