The `max_tokens` parameter in `completion()` is documented as the maximum number of completion tokens requested. However, starting at litellm/utils.py:821, the calculation of `max_tokens` for the API call doesn't reflect this behaviour.
In utils.py, `max_output_tokens` is obtained from `get_max_tokens()`, which is correct, but later in the calculation it is treated more like a `max_context_size`: the size of the user input is subtracted from it before the API call is made (lines 842-843).
Ironically, if the user input is larger than `max_output_tokens`, the call often completes correctly, because the subtraction is skipped and the call is simply expected to fail (lines 840-841):

```python
if input_tokens > max_output_tokens:
    pass  # allow call to fail normally
```
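The behaviour described above can be sketched as follows. This is a simplified paraphrase of the logic around litellm/utils.py:840-843, not the exact code; the function name `compute_request_max_tokens` is hypothetical.

```python
from typing import Optional


def compute_request_max_tokens(input_tokens: int, max_output_tokens: int) -> Optional[int]:
    """Paraphrase of the max_tokens adjustment around litellm/utils.py:840-843.

    max_output_tokens is the model's completion limit from get_max_tokens().
    """
    if input_tokens > max_output_tokens:
        # Lines 840-841: no adjustment is made; the call is expected to
        # fail on its own (and ironically often succeeds instead).
        return None
    # Lines 842-843 (the bug): max_output_tokens is treated like a context
    # size, so the prompt length is subtracted from the completion budget.
    return max_output_tokens - input_tokens
```

With a 4096-token output limit, a 4000-token prompt yields a 96-token completion budget, while a 5000-token prompt skips the adjustment entirely.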
What happened?
What happens:
For a hypothetical gpt-4o call with an input of 4000 tokens, the completion will be capped at ~96 tokens.
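The numbers above imply that `get_max_tokens()` was reporting a 4096-token output limit for gpt-4o; under that assumption, the arithmetic behind the ~96-token cap is:

```python
# Hypothetical figures implied by the example above (not measured values):
max_output_tokens = 4096  # assumed gpt-4o completion limit from get_max_tokens()
input_tokens = 4000       # size of the user's prompt

# The buggy subtraction leaves only the "remaining context" as the
# completion budget, instead of the full completion limit:
max_tokens = max_output_tokens - input_tokens  # 96
```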
Relevant log output
No response
Twitter / LinkedIn details
No response