abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

max_tokens is None leads to internal server error #983

Open tpfau opened 9 months ago

tpfau commented 9 months ago


Expected Behavior

When a large query comes in and max_tokens is not set, and the model then fails because the token limit is exceeded, this should be reflected in the error response.
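
For illustration, this is the kind of error body a client would expect instead of a 500, assuming the server keeps its OpenAI-compatible error format (the "type" and "param" values below follow the OpenAI error schema and are illustrative):

expected_error = {
    "error": {
        # message taken from the actual exception in the logs below
        "message": "Requested tokens (6692) exceed context window of 2048",
        "type": "invalid_request_error",   # assumption: OpenAI-style error schema
        "param": "messages",               # assumption: OpenAI-style error schema
        "code": "context_length_exceeded",
    }
}
# Returned with an HTTP 400 status instead of an unhandled 500.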

Current Behavior

When the input is too large, a second error occurs inside the error handling itself: line 209 of app.py adds completion_tokens, which is None because max_tokens was not set, to prompt_tokens via completion_tokens + prompt_tokens. Either completion_tokens should be checked for None in that function, or max_tokens should be given a non-None default value.
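
A minimal sketch of the suggested guard inside the context_length_exceeded handler; the names follow the traceback below, and the surrounding code is paraphrased rather than copied from app.py:

# body.max_tokens is None when the client omits max_tokens
completion_tokens = body.max_tokens if body.max_tokens is not None else 0
requested_tokens = completion_tokens + prompt_tokens  # no longer fails with None + int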

Environment and Context

Not relevant to this issue.

Failure Information (for bugs)


Steps to Reproduce

1. Spin up a llama-cpp-python server.
2. Send a request with more than ~2048 tokens of input without specifying a max_tokens parameter.
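
A minimal reproduction sketch, assuming a local server on port 8000 with a 2048-token context window (host, port, and prompt are placeholders):

import requests

long_prompt = "word " * 5000  # comfortably more than 2048 tokens
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": long_prompt}],
        # no "max_tokens" key, so the server sees max_tokens=None
    },
)
print(resp.status_code)  # 500 instead of a clean 4xx error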

Failure Logs

INFO:     130.233.8.131:0 - "POST /v1/chat/completions HTTP/1.0" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/server/app.py", line 304, in custom_route_handler
    response = await original_route_handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/server/app.py", line 852, in create_chat_completion
    first_response = await run_in_threadpool(next, iterator_or_completion)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/llama_chat_format.py", line 222, in _convert_text_completion_chunks_to_chat
    for i, chunk in enumerate(chunks):
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/llama.py", line 1415, in _create_completion
    raise ValueError(
ValueError: Requested tokens (6692) exceed context window of 2048

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette_context/middleware/raw_middleware.py", line 92, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/server/app.py", line 334, in custom_route_handler
    ) = self.error_message_wrapper(error=exc, body=body)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/server/app.py", line 283, in error_message_wrapper
    return callback(body, match)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_cpp/server/app.py", line 209, in context_length_exceeded
    completion_tokens + prompt_tokens,
    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

K-Mistele commented 9 months ago

+1 on this one

K-Mistele commented 9 months ago

@tpfau I'm happy to work on a PR for this one - what do we think the ideal behavior is? If no max_tokens is specified, should we just not use a cap, or should we use the n_ctx context size as the max?
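
For the n_ctx option, a rough sketch of where the cap could be applied before the request reaches the model (request, prompt_text, and the placement are illustrative, not the actual app.py code):

if request.max_tokens is None:
    # cap the completion at whatever room is left in the context window
    prompt_token_count = len(llama.tokenize(prompt_text.encode("utf-8")))
    request.max_tokens = max(llama.n_ctx() - prompt_token_count, 1)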

K-Mistele commented 9 months ago

Possibly a duplicate of / related to #111?

tpfau commented 9 months ago

Honestly, I don't think it matters much what is used when nothing is specified. I'd probably say no limit, since that would be the intuitive behavior, or use something like the current max-token limit on OpenAI models.