Closed: sungkim11 closed this issue 7 months ago
Thanks for opening this @sungkim11! I think I've run into this myself (the running server seems to get into a bad state and return internal server errors for all requests :S). I've been chalking this up to there not being enough resources available, but maybe it's something else.
Can you share any other details around how you're running this? e.g. how many CPUs you have and how much memory?
Running this in WSL2, which has 20 CPU threads and 32GB of RAM per htop.
I'm seeing the same thing; 16 GB heap available here
I'm having the same issue with an i5-5300U CPU (4 cores) and 16 GB of DDR3 RAM, running in an Arch Linux toolbox on a Fedora Silverblue host. Is there a path where the webserver stores its logs? It would help with debugging; a bare 500 error from the API doesn't tell me much :D
I think the problem is with the Python code used to run and interact with the model. When I run the model directly with the command that the llm CLI uses:
python3 -m llama_cpp.server --model <path to the model .gguf file downloaded from Hugging Face> --host 0.0.0.0 --port 8000 --verbose true
and use the uvicorn API that the server exposes (http://localhost:8000/docs#/default/create_completion_v1_completions_post), I can interact with the model more than once per run :)
btw I used the /v1/completions endpoint; that seems to be the only endpoint that responds reasonably quickly on my PC
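For reference, a minimal sketch of querying that /v1/completions endpoint from Python (stdlib only; the prompt and max_tokens values here are just placeholders, and it assumes the server started above is listening on localhost:8000):

```python
import json
import urllib.request

# Build a completions request for the OpenAI-compatible API that
# llama_cpp.server exposes. Prompt and max_tokens are placeholder values.
payload = json.dumps({"prompt": "Q: Name a color. A:", "max_tokens": 16}).encode()
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
        # Completions responses carry the generated text in choices[0]["text"].
        print(body["choices"][0]["text"])
except OSError as exc:
    # No server running (or it returned an HTTP error); caught here so the
    # sketch is safe to run standalone.
    print(f"request failed: {exc}")
```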
Is there a path where the webserver stores its logs? It would help with debugging; a bare 500 error from the API doesn't tell me much :D
Very good point XD I've opened #16 to make an incremental step forward here.
I can interact with the model more than once per run :)
Oh, interesting. My experience was that I wasn't having any trouble interacting with the model multiple times, but I can't say 100% for sure without experimenting some more.
I think the problem is with the Python code used to run and interact with the model.
It definitely could be, though once the process is running, that Python code is no longer involved, so I'm not sure what it would be impacting; the Python wrapper just runs that same command as a separate process.
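The snippet from that comment isn't reproduced here, but as a rough sketch (names and paths are illustrative, not the project's actual code), the wrapper amounts to something like:

```python
import subprocess
import sys

# Illustrative sketch only: launch the same server command shown earlier in
# the thread as a separate process, capturing its output via pipes.
proc = subprocess.Popen(
    [
        sys.executable, "-m", "llama_cpp.server",
        "--model", "/path/to/model.gguf",
        "--host", "0.0.0.0",
        "--port", "8000",
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# In a real wrapper the pipes would then be read (or deliberately left
# alone); what the parent does with them turns out to matter later.
```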
Assuming I'm hitting the same issue (which I'm reproducing by running this on a 4-CPU cloud workstation), I've added some logging capabilities in #18, and when I get the internal server error, this is what I see:
2024-02-15 02:47:44,259 - uvicorn.error - ERROR - Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 170, in custom_route_handler
response = await original_route_handler(request)
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/app.py", line 364, in create_chat_completion
] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1611, in create_chat_completion
return handler(
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama_chat_format.py", line 350, in chat_completion_handler
completion_or_chunks = llama.create_completion(
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1449, in create_completion
completion: Completion = next(completion_or_chunks) # type: ignore
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 975, in _create_completion
for token in self.generate(
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 645, in generate
print("Llama.generate: prefix-match hit", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 412, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 91, in __call__
await self.simple_response(scope, receive, send, request_headers=headers)
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 146, in simple_response
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette_context/middleware/raw_middleware.py", line 92, in __call__
await self.app(scope, receive, send_wrapper)
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 203, in custom_route_handler
) = self.error_message_wrapper(error=exc, body=body)
File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 136, in error_message_wrapper
print(f"Exception: {str(error)}", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe
I think the problem is with the Python code used to run and interact with the model.
@pcfighter you were totally right of course XD The problem IS in that tiny snippet of code above 🤣
File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 645, in generate
print("Llama.generate: prefix-match hit", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe
I'm pretty sure what's happening is that when I start the subprocess, I'm providing a pipe for stderr (and stdout), then I'm immediately closing it, so later on when llama-cpp-python tries to write to stderr, it throws a broken pipe exception 🤦♀️
And it looks like llama-cpp-python does that whenever there's a prefix-match cache hit, so it kinda makes sense that running the same prompt more than once would cause this to happen.
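This failure mode is easy to reproduce in isolation. A minimal sketch (simplified and assumed, not the actual wrapper code): the parent hands the child a stderr pipe, closes it immediately, and the child's later write blows up:

```python
import subprocess
import sys

# The child sleeps briefly, then writes a log line to stderr -- mimicking
# llama-cpp-python's "prefix-match hit" message on a cache hit.
child = subprocess.Popen(
    [
        sys.executable, "-c",
        "import sys, time; time.sleep(1); "
        "print('Llama.generate: prefix-match hit', file=sys.stderr)",
    ],
    stderr=subprocess.PIPE,
)

# The parent closes the read end of the pipe right away, as described above.
# When the child writes a second later, there is no reader left, so the
# write fails with BrokenPipeError ([Errno 32]) inside the child.
child.stderr.close()
child.wait()
print(child.returncode)  # non-zero: the child died on the broken pipe
```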
Anyway, I'm putting in a quick fix to prevent this from happening 🙏 and opening an issue for the better fix (to not break this pipe in the first place!)
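A sketch of the kind of quick fix involved (illustrative only, not the actual patch): if the parent isn't going to read the child's stderr, hand it subprocess.DEVNULL instead of a pipe that gets closed:

```python
import subprocess
import sys

# A child process that writes a log line to stderr, the way llama-cpp-python
# does on a prefix-match cache hit. With stderr pointed at DEVNULL rather
# than a closed pipe, the write succeeds and the child exits cleanly.
child = subprocess.Popen(
    [
        sys.executable, "-c",
        "import sys; print('Llama.generate: prefix-match hit', file=sys.stderr)",
    ],
    stderr=subprocess.DEVNULL,
)
child.wait()
print(child.returncode)  # 0: no BrokenPipeError this time
```

Another option would be to keep the pipe open and drain it from a reader thread, which is closer to the "better fix" of not breaking the pipe at all.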
Install the tools
pip3 install openai
pip3 install ./llm-tool/.
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000
python3 querylocal.py
Actual Result: Works!
Run python3 querylocal.py again
Fails
http://localhost:8000/v1
Traceback (most recent call last):
File "/home/username/localllm/querylocal.py", line 40, in
chat_completion = client.chat.completions.create(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_utils/_utils.py", line 271, in wrapper
return func(*args, **kwargs)
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 659, in create
return self._post(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1200, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 889, in request
return self._request(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 965, in _request
return self._retry_request(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _retry_request
return self._request(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 965, in _request
return self._retry_request(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _retry_request
return self._request(
File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 980, in _request
raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Internal Server Error