GoogleCloudPlatform / localllm

Apache License 2.0

Followed the instructions - running locally. Runs once then fails afterward #7

Closed: sungkim11 closed this issue 7 months ago

sungkim11 commented 7 months ago

Install the tools

pip3 install openai
pip3 install ./llm-tool/.

llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000

python3 querylocal.py

Actual Result: Works!

Run python3 querylocal.py again

Actual Result: Fails with the following traceback:

http://localhost:8000/v1
Traceback (most recent call last):
  File "/home/username/localllm/querylocal.py", line 40, in <module>
    chat_completion = client.chat.completions.create(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_utils/_utils.py", line 271, in wrapper
    return func(*args, **kwargs)
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 659, in create
    return self._post(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1200, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 889, in request
    return self._request(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 965, in _request
    return self._retry_request(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _retry_request
    return self._request(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 965, in _request
    return self._retry_request(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _retry_request
    return self._request(
  File "/home/username/miniconda3/envs/localllm/lib/python3.10/site-packages/openai/_base_client.py", line 980, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Internal Server Error
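
For reference, querylocal.py boils down to a small OpenAI-client script pointed at the local server, roughly along these lines (a sketch inferred from the traceback, not the exact file from the repo):

# Rough sketch of what querylocal.py does (inferred, not the repo's actual file):
# point the OpenAI client at the local llama_cpp server and ask for a chat completion.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the server started by `llm run ... 8000`
    api_key="not-used",                   # the local server doesn't check the key
)

chat_completion = client.chat.completions.create(
    model="local-model",  # placeholder name; the server serves whichever model it loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat_completion.choices[0].message.content)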

bobcatfish commented 7 months ago

Thanks for opening this @sungkim11! I think I've run into this myself (the running server seems to get into a bad state and returns an internal server error for all requests :S). I've been chalking it up to there not being enough resources available, but maybe it's something else.

Can you share any other details around how you're running this? e.g. how many CPUs you have and how much memory?

sungkim11 commented 7 months ago

Running this in WSL2, which has 20 CPU threads and 32GB of RAM per htop.

jordanh commented 7 months ago

I'm seeing the same thing; 16 GB heap available here.

pcfighter commented 7 months ago

I'm having the same issue with an i5-5300U CPU (4 cores) and 16 GB of DDR3 RAM, running in an Arch Linux toolbox on a Fedora Silverblue host. Is there a path where the webserver stores its logs? It would help with debugging; just a 500 error from the API doesn't tell me much :D

pcfighter commented 7 months ago

I think the problem is with the Python code used to run and interact with the model. When I run the model with just the command that the llm CLI uses:

python3 -m llama_cpp.server --model (path to model .gguf file downloaded from huggingface ) --host 0.0.0.0 --port 8000 --verbose true

and use the uvicorn API that the server exposes (http://localhost:8000/docs#/default/create_completion_v1_completions_post), I can interact with the model more than once per run :)

BTW, I used the /v1/completions endpoint; that seems to be the only endpoint that responds reasonably quickly on my PC.
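
For example, hitting that endpoint directly from Python looks roughly like this (field names assumed from the OpenAI-compatible schema on the /docs page, not copied from the repo):

# Minimal sketch of calling the server's /v1/completions endpoint directly.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system. A: ",
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=600,  # CPU-only generation can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])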

bobcatfish commented 7 months ago

Is there a path where the webserver stores its logs? It would help with debugging; just a 500 error from the API doesn't tell me much :D

Very good point XD I've opened #16 to take an incremental step forward here.

I can interact with the model more than once per run :)

Oh interesting, my experience was that I wasn't having any trouble interacting with the model multiple times, but I can't say 100% for sure without experimenting some more.

I think the problem is with the Python code used to run and interact with the model.

It definitely could be, though once the process is running, that Python code is no longer involved, so I'm not sure how it would be having an impact; the Python wrapper just runs that same command as a separate process:

https://github.com/GoogleCloudPlatform/localllm/blob/af144c9f4ff93e67ece5cae5512fe77aa1de784d/llm-tool/modelserving.py#L66-L86
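
i.e. conceptually something like this (an illustrative sketch, not the actual code at that link):

# Illustrative sketch only (not the real modelserving.py): the wrapper launches
# the same llama_cpp.server command that was run by hand above, as a child process.
import subprocess

cmd = [
    "python3", "-m", "llama_cpp.server",
    "--model", "/path/to/model.gguf",  # placeholder path
    "--host", "0.0.0.0",
    "--port", "8000",
]
proc = subprocess.Popen(cmd)
print(f"started llama_cpp.server as pid {proc.pid}")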

bobcatfish commented 7 months ago

Assuming I'm hitting the same issue (which I'm reproducing by running this on a 4-CPU cloud workstation), I've added some logging capabilities in #18, and when I get the internal server error, this is what I see:

2024-02-15 02:47:44,259 - uvicorn.error - ERROR - Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 170, in custom_route_handler
    response = await original_route_handler(request)
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/app.py", line 364, in create_chat_completion
    ] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1611, in create_chat_completion
    return handler(
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama_chat_format.py", line 350, in chat_completion_handler
    completion_or_chunks = llama.create_completion(
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1449, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 975, in _create_completion
    for token in self.generate(
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 645, in generate
    print("Llama.generate: prefix-match hit", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 412, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 91, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 146, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette_context/middleware/raw_middleware.py", line 92, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 203, in custom_route_handler
    ) = self.error_message_wrapper(error=exc, body=body)
  File "/usr/local/lib/python3.10/site-packages/llama_cpp/server/errors.py", line 136, in error_message_wrapper
    print(f"Exception: {str(error)}", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe

bobcatfish commented 7 months ago

I think the problem is with the Python code used to run and interact with the model.

@pcfighter you were totally right of course XD The problem IS in that tiny snippet of code above 🤣

  File "/usr/local/lib/python3.10/site-packages/llama_cpp/llama.py", line 645, in generate
    print("Llama.generate: prefix-match hit", file=sys.stderr)
BrokenPipeError: [Errno 32] Broken pipe

I'm pretty sure what's happening is that when I start the subprocess, I'm providing a pipe for stderr (and stdout) and then immediately closing it, so later on, when llama-cpp-python tries to write to stderr, it throws a broken pipe exception 🤦‍♀️

And it looks like llama-cpp-python writes to stderr whenever there's a prefix-match cache hit, so it kinda makes sense that running the same prompt more than once would cause this to happen.

Anyway, I'm putting in a quick fix to prevent this from happening 🙏 and opening an issue for the better fix (not breaking this pipe in the first place!)
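
To make the failure mode concrete, here's a small self-contained sketch (my own illustration with made-up names, not the actual localllm code or the eventual fix):

# Illustration of the bug described above (hypothetical example, not localllm code).
# The parent asks for a pipe for the child's stderr and then closes it; when the
# child later prints to stderr, it hits BrokenPipeError, just like the server traceback.
import subprocess
import sys

CHILD = (
    "import sys, time\n"
    "time.sleep(1)\n"
    "print('late message to stderr', file=sys.stderr)\n"
)

# Buggy pattern: pipe stderr, then close the only read end right away.
proc = subprocess.Popen([sys.executable, "-c", CHILD], stderr=subprocess.PIPE)
proc.stderr.close()  # nothing will ever read from this pipe again
print("buggy pattern, child exit code:", proc.wait())  # non-zero: BrokenPipeError in child

# One way to avoid it: point the child's stderr at a real file (or DEVNULL) instead
# of a pipe; the child inherits its own copy of the file descriptor, so closing the
# parent's handle doesn't break the child's writes.
with open("server.log", "ab") as log:
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stderr=log)
print("fixed pattern, child exit code:", proc.wait())  # zero: the write succeeds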