Describe the issue as clearly as possible:
Hi and thank you for your great work!
I found a problem when using outlines with a local LLM served by a llama.cpp backend on my Mac M3. When the same generator and model are used several times in a row for constrained generation, the AsyncOpenAI event loop closes unexpectedly, which triggers a retry.
The problem is that the original request is still processed by my llama.cpp backend, so every prompt is processed twice when the same generator is reused. There is not exactly an error: I only found the traceback and the retry when logging at DEBUG level. I did not observe the problem with a text generator.
For now I handle it by creating a new model and generator instance every time I process a new request, which avoids the duplicate query, but that doesn't seem right and it may hide something else. Thank you!
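For reference, here is roughly what the workaround looks like (a minimal sketch; the ask helper is just an illustration, and the model name, base URL and API key are the placeholders from the repro below):
from outlines import models, generate
from pydantic import BaseModel

class Hello(BaseModel):
    text: str

def ask(prompt):
    # Hypothetical helper: rebuild the model and generator for every request so
    # each call starts from a fresh OpenAI client, avoiding the duplicate query.
    model = models.openai(
        "llama3.1", base_url="http://127.0.0.1:8080/v1/", api_key="<key>"
    )
    generator = generate.json(model, Hello)
    return generator(prompt)

response = ask("Say hello.")
response = ask("Say good morning.")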
Llama.cpp output
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
##########################
First Query
##########################
slot launchslot: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot release: id 0 | task 0 | stop processing: n_past = 25, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2335.83 ms / 13 tokens ( 179.68 ms per token, 5.57 tokens per second)
eval time = 857.46 ms / 13 tokens ( 65.96 ms per token, 15.16 tokens per second)
total time = 3193.29 ms / 26 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query
##########################
slot launchslot: id 0 | task 14 | processing task
slot update_slots: id 0 | task 14 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 14 | kv cache rm [0, end)
slot update_slots: id 0 | task 14 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 14 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 14 | stop processing: n_past = 24, truncated = 0
slot print_timing: id 0 | task 14 |
prompt eval time = 239.87 ms / 14 tokens ( 17.13 ms per token, 58.36 tokens per second)
eval time = 740.46 ms / 11 tokens ( 67.31 ms per token, 14.86 tokens per second)
total time = 980.33 ms / 25 tokens
request: POST /v1/chat/completions 127.0.0.1 200
##########################
Second Query (unwanted retry)
##########################
slot launchslot: id 0 | task 19 | processing task
slot update_slots: id 0 | task 19 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 14
slot update_slots: id 0 | task 19 | kv cache rm [0, end)
slot update_slots: id 0 | task 19 | prompt processing progress, n_past = 14, n_tokens = 14, progress = 1.000000
slot update_slots: id 0 | task 19 | prompt done, n_past = 14, n_tokens = 14
slot release: id 0 | task 19 | stop processing: n_past = 23, truncated = 0
slot print_timing: id 0 | task 19 |
prompt eval time = 240.52 ms / 14 tokens ( 17.18 ms per token, 58.21 tokens per second)
eval time = 665.85 ms / 10 tokens ( 66.58 ms per token, 15.02 tokens per second)
total time = 906.37 ms / 24 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
Steps/code to reproduce the bug:
from outlines import models, generate
from pydantic import BaseModel
import logging
logging.basicConfig(level="DEBUG")
class Hello(BaseModel):
    text: str

model = models.openai(
    "llama3.1", base_url="http://127.0.0.1:8080/v1/", api_key="<key>"
)
generator = generate.json(model, Hello)
# Query 1
response = generator("Say hello.")
# Query 2
response = generator("Say good morning.")
Expected result:
I would like to be able to reuse the same generator and model without the query being processed multiple times because of the retry.
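Concretely, the pattern I would expect to work is a single model and generator reused across prompts, for example (a minimal sketch based on the repro above; the prompt list is just an example):
from outlines import models, generate
from pydantic import BaseModel

class Hello(BaseModel):
    text: str

model = models.openai(
    "llama3.1", base_url="http://127.0.0.1:8080/v1/", api_key="<key>"
)
generator = generate.json(model, Hello)

# One model/generator pair, many prompts: each prompt should hit the
# llama.cpp backend exactly once, with no retry.
for prompt in ["Say hello.", "Say good morning.", "Say good night."]:
    response = generator(prompt)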
Error message:
INFO:httpx:HTTP Request: POST http://127.0.0.1:8080/v1/chat/completions "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Request: POST http://127.0.0.1:8080/v1/chat/completions "200 OK"
DEBUG:asyncio:Using selector: KqueueSelector
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': 'Say good morning.'}], 'model': 'llama3.1', 'frequency_penalty': 0, 'logit_bias': {}, 'max_tokens': None, 'n': 1, 'presence_penalty': 0, 'response_format': {'type': 'json_schema', 'json_schema': {'name': 'default', 'strict': True, 'schema': {'properties': {'text': {'title': 'Text', 'type': 'string'}}, 'required': ['text'], 'title': 'Hello', 'type': 'object'}}}, 'seed': None, 'stop': None, 'temperature': 1.0, 'top_p': 1, 'user': ''}}
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_headers.failed exception=RuntimeError('Event loop is closed')
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.failed exception=RuntimeError('Event loop is closed')
DEBUG:openai._base_client:Encountered Exception
Traceback (most recent call last):
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/openai/_base_client.py", line 1571, in _request
response = await self._client.send(
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpx/_client.py", line 1674, in send
response = await self._send_handling_auth(
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpx/_client.py", line 1702, in _send_handling_auth
response = await self._send_handling_redirects(
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpx/_client.py", line 1739, in _send_handling_redirects
response = await self._send_single_request(request)
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpx/_client.py", line 1776, in _send_single_request
response = await transport.handle_async_request(request)
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
resp = await self._pool.handle_async_request(req)
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
raise exc from None
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
response = await connection.handle_async_request(
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
return await self._connection.handle_async_request(request)
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/http11.py", line 142, in handle_async_request
await self._response_closed()
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/http11.py", line 257, in _response_closed
await self.aclose()
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_async/http11.py", line 265, in aclose
await self._network_stream.aclose()
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 55, in aclose
await self._stream.aclose()
File "/Users/mdelaferriere/Library/Caches/pypoetry/virtualenvs/llm-playground-NiJBKNsa-py3.10/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1287, in aclose
self._transport.close()
File "/Users/mdelaferriere/.pyenv/versions/3.10.4/lib/python3.10/asyncio/selector_events.py", line 697, in close
self._loop.call_soon(self._call_connection_lost, None)
File "/Users/mdelaferriere/.pyenv/versions/3.10.4/lib/python3.10/asyncio/base_events.py", line 750, in call_soon
self._check_closed()
File "/Users/mdelaferriere/.pyenv/versions/3.10.4/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
DEBUG:openai._base_client:2 retries left
INFO:openai._base_client:Retrying request to /chat/completions in 0.408067 seconds
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': 'Say good morning.'}], 'model': 'llama3.1', 'frequency_penalty': 0, 'logit_bias': {}, 'max_tokens': None, 'n': 1, 'presence_penalty': 0, 'response_format': {'type': 'json_schema', 'json_schema': {'name': 'default', 'strict': True, 'schema': {'properties': {'text': {'title': 'Text', 'type': 'string'}}, 'required': ['text'], 'title': 'Hello', 'type': 'object'}}}, 'seed': None, 'stop': None, 'temperature': 1.0, 'top_p': 1, 'user': ''}}
DEBUG:httpcore.connection:connect_tcp.started host='127.0.0.1' port=8080 local_address=None timeout=5.0 socket_options=None
Outlines/Python version information:
Context for the issue:
This doesn't block my work since I found a workaround, but I think the problem is worth looking into.