DhruvaBansal00 closed this issue 2 weeks ago.
Unfortunately, this is a well-known issue with the vLLM integration when batch processing is used. Outlines' own runtime overhead is negligible.
This is because the logits processor is not batched and sits in the critical path of the inference engine. I suggest closing this issue and tracking it on the vLLM side, since the slowdown is not specific to Outlines.
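For context, vLLM applies logits processors in a Python loop, once per sequence at every decoding step, so the structured-generation mask is computed serially on the engine's hot path. Below is a minimal sketch of that pattern; the names and shapes are illustrative only, not vLLM's actual internals:

```python
import torch

def apply_logits_processors(logits: torch.Tensor, processors, past_token_ids):
    # Illustrative only: the per-sequence Python loop is the bottleneck.
    # Cost grows linearly with batch size, and the GPU idles while each
    # mask (e.g. Outlines' JSON-schema FSM) is computed on the CPU.
    for i in range(logits.shape[0]):        # one iteration per sequence
        for proc in processors[i]:
            logits[i] = proc(past_token_ids[i], logits[i])
    return logits
```

Addressing the slowdown would mean batching or vectorizing this step inside vLLM, which is why it needs to be tracked there.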
Describe the issue as clearly as possible:
Test Model: NousResearch/Meta-Llama-3-8B-Instruct
Inference Engine: vLLM v0.6.3.post1
GPU: A100 40GB

System Prompt: none
User Prompt:
Output the following JSON again as is without changing anything: {"First Name": "Anonymous", "Last Name": "Anonymous", "Email": "Anonymous", "Phone": "Anonymous", "Company": "Anonymous", "Title": "Anonymous", "LinkedIn": "Anonymous"} - Output the JSON only, nothing else.
I am initializing an async engine using vLLM with Outlines set as the backend for guided decoding. I am then sending 29 parallel requests to the above vLLM server, with and without a response format. The response format I am using is:
{'title': 'AnswerFormat',
 'description': 'Answer to the provided prompt.',
 'type': 'object',
 'properties': {'First Name': {'title': 'First Name', 'type': 'string'},
                'Last Name': {'title': 'Last Name', 'type': 'string'},
                'Email': {'title': 'Email', 'type': 'string'},
                'Phone': {'title': 'Phone', 'type': 'string'},
                'Company': {'title': 'Company', 'type': 'string'},
                'Title': {'title': 'Title', 'type': 'string'},
                'LinkedIn': {'title': 'LinkedIn', 'type': 'string'}},
 'required': ['First Name', 'Last Name', 'Email', 'Phone', 'Company', 'Title', 'LinkedIn'],
 'additionalProperties': False,
 'definitions': {}}
The total turnaround time for 1000 requests is 98.1s without the response format set. With the response format set to the above schema, it rises to 453.9s, a slowdown of roughly 4.6x, which seems surprisingly large. I have verified that the outputs produced in both cases are exactly the same every time with temperature set to 0.
Steps/code to reproduce the bug:
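No reproduction script was attached, so here is a hypothetical sketch of the setup described above. The server launch command, port, and the use of vLLM's OpenAI-compatible API with the `guided_json` extra parameter are assumptions based on the description, and the schema is a condensed equivalent of the one shown earlier:

```python
# Hypothetical sketch -- assumes the server was started roughly like:
#   vllm serve NousResearch/Meta-Llama-3-8B-Instruct --guided-decoding-backend outlines
import asyncio
import json
import time

from openai import AsyncOpenAI

FIELDS = ["First Name", "Last Name", "Email", "Phone", "Company", "Title", "LinkedIn"]
# Condensed version of the AnswerFormat schema shown above.
SCHEMA = {
    "title": "AnswerFormat",
    "description": "Answer to the provided prompt.",
    "type": "object",
    "properties": {f: {"title": f, "type": "string"} for f in FIELDS},
    "required": FIELDS,
    "additionalProperties": False,
}
PROMPT = (
    "Output the following JSON again as is without changing anything: "
    + json.dumps({f: "Anonymous" for f in FIELDS})
    + " - Output the JSON only, nothing else."
)


async def one_request(client: AsyncOpenAI, use_schema: bool) -> None:
    # vLLM's OpenAI-compatible server accepts the JSON schema via extra_body.
    extra = {"guided_json": SCHEMA} if use_schema else {}
    await client.chat.completions.create(
        model="NousResearch/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        extra_body=extra,
    )


async def bench(client: AsyncOpenAI, use_schema: bool,
                total: int = 1000, concurrency: int = 29) -> float:
    sem = asyncio.Semaphore(concurrency)  # keep 29 requests in flight

    async def guarded() -> None:
        async with sem:
            await one_request(client, use_schema)

    start = time.perf_counter()
    await asyncio.gather(*(guarded() for _ in range(total)))
    return time.perf_counter() - start


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for use_schema in (False, True):
        elapsed = await bench(client, use_schema)
        label = "with" if use_schema else "without"
        print(f"{label} response format: {elapsed:.1f}s")


if __name__ == "__main__":
    asyncio.run(main())
```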
Expected result:
Turnaround time with the response format set should be comparable to, or only modestly higher than, the run without it.
Error message:
No response
Outlines/Python version information:
Context for the issue:
No response