dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Outlines guided generation taking ~4.5x longer than non-guided generation on vLLM #1241

Closed DhruvaBansal00 closed 2 weeks ago

DhruvaBansal00 commented 2 weeks ago

Describe the issue as clearly as possible:

Test Model: NousResearch/Meta-Llama-3-8B-Instruct
Inference Engine: vLLM v0.6.3.post1
GPU: A100 40GB

System Prompt: none
User Prompt: Output the following JSON again as is without changing anything: {"First Name": "Anonymous", "Last Name": "Anonymous", "Email": "Anonymous", "Phone": "Anonymous", "Company": "Anonymous", "Title": "Anonymous", "LinkedIn": "Anonymous"} - Output the JSON only, nothing else. Output:

I am initializing an async engine with vLLM, with Outlines set as the backend for guided decoding. I then send 29 parallel requests to the vLLM server, with and without a response format. The response format I am using is: {'title': 'AnswerFormat', 'description': 'Answer to the provided prompt.', 'type': 'object', 'properties': {'First Name': {'title': 'First Name', 'type': 'string'}, 'Last Name': {'title': 'Last Name', 'type': 'string'}, 'Email': {'title': 'Email', 'type': 'string'}, 'Phone': {'title': 'Phone', 'type': 'string'}, 'Company': {'title': 'Company', 'type': 'string'}, 'Title': {'title': 'Title', 'type': 'string'}, 'LinkedIn': {'title': 'LinkedIn', 'type': 'string'}}, 'required': ['First Name', 'Last Name', 'Email', 'Phone', 'Company', 'Title', 'LinkedIn'], 'additionalProperties': False, 'definitions': {}}
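For reference, that schema corresponds to a small Pydantic model. This is only an illustrative sketch (the snake_case field names and aliases below are mine, mirroring the dict); its `model_json_schema()` output is essentially the `response_format` dict above:

```python
# Illustrative sketch: a Pydantic model that produces (almost) the same JSON
# schema as the response_format dict used in this issue.
from pydantic import BaseModel, ConfigDict, Field


class AnswerFormat(BaseModel):
    """Answer to the provided prompt."""

    model_config = ConfigDict(extra="forbid")  # -> "additionalProperties": false

    first_name: str = Field(alias="First Name")
    last_name: str = Field(alias="Last Name")
    email: str = Field(alias="Email")
    phone: str = Field(alias="Phone")
    company: str = Field(alias="Company")
    title: str = Field(alias="Title")
    linkedin: str = Field(alias="LinkedIn")


schema = AnswerFormat.model_json_schema()  # dict closely matching response_format
```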

The total turnaround time for 1000 requests is 98.1s without the response format set. With the response format set to the above, the total turnaround time is 453.9s. I find it surprising that the response format increases latency by almost 4.5x. I have verified that the outputs produced in both cases are exactly the same every time with temperature set to 0.

Steps/code to reproduce the bug:

```python
# initializing the async vLLM engine with Outlines as the guided decoding backend
# (model_path, gpu_count, etc. are deployment-specific values)
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model=model_path,
    tensor_parallel_size=gpu_count,
    gpu_memory_utilization=gpu_memory_utilization,
    disable_log_stats=False,
    disable_log_requests=False,
    max_num_batched_tokens=max_model_len,
    max_model_len=max_model_len,
    enable_lora=use_lora,
    max_lora_rank=64,
    max_loras=32,
    guided_decoding_backend="outlines",
    disable_async_output_proc=True,
    **kwargs,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

```python
# sending requests in parallel
import concurrent.futures

import json5
import pandas as pd
import requests

compute_metrics_dataset = []
total_samples = 1000
endpoint = "<insert endpoint hosting vLLM server>"
# user prompt from the issue description above
prompt = 'Output the following JSON again as is without changing anything: {"First Name": "Anonymous", "Last Name": "Anonymous", "Email": "Anonymous", "Phone": "Anonymous", "Company": "Anonymous", "Title": "Anonymous", "LinkedIn": "Anonymous"} - Output the JSON only, nothing else. Output:'
response_format = {'title': 'AnswerFormat', 'description': 'Answer to the provided prompt.', 'type': 'object', 'properties': {'First Name': {'title': 'First Name', 'type': 'string'}, 'Last Name': {'title': 'Last Name', 'type': 'string'}, 'Email': {'title': 'Email', 'type': 'string'}, 'Phone': {'title': 'Phone', 'type': 'string'}, 'Company': {'title': 'Company', 'type': 'string'}, 'Title': {'title': 'Title', 'type': 'string'}, 'LinkedIn': {'title': 'LinkedIn', 'type': 'string'}}, 'required': ['First Name', 'Last Name', 'Email', 'Phone', 'Company', 'Title', 'LinkedIn'], 'additionalProperties': False, 'definitions': {}}

def process_sample(i):
    if i % 10 == 0:
        print(f"Currently processing sample {i} out of {total_samples}")
    messages = [{"role": "user", "content": prompt}]
    # baseline payload without guided decoding:
    # payload = {"messages": messages, "parameters": {"temperature": 0.0, "max_tokens": 4096}}
    payload = {"messages": messages, "parameters": {"temperature": 0.0, "max_tokens": 4096}, "response_format": response_format}
    response = None
    try:
        response = requests.post(
            endpoint,
            json=payload,
            timeout=120,
        )
        compute_metrics_dataset.append({
            "input": prompt,
            "gt": {"First Name": "Anonymous", "Last Name": "Anonymous", "Email": "Anonymous", "Phone": "Anonymous", "Company": "Anonymous", "Title": "Anonymous", "LinkedIn": "Anonymous"},
            "output": json5.loads(json5.loads(response.json())["generated_text"]),
        })
    except Exception as e:
        print(f"Exception {e}. Output: {response.text if response is not None else 'no response'}")

with concurrent.futures.ThreadPoolExecutor(max_workers=29) as executor:
    futures = [executor.submit(process_sample, i % total_samples) for i in range(total_samples)]
compute_metrics_dataset = pd.DataFrame.from_records(compute_metrics_dataset)
```
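The turnaround-time numbers quoted above can be collected by wrapping the thread pool run in a timer and running it once with the guided payload and once with the baseline payload. A minimal sketch, reusing `process_sample` and `total_samples` from the snippet above:

```python
# Sketch: measure the total turnaround time for one full batch of requests.
# Toggle the commented-out baseline payload in process_sample to compare
# guided vs. unguided runs.
import time
import concurrent.futures

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=29) as executor:
    futures = [executor.submit(process_sample, i % total_samples) for i in range(total_samples)]
elapsed = time.perf_counter() - start
print(f"Total turnaround time for {total_samples} requests: {elapsed:.1f}s")
```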

Expected result:

Similar turnaround time with and without the response format

Error message:

No response

Outlines/Python version information:

Version information

``` outlines@main ```

Context for the issue:

No response

rlouf commented 2 weeks ago

Unfortunately, this is a well-known issue with the vLLM integration when batch processing is used. Outlines' own runtime overhead is negligible.
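One way to check that claim is to time guided vs. unguided generation with Outlines directly on top of transformers, outside vLLM. A rough sketch, assuming the pre-1.0 `outlines.models.transformers` / `outlines.generate` API and reusing the `response_format` dict from the reproduction code above:

```python
# Rough sketch (not from the thread): time guided vs. unguided generation with
# Outlines on top of transformers to isolate Outlines' own runtime cost.
import json
import time

import outlines

model = outlines.models.transformers("NousResearch/Meta-Llama-3-8B-Instruct", device="cuda")
prompt = 'Output the following JSON again as is without changing anything: {"First Name": "Anonymous", "Last Name": "Anonymous", "Email": "Anonymous", "Phone": "Anonymous", "Company": "Anonymous", "Title": "Anonymous", "LinkedIn": "Anonymous"} - Output the JSON only, nothing else. Output:'

unguided = outlines.generate.text(model)
# the guide/FSM for the schema is built when the generator is created
guided = outlines.generate.json(model, json.dumps(response_format))

for name, generator in [("unguided", unguided), ("guided", guided)]:
    start = time.perf_counter()
    generator(prompt, max_tokens=256)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```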

aarnphm commented 2 weeks ago

This is because the logits processor is not batched and sits in the critical path of the inference engine. I suggest closing this issue and tracking it on the vLLM side, as this is not a problem with Outlines itself.
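For context on what "not batched" means here: vLLM runs per-request logits processors as plain Python callables, invoked once per sequence for every generated token inside the sampling step. A minimal sketch of that interface (illustrative only, not Outlines' actual processor):

```python
# Illustrative only: the shape of a per-request logits processor in vLLM.
# It is called once per sequence per generated token on the hot path, so any
# non-trivial Python/CPU work here (e.g. FSM state lookup and token masking
# for guided decoding) serializes across the batch.
from typing import List

import torch
from vllm import SamplingParams


def masking_processor(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # A guided-decoding processor would look up the current FSM state from
    # token_ids and set the logits of disallowed tokens to -inf. Here we return
    # the logits unchanged, just to show the call signature.
    return logits


params = SamplingParams(temperature=0.0, max_tokens=4096,
                        logits_processors=[masking_processor])
```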