dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0
9.45k stars 480 forks

vLLM Integration via outlines.server returning full prompt #752

Open Kaotic3 opened 8 months ago

Kaotic3 commented 8 months ago

Describe the issue as clearly as possible:

When you ask a question you get back the prompt you sent.

This is problematic in a RAG workflow.

It is not replicated in the python examples but it is what happens when you use outlines.serve.

This is why I say it is a bug.

Steps/code to reproduce the bug:

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is the capital of France?",
        "schema": {"type": "string", "maxLength": 5}
        }'

{"text":["What is the capital of France?\" The \""]}

Expected result:

{"text":["Paris"]

Error message:

No error message.

Outlines/Python version information:

Not sure any of this matters. This is the standard output; it behaves this way regardless of which version you are using.

Context for the issue:

A RAG Workflow involves sending documentation to the AI model and asking it a question based on that documentation.

"Does this document contain the number 4?"

This can run to thousands of tokens, which are then returned to you as part of your "prompt".

And it isn't a simple matter of doing a "replace" on the prompt: the documents you send out are formatted differently from how they come back, so you would have to reformat every single document before you could replace it, and that is a bit difficult given the potential variations.

Using the Python examples that do NOT go through the vLLM serving path does NOT result in the prompt being returned.

from pydantic import BaseModel

from outlines import models
from outlines import text

class User(BaseModel):
    name: str
    last_name: str
    id: int

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = text.generate.json(model, User)
result = generator("Create a user profile with the fields name, last_name and id")
print(result)
# User(name="John", last_name="Doe", id=11)

It doesn't return the prompt to you.

rlouf commented 8 months ago

Yes, that's the standard behavior of vLLM. We could possibly update the integration so it behaves similarly to the rest of Outlines. If someone wants to open a PR I'd be happy to review it.
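For anyone picking this up, a minimal sketch of what the change might look like, assuming outlines.serve builds its response the same way vLLM's demo api_server does (prepending the prompt to each completion); the variable names below come from that demo script and may differ in the actual file:

# In the request handler, instead of prepending the prompt:
#   text_outputs = [prompt + output.text for output in final_output.outputs]
# return only the generated text:
text_outputs = [output.text for output in final_output.outputs]
ret = {"text": text_outputs}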

Kaotic3 commented 8 months ago

Well, only on vLLM without the OpenAI wrapper.

If outlines offered the same wrapper on .serve as an option, we would be able to suppress the returned prompt with echo, and we would have a response that fits a lot of existing tools, allowing outlines to slot into workflows, applications, etc.
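For comparison, a sketch of the same request against a vLLM OpenAI-compatible completions endpoint (this assumes such a server is running locally and that the model name matches what it was launched with); the prompt only comes back if you explicitly ask for it with echo:

import requests

# Assumes a vLLM OpenAI-compatible server on port 8000 serving this model.
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "What is the capital of France?",
        "max_tokens": 16,
        "echo": False,  # the default: the completion comes back without the prompt
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])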

choucavalier commented 8 months ago

I have the same issue, and the models I tried (google/gemma-7b-it and mistralai/Mistral-7B-Instruct-v0.2) consistently return very bad answers. Something is odd. When serving the same models with ollama I get pretty good results.

Kaotic3 commented 8 months ago

To confirm, my initial tests halted due to the echo issue, but I was also not getting good responses. For example, the "sample" code provided didn't produce the answer Paris; it produced a blank response. I was using Mistral Instruct v0.2 as well.

I figured I would come back to that, as I presumed it was my code at fault somewhere - but if someone else is seeing the same, then it may be a problem outside of me.

choucavalier commented 8 months ago

@Kaotic3 if this is of any help, I managed to get things to work with something like this:

import json
import requests
from pydantic import BaseModel, Field
import rich

class Data(BaseModel):
    city_name: str = Field(
        title="Name of the requested city",
    )

json_schema = Data.model_json_schema()

prompt = "Please provide the name of the capital of France"

# Wrap the prompt in Mistral's [INST] instruct template.
full_prompt = "<s>[INST] " + prompt + " [/INST]"
payload = {
    "prompt": full_prompt,
    "max_tokens": 2048,
    "schema": json.dumps(json_schema),
}
response = requests.post("http://bigpu:8000/generate", json=payload)
response.raise_for_status()
# The server echoes the prompt, so strip it before pretty-printing the JSON output.
output = response.json()["text"][0].replace(full_prompt, "")
rich.print_json(output)

Kaotic3 commented 8 months ago

Ah, then that was on me and my code. I recognise that prompt formatting, so it makes sense.

I think my original point about echo not being available on outlines.serve still stands, but it's good to know I can get responses once I get there.

mhillebrand commented 7 months ago

Yes, that's the standard behavior of vLLM. We could possibly update the integration so it behaves similarly to the rest of Outlines. If someone wants to open a PR I'd be happy to review it.

@rlouf vLLM only returns the full prompt if you aren't using the OpenAI-compatible vLLM server. When you look at the very top of the non-OpenAI-compatible server script, you see this:

"""
NOTE: This API server is used only for demonstrating usage of AsyncEngine
and simple performance benchmarks. It is not intended for production use.
For production use, we recommend using our OpenAI compatible server.
We are also not going to accept PRs modifying this file, please
change `vllm/entrypoints/openai/api_server.py` instead.
"""

Therefore, I don't think this is the flavor of vLLM that outlines should be supporting.

mhillebrand commented 7 months ago

@rlouf I noticed nobody has responded to this. It's rather significant, so maybe I should create a separate issue? Yes, I see that the latest release of Outlines now supports the offline flavor of vLLM, but the OpenAI-compatible vLLM server is much more useful to many people.

elx42 commented 7 months ago

vLLM now supports guided output out of the box, so you may want to go back to a standard vLLM deployment with the OpenAI-compatible API server. We had the same issue described here and were quite happy that we could use this new option.

see: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
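For example, a sketch using the guided_json extra parameter with the openai client (the endpoint, API key, and model name here are placeholders; the parameter name follows the vLLM docs linked above):

from openai import OpenAI
from pydantic import BaseModel

class Data(BaseModel):
    city_name: str

# Point the client at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Please provide the name of the capital of France"}
    ],
    # Constrain the output to the Pydantic model's JSON schema.
    extra_body={"guided_json": Data.model_json_schema()},
)
print(completion.choices[0].message.content)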

rlouf commented 7 months ago

It actually uses Outlines under the hood.