Kaotic3 opened this issue 8 months ago
Yes, that's the standard behavior of vLLM. We could possibly update the integration so it behaves similarly to the rest of Outlines. If someone wants to open a PR I'd be happy to review it.
Well only on vLLM without the openai wrapper.
If outlines utilised the same wrapper on the .serve as an option maybe - we would be able to nullify the return prompt with echo, and we would have a response that fits within quite a lot of existing tools, allowing outlines to slot into workflows / applications etc.
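For illustration, here is a minimal sketch of what that could look like against a vLLM OpenAI-compatible completions endpoint. The URL and model name are placeholders, and it assumes the server honors the standard `echo` field of the completions API (which defaults to false), so only the generated text comes back rather than prompt + completion:

```python
import requests

# Placeholder URL and model: an OpenAI-style completions request. With echo
# left off (or explicitly set to False), the response contains only the
# generated text, unlike the demo /generate endpoint discussed in this issue.
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "<s>[INST] Please provide the name of the capital of France [/INST]",
    "max_tokens": 256,
    "echo": False,
}

response = requests.post("http://localhost:8000/v1/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```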
I have the same issue, and the models I tried (google/gemma-7b-it and mistralai/Mistral-7B-Instruct-v0.2) consistently return very bad answers. Something's odd. When serving the same models with ollama I get pretty good results.
To confirm: my initial tests halted because of the echo issue, but I was also noticing that I wasn't getting good responses. For example, the "sample" code provided didn't give the answer Paris; it gave a blank response. I was using Mistral Instruct v0.2 as well.
I figured I would come back to that, as I presumed my code was at fault somewhere, but if someone else is seeing the same thing, the problem may lie outside of me.
@Kaotic3 if this is of any help, I managed to get things to work with something like this:
```python
import json
import requests
from pydantic import BaseModel, Field
import rich


class Data(BaseModel):
    city_name: str = Field(
        title="Name of the requested city",
    )


json_schema = Data.model_json_schema()

prompt = "Please provide the name of the capital of France"
full_prompt = "<s>[INST] " + prompt + " [/INST]"

payload = {
    "prompt": full_prompt,
    "max_tokens": 2048,
    "schema": json.dumps(json_schema),
}

response = requests.post("http://bigpu:8000/generate", json=payload)
response.raise_for_status()

output = response.json()['text'][0].replace(full_prompt, '')
rich.print_json(output)
```
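(For context: the `<s>[INST] ... [/INST]` wrapping above is the Mistral Instruct chat template, which is likely why an unwrapped prompt produced poor or blank answers, and the trailing `.replace(full_prompt, '')` is only there because the `/generate` endpoint echoes the prompt at the start of the returned text.)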
Ah, then it was on me and my code. I recognise that prompt format, so that makes sense.
I think my original issue, that echo can't be disabled on outlines.serve, still stands, but it's good to know I can get responses once I get there.
Yes, that's the standard behavior of vLLM. We could possibly update the integration so it behaves similarly to the rest of Outlines. If someone wants to open a PR I'd be happy to review it.
@rlouf vLLM only returns the full prompt if you aren't using the OpenAI-compatible vLLM server. When you look at the very top of the non-OpenAI-compatible server script, you see this:
"""
NOTE: This API server is used only for demonstrating usage of AsyncEngine
and simple performance benchmarks. It is not intended for production use.
For production use, we recommend using our OpenAI compatible server.
We are also not going to accept PRs modifying this file, please
change `vllm/entrypoints/openai/api_server.py` instead.
"""
Therefore, I don't think this is the flavor of vLLM that outlines should be supporting.
@rlouf I noticed nobody has responded to this. It's rather significant, so maybe I should create a separate issue? Yes, I see that the latest release of Outlines now supports the offline flavor of vLLM, but the OpenAI-compatible vLLM server is much more useful to many people.
vLLM now supports guided output out of the box, so you may want to go back to a standard vLLM deployment with the OpenAI-compatible API server. We had the same issue described here and were quite happy that we could use this new solution.
See: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
It actually uses Outlines under the hood.
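Following the linked docs, a minimal sketch of how this can look from the client side (the base URL, model name, and API key value are placeholders; it assumes a vLLM OpenAI-compatible server that accepts the documented guided_json extra parameter):

```python
from openai import OpenAI
from pydantic import BaseModel, Field


class Data(BaseModel):
    city_name: str = Field(title="Name of the requested city")


# Placeholder endpoint; vLLM's OpenAI-compatible server typically ignores the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "Please provide the name of the capital of France"}
    ],
    # vLLM-specific extra parameter for schema-guided decoding.
    extra_body={"guided_json": Data.model_json_schema()},
)
print(completion.choices[0].message.content)  # only the completion, no echoed prompt
```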
Describe the issue as clearly as possible:
When you ask a question, you get back the prompt you sent.
This is problematic in a RAG workflow.
It is not replicated in the Python examples, but it is what happens when you use outlines.serve.
This is why I say it is a bug.
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Not sure any of this matters. This is the standard output; it works this way regardless of which version you are using.
Context for the issue:
A RAG Workflow involves sending documentation to the AI model and asking it a question based on that documentation.
"Does this document contain the number 4?"
This can run to thousands of tokens, which are then returned to you as part of your "prompt".
And it isn't a simple matter of doing a "replace" on the prompt, because the documents you send are formatted differently on the way out than when they come back, so you would have to reformat every single document before you could replace it, which is difficult given the potential variations.
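As a stopgap for the difficulty described above, a minimal sketch (not part of Outlines or vLLM) of a best-effort prompt-stripping helper; the cleaner fix remains an echo option or the OpenAI-compatible server:

```python
def strip_echoed_prompt(returned_text: str, sent_prompt: str) -> str:
    """Best-effort removal of an echoed prompt from a returned completion.

    Sketch only: assumes any echo appears at the start of the returned text,
    and falls back to trimming the longest common leading run of characters
    when the echo is not byte-for-byte identical to what was sent.
    """
    if returned_text.startswith(sent_prompt):
        return returned_text[len(sent_prompt):]

    i = 0
    limit = min(len(returned_text), len(sent_prompt))
    while i < limit and returned_text[i] == sent_prompt[i]:
        i += 1
    return returned_text[i:]


# e.g. answer = strip_echoed_prompt(response.json()["text"][0], full_prompt)
```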
Utilising the Python examples that are NOT part of serving Outlines via vLLM does NOT result in the prompt being returned. They don't return the prompt to you.