deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.92k stars · 1.93k forks

OpenAIGenerator uses chat_completions endpoint. Error with model that has no chat_template in config #8275

Open Permafacture opened 3 months ago

Permafacture commented 3 months ago

Describe the bug I'm using the OpenAIGenerator to access a vLLM endpoint on runpod. When using a base model like Mistral v0.3, which has not been instruction tuned and so has no chat template in its tokenizer config, I get an error back from the API endpoint. Digging into this, I see that the OpenAIGenerator uses the chat_completions endpoint rather than the completions endpoint. This means I've been unintentionally applying a chat template with other models up to this point.
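For context, a minimal sketch of the setup described above; the base URL, API key, and model name are placeholders, and the parameter names assume Haystack 2.x's OpenAIGenerator:

```python
# Minimal sketch of the setup described above. The base URL, model name,
# and API key are placeholders; adjust to your runpod/vLLM deployment.
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

generator = OpenAIGenerator(
    api_key=Secret.from_token("EMPTY"),             # vLLM typically ignores the key
    model="mistralai/Mistral-7B-v0.3",              # base model, no chat_template in its tokenizer config
    api_base_url="https://<runpod-id>-8000.proxy.runpod.net/v1",
)

# This sends the prompt to the chat completions endpoint under the hood,
# which is where the apply_chat_template() error comes from.
result = generator.run(prompt="And then, something unexpected happened.")
print(result["replies"][0])
```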

Error message "Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed!"

Expected behavior I expected the completions/ API endpoint to be used, so that the Hugging Face model would not try to call apply_chat_template().

Additional context I tried to use the client.completions method directly as a workaround.

completion = generator.client.completions.create(model=generator.model, prompt="And then, something unexpected happened.", **generator.generation_kwargs)

The process on the server crashes with "'NoneType' object has no attribute 'headers'".

System:

lbux commented 3 months ago

Can you please provide some sample code to try and reproduce the error?

I understand why it is happening (the completions API is legacy and might stop being supported by OpenAI). There are ways to get the chat completions endpoint to mimic the completions one, and that is what Haystack tries to do, but I'd need an example to see whether the issue is with vLLM or Haystack.

Permafacture commented 3 months ago

The header error is definitely on vLLM, or at least the fork the runpod folks are using. But I don't think it's right to have the text completion class use the chat completion endpoint. If the completions endpoint gets removed, then in my opinion it's best to let the calls fail and inform the user rather than fall back to a different endpoint. I was getting weird responses, and I would never have known why if I hadn't tried a model that didn't have a chat template.

For example, if you prompt "it was a normal summer day until something unexpected happened", the chat endpoint will respond "what happened?" rather than continuing the story.

If you want to keep things as they are to avoid breaking existing users' code, you could add a boolean kwarg like raw, just so the behavior is documented and users have the option of using the completions endpoint.

lbux commented 3 months ago

I definitely see benefits and downsides to using the basic completions vs the chat completions for the regular generator.

Using OpenAIGenerator with a "prompt" that is then converted to a ChatMessage in the backend lets users quickly try the generators without having to worry about roles. It also lets them use the most recent models available from OpenAI (4o and 4o mini are not available in the regular completions API).
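Roughly, the wrapping described here looks like the sketch below; this is an illustration of the idea, not the actual OpenAIGenerator source, though ChatMessage.from_user is a real Haystack helper:

```python
# Illustrative sketch only, not the actual OpenAIGenerator implementation:
# a plain prompt string is wrapped in a single user-role ChatMessage before
# being sent to the chat completions endpoint.
from haystack.dataclasses import ChatMessage

def to_chat_messages(prompt: str) -> list[ChatMessage]:
    # One user message, no system prompt, no roles for the caller to manage.
    return [ChatMessage.from_user(prompt)]
```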

In regard to completions... some models are definitely smart enough to finish typing what you write, and you can reinforce it by setting a system prompt that tells it how exactly to complete it.
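As a rough illustration of that system-prompt trick, using the plain openai client (model name and system wording are placeholders):

```python
# Sketch of the "reinforce with a system prompt" idea: steer the chat
# completions endpoint toward pure text continuation.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Continue the user's text verbatim. "
                                      "Do not answer, comment, or add role markers."},
        {"role": "user", "content": "And then, something unexpected happened."},
    ],
)
print(response.choices[0].message.content)
```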

And then, when it comes to setting the api_base_url and templates: since the chat completions endpoint is being used, some implementations of an OpenAI-API-compatible server may handle it differently. For example, this is how Ollama handles it:

By default, models imported into Ollama have a default template of {{ .Prompt }}, i.e. user inputs are sent verbatim to the LLM. This is appropriate for text or code completion models but lacks essential markers for chat or instruction models.

This means that Ollama can effectively mimic a completions call through the chat completions API even when a model has no template (as is the case with base models). vLLM does not seem to take this approach, and unless they add a default "fallback" template, you will probably need to provide your own that does what Ollama implemented.
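For illustration, a pass-through chat template in the spirit of Ollama's {{ .Prompt }} default could look like the sketch below. Recent vLLM versions accept a --chat-template argument for their OpenAI-compatible server; treat the exact flag and the Jinja variable names as assumptions to check against your vLLM version.

```python
# Sketch of a pass-through chat template, analogous to Ollama's default:
# user message contents are concatenated verbatim, with no role markers.
# The Jinja variables follow the Hugging Face chat-template convention.
passthrough_template = (
    "{% for message in messages %}"
    "{{ message['content'] }}"
    "{% endfor %}"
)

with open("passthrough.jinja", "w") as f:
    f.write(passthrough_template)

# The file could then be passed to the server on startup, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-v0.3 --chat-template passthrough.jinja
```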

It may be possible to add a flag as you suggested and conditionally call the regular completions API (put it in generation_kwargs and extract it if present), but I don't believe it should be the default behavior since most users are probably not using api_base_url.
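A rough sketch of what that opt-in could look like; the flag name use_completions_api is made up purely for illustration:

```python
# Hypothetical sketch of the opt-in described above: a flag placed in
# generation_kwargs is popped before the request and used to choose the
# endpoint. The flag name `use_completions_api` is invented for this example.
def run_generator(client, model, prompt, generation_kwargs=None):
    kwargs = dict(generation_kwargs or {})
    use_completions = kwargs.pop("use_completions_api", False)

    if use_completions:
        # Legacy text-completions endpoint: the prompt is sent verbatim,
        # no chat template is applied on the server side.
        response = client.completions.create(model=model, prompt=prompt, **kwargs)
        return response.choices[0].text
    else:
        # Current default: wrap the prompt as a single user chat message.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        return response.choices[0].message.content
```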

I'll leave the rest to the Haystack team to see how they wish to proceed.

vblagoje commented 2 months ago

cc @julian-risch to assign in the next sprint

vblagoje commented 2 months ago

@julian-risch I've read this issue report in detail and understand what @Permafacture is asking for, but in light of our plan to deprecate all generators I wonder how relevant work on such an issue would be. I recommend closing with "Won't fix".

julian-risch commented 2 months ago

So far it's only an idea. We have not decided yet whether to change anything about the generators. I'll move the issue to "Hold" in the meantime.