dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0
9.06k stars 457 forks

Support Chat Templates as a parameter to the generate methods #756

Open chrsbats opened 7 months ago

chrsbats commented 7 months ago

Presentation of the new feature

When using Outlines with various open models available on Hugging Face, I have often needed to add chat template tags to the prompt I pass into a sequence generator. This usually dramatically increases the accuracy and quality of the response, because the underlying model was trained assuming a chat format. This is particularly true for small models like Phi-2.

For example, if you use a local Phi-2 model, this first example from the Outlines docs will often produce an incorrect response:

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

However, changing the prompt to use Phi-2's prompting format as follows fixes the problem:

prompt = """Instruct: You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!

Output:
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
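
For completeness, both snippets assume a model loaded along these lines (the exact checkpoint is an assumption; any transformers-backed model works):

import outlines

# Load a local Phi-2 checkpoint through the transformers backend.
model = outlines.models.transformers("microsoft/phi-2")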

Where does it fit in Outlines?

I propose something like this:

model = models.transformers("mistralai/Mistral-7B-v0.1")
sampler = samplers.multinomial(3)
template = templates.chatml(system_prompt="You are an expert mathematician")

generator = generate.text(model, sampler, template)
answer = generator("What is 2+2?")

There could be a generic template factory function that takes a Jinja2 template.

template = templates.create_template(jinja_template)(system_prompt="You are an expert mathematician")
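
As a rough illustration (purely hypothetical: create_template does not exist in Outlines today), such a factory could wrap Jinja2 directly:

import jinja2

# Hypothetical sketch: turn any Jinja2 chat format into a prompt-building callable.
def create_template(jinja_template: str):
    compiled = jinja2.Template(jinja_template)

    def factory(system_prompt: str):
        def render(query: str, history=()):
            return compiled.render(system=system_prompt, query=query, history=history)
        return render

    return factory

chatml_format = """<|im_start|>system
{{ system }}<|im_end|>
<|im_start|>user
{{ query }}<|im_end|>
<|im_start|>assistant
"""

template = create_template(chatml_format)(system_prompt="You are an expert mathematician")
print(template("What is 2+2?"))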

Are you willing to open a PR?

Yes.

chrsbats commented 7 months ago

Migrating from Discord to here to facilitate discussion on the possible design:

Another option would be to have a chat template prompt take another prompt as input:

answer = outlines.generate.text(model, sampler)(chat_template(system_prompt), prompt(input), max_tokens=100)

One argument against this approach is that the chat template is usually tied to the model itself, since it reflects the format the training data was passed to the model in. It's worth noting that Hugging Face's Transformers library ties the template to the tokenizer via:

 tokenizer.apply_chat_template
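
For reference, this is roughly how Transformers does it (a minimal sketch; the model id is just an example):

from transformers import AutoTokenizer

# The chat template ships with the tokenizer, not with Outlines.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a sentiment-labelling assistant."},
    {"role": "user", "content": "Review: This restaurant is just awesome!"},
]

# Renders the messages in the model's own chat format and appends the
# assistant turn marker so generation starts in the right place.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
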
chrsbats commented 7 months ago

In practice I tend to compose all these things (model, system prompt, chat template) and create a function f that simply takes the chat history and the next prompt and returns the next text output. Sometimes I want to change the temperature on the fly, depending on whether I want creative or deterministic output. I nearly always use a stop token instead of max tokens, as I prefer to ask the language model to return a paragraph or a sentence as a response and rely on the model to determine when it has finished answering.

answer = f(prompt(input_vars), chat_history, temp, stop)   
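
For concreteness, a rough sketch of how such an f could be composed on top of Outlines (make_chat_fn is a made-up name; it assumes samplers.multinomial accepts a temperature keyword and that generators accept stop_at, both available in recent Outlines releases):

from outlines import generate, samplers

def make_chat_fn(model, chat_template, system_prompt):
    def f(user_prompt, chat_history=(), temperature=0.7, stop="\n\n"):
        sampler = samplers.multinomial(temperature=temperature)
        generator = generate.text(model, sampler)
        # chat_template is any prompt function that renders the system prompt,
        # the history, and the new user turn into the model's chat format.
        full_prompt = chat_template(system_prompt, user_prompt, list(chat_history))
        return generator(full_prompt, stop_at=stop)
    return f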

I don't know what other people do, but it feels to me like this is the most common use case.

chrsbats commented 7 months ago

From experimenting with many models, I have a bunch of chat templates like the following:

import outlines

@outlines.prompt
def chatml_template(system,query,history=[]):
    '''<|im_start|>system
    {{ system }}<|im_end|>
    {% for example in history %}
    <|im_start|>user
    {{ example[0] }}<|im_end|>
    <|im_start|>assistant
    {{ example[1] }}<|im_end|>
    {% endfor %}
    <|im_start|>user
    {{ query }}<|im_end|>
    <|im_start|>assistant

    '''
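
Used with a generator, that template looks something like this (the model and the choice constraint are just examples):

model = outlines.models.transformers("microsoft/phi-2")

prompt = chatml_template(
    "You are a sentiment-labelling assistant.",
    "Is the following review positive or negative? Review: This restaurant is just awesome!",
)

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)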

Alternatively, these templates could just be included in the library somewhere.

chrsbats commented 7 months ago

One idea would be to have a generate.chat method, separate from generate.text, that supports a template in its constructor?
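
Something along these lines (entirely hypothetical; generate.chat does not exist today):

template = templates.chatml(system_prompt="You are an expert mathematician")
generator = generate.chat(model, sampler, template)
answer = generator("What is 2+2?", history=[])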

chrsbats commented 7 months ago

Another thing I sometimes do (which might influence the design) is to send in a chat history as a few-shot example using the chat template, and then ask for the next response to conform to a grammar, by building a function as follows:

answer = f(prompt(input_vars), chat_history, temp, answer_regex)  

During development, what I try to do is get the answer as close as possible to what I want without a regex/grammar. That way I know the LLM has a decent understanding of what I want and can generate correct responses with high probability without any additional help. Then I add a regex/grammar to the call as a guarantee, so I know for sure I will be able to parse whatever it sends back.

If I add a grammar too soon in the development process, I end up enforcing a low-probability output, which means the LLM didn't really know what to do and the quality of the output suffers (either I didn't prompt correctly, or the task was simply beyond the ability of the model and needs to be revised or split into sub-tasks).
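
As a concrete sketch of that two-stage workflow (the chat template and few-shot history are placeholders; generate.text and generate.regex are existing Outlines methods):

import outlines

model = outlines.models.transformers("microsoft/phi-2")

# few_shot_history is a placeholder: a list of (user, assistant) pairs.
prompt = chatml_template(
    "You are a sentiment-labelling assistant.",
    "Review: This restaurant is just awesome!",
    history=few_shot_history,
)

# Stage 1: unconstrained generation, to check the model already "gets it".
draft = outlines.generate.text(model)(prompt, stop_at="<|im_end|>")

# Stage 2: once the free-form answers look right, add the regex as a guarantee.
answer = outlines.generate.regex(model, r"(Positive|Negative)")(prompt)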

roberthoenig commented 6 months ago

@chrsbats I'd love to see chat support in Outlines. Do you have a workflow that allows you to generate structured outputs in chat mode without built-in Outlines support?

rlouf commented 6 months ago

Sorry I somehow missed that issue, I will take a look shortly!

chrsbats commented 6 months ago

@roberthoenig Sorry for the late reply. I've been away on holiday.

Yeah I do, I'd just have to extract it out of another project and clean it up a little. Creating a Chat class that uses outlines underneath would be another option for integration.

The main gotcha is that, depending on your model/backend, you have to be careful with your prompts due to #750

chrsbats commented 6 months ago

@roberthoenig @rlouf

This is how I do it. https://github.com/chrsbats/outlines-chat