dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

llama_cpp - Multiple calls to 'choice' generator do not return results. #1109

Open willkurt opened 3 weeks ago

willkurt commented 3 weeks ago

Describe the issue as clearly as possible:

When using outlines.models.llama_cpp and making repeated calls to an instance of outlines.generate.choice, only the first call returns a result. This can be worked around by re-instantiating the generator for every call, but that is not an ideal solution.
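For reference, here is that workaround in context, using the names from the repro script below. It is only a sketch, and note that re-creating the generator rebuilds the guide for the choices on every iteration, so it is slow:

# Workaround sketch: re-create the choice generator for every prompt.
for complaint in complaint_data:
    generator_struct = generate.choice(model, departments)  # fresh generator per call
    result = generator_struct(create_prompt(complaint))
    print(f"result: {result}")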

The model used in the example code is taken directly from the Cookbook CoT example, but the issue also arose with multiple other models I had tried earlier.

The example code will produce the following output when I run it:

result: clothing
result: 
result: 

I am running this on an M2 Mac and an M3 MacBook.

Steps/code to reproduce the bug:

import llama_cpp
from outlines import generate, models
from textwrap import dedent

llama_tokenizer = llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
    "NousResearch/Hermes-2-Pro-Llama-3-8B"
)
# The underlying HF tokenizer is used below to apply the chat template.
tokenizer = llama_tokenizer.hf_tokenizer

model = models.llamacpp(
    "NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF",
    "Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
    tokenizer=llama_tokenizer,
    n_gpu_layers=-1,
    flash_attn=True,
    n_ctx=8192,
    verbose=False,
)

complaint_data = [{'message': 'Hi, my name is Olivia Brown.I recently ordered a knife set from your wellness range, and it arrived earlier this week. Unfortunately, my satisfaction with the product has been less than ideal.My order was A123456',
  'order_number': 'A12-3456',
  'department': 'kitchen'},
 {'message': 'Hi, my name is John Smith.I recently ordered a dress for an upcoming event, which was alleged to meet my expectations both in fit and style. However, upon arrival, it became apparent that the fabric was of subpar quality, leading to a less than satisfactory appearance.The order number is A12-3456',
  'order_number': 'A12-3456',
  'department': 'clothing'},
 {'message': 'Hi, my name is Sarah Johnson.I recently ordered the ultimate ChefMaster 8 Drawer Cooktop. However, upon delivery, I discovered that one of the burners is malfunctioning.My order was A458739',
  'order_number': 'A45-8739',
  'department': 'kitchen'}]

departments = ["clothing","electronics","kitchen","automotive"]

def create_prompt(complaint):
    prompt_messages = [
        {
            "role": "system",
            "content": "You are an agent designed to help label complaints."
        },
        {
            "role": "user",
            "content": dedent("""
            I'm going to provide you with a consumer complaint to analyze.
            The complaint is going to be regarding a product from one of our
            departments. Here is the list of departments:
                - "clothing"
                - "electronics"
                - "kitchen"
                - "automotive"
            Please reply with *only* the name of the department.
            """)
        },
        {
            "role": "assistant",
            "content": "I understand and will only answer with the department name"
        },
        {
            "role": "user",
            "content": f"Great! Here is the complaint: {complaint['message']}"
        }
    ]
    prompt = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
    return prompt

if __name__ == "__main__":
    generator_struct = generate.choice(model, departments)
    for complaint in complaint_data:
        prompt = create_prompt(complaint)
        result = generator_struct(prompt)
        print(f"result: {result}")

Expected result:

result: clothing
result: clothing
result: electronics

Error message:

No response

Outlines/Python version information:

Version information

0.0.46
Python 3.11.0 (main, Jul 6 2024, 12:54:41) [Clang 15.0.0 (clang-1500.3.9.4)]
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
attrs==24.2.0
certifi==2024.7.4
charset-normalizer==3.3.2
cloudpickle==3.0.0
datasets==2.21.0
dill==0.3.8
diskcache==5.6.3
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.24.6
idna==3.7
interegular==0.3.3
Jinja2==3.1.4
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
lark==1.2.2
llama_cpp_python==0.2.89
llvmlite==0.43.0
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
numba==0.60.0
numpy==1.26.4
outlines==0.0.46
packaging==24.1
pandas==2.2.2
pyairports==2.1.1
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.8.2
pydantic_core==2.20.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rpds-py==0.20.0
safetensors==0.4.4
six==1.16.0
sympy==1.13.2
tokenizers==0.19.1
torch==2.4.0
tqdm==4.66.5
transformers==4.44.1
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
xxhash==3.5.0
yarl==1.9.4

Context for the issue:

This issue arose while putting together an Outlines workshop for ODSC. I had originally hoped to use llama_cpp for the workshop, but this bug (and another, soon to be posted) was a blocker; I ended up using transformers instead.

cpfiffer commented 3 weeks ago

I had the same issue in a different application, but I figured it was mostly my inexperience. I believe I ended up recreating the generator each time, which is a temporary workaround for people who stumble on this issue.

Note that this will be slow and (I think) requires rebuilding the FSM each time.

lapp0 commented 1 day ago

The SequenceGeneratorAdapter should be creating a new logits processor each run, but it isn't.

Should be an easy fix.
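For anyone following along, here is a minimal sketch of the shape such a fix could take. The copy() method and the _generate helper below are assumptions made for illustration, not the adapter's actual internals or the eventual patch:

# Hypothetical sketch, not the actual patch. The idea: the adapter holds a
# stateful logits processor whose FSM state survives the first run, so a
# second run starts in a finished state and produces an empty result.
# Starting every call from a fresh processor avoids the stale state.
class SequenceGeneratorAdapter:
    def __call__(self, prompts, **kwargs):
        # copy() is an assumed API; any mechanism that resets per-run
        # state (a fresh copy, or rebuilding from the guide) would do.
        fresh_processor = self.logits_processor.copy()
        # _generate is a stand-in for however the adapter invokes the model.
        return self._generate(prompts, logits_processor=fresh_processor, **kwargs)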