dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0
9.69k stars 495 forks source link

outlines.generate.choice does not escape regex characters #1275

Open cpfiffer opened 6 days ago

cpfiffer commented 6 days ago

Describe the issue as clearly as possible:

Providing strings that include regex characters like . are not escaped, and are directly interpreted when using outlines.generate.choice.

For example, this can lead to the choice generator

# Choices
choices = ["Dr. Smith", "Prof. Jones"]

# Generator
riddler = generate.choice(model, choices)

producing an output like

Dru Smith

where the u should in fact be the literal .

Steps/code to reproduce the bug:

# Best import of all time
from outlines import models, generate
import torch
from transformers import AutoTokenizer

# Load a language model into memory
model = models.transformers(
    "microsoft/Phi-3-mini-4k-instruct",
    device="cuda",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
def to_prompt(text):
    return tokenizer.apply_chat_template([text], tokenize=False)

riddle = """
Dr. Smith teaches either Biology or Chemistry.
Prof. Jones teaches the subject Dr. Smith doesn't teach.
If the Biology teacher wears glasses,
and Prof. Jones doesn't wear glasses, who teaches Biology?
"""

# Choices
choices = ["Dr. Smith", "Prof. Jones"]

# Generator
riddler = generate.choice(model, choices)

# Generate a response
response = riddler(to_prompt(riddle))
print(response)

Expected result:

"Dr. Smith"

Error message:

N/A

Outlines/Python version information:

Version information

``` accelerate==1.1.1 aiohappyeyeballs==2.4.3 aiohttp==3.11.6 aiosignal==1.3.1 airportsdata==20241001 annotated-types==0.7.0 asttokens==2.4.1 async-timeout==5.0.1 attrs==24.2.0 certifi==2024.8.30 charset-normalizer==3.4.0 cloudpickle==3.1.0 comm==0.2.2 datasets==3.1.0 debugpy==1.8.8 decorator==5.1.1 dill==0.3.8 diskcache==5.6.3 exceptiongroup==1.2.2 executing==2.1.0 filelock==3.16.1 frozenlist==1.5.0 fsspec==2024.9.0 huggingface-hub==0.26.2 idna==3.10 interegular==0.3.3 ipykernel==6.29.5 ipython==8.29.0 jedi==0.19.2 Jinja2==3.1.4 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 jupyter_client==8.6.3 jupyter_core==5.7.2 lark==1.2.2 markdown-it-py==3.0.0 MarkupSafe==3.0.2 matplotlib-inline==0.1.7 mdurl==0.1.2 mpmath==1.3.0 multidict==6.1.0 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.4.2 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.1.105 outlines==0.1.3 outlines_core==0.1.14 packaging==24.2 pandas==2.2.3 parso==0.8.4 pexpect==4.9.0 pillow==11.0.0 platformdirs==4.3.6 prompt_toolkit==3.0.48 propcache==0.2.0 psutil==6.1.0 ptyprocess==0.7.0 pure_eval==0.2.3 pyarrow==18.0.0 pycountry==24.6.1 pydantic==2.9.2 pydantic_core==2.23.4 Pygments==2.18.0 python-dateutil==2.9.0.post0 pytz==2024.2 PyYAML==6.0.2 pyzmq==26.2.0 referencing==0.35.1 regex==2024.11.6 requests==2.32.3 rich==13.9.4 rpds-py==0.21.0 safetensors==0.4.5 six==1.16.0 stack-data==0.6.3 sympy==1.13.1 tokenizers==0.20.3 torch==2.4.0 tornado==6.4.1 tqdm==4.67.0 traitlets==5.14.3 transformers==4.46.3 triton==3.0.0 typing_extensions==4.12.2 tzdata==2024.2 urllib3==2.2.3 wcwidth==0.2.13 xxhash==3.5.0 yarl==1.17.2 ```

Context for the issue:

No response