dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Regex generation does not work #1118

Closed aw632 closed 1 month ago

aw632 commented 1 month ago

Describe the issue as clearly as possible:

Regex generation with certain regex strings will produce strings that don't match the regex.

I have this regex string:

^(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)$

Since interegular anchors patterns implicitly, I use this instead:

(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)

However, with this pattern I am getting outputs that don't match the original (anchored) regex. Instead, the outputs are only consistent with the regex without anchoring. See this test online, and try removing or adding the ^ and $: https://regex101.com/r/EZIPmV/2. For instance, I'm getting outputs like

[TOOL_CALLS]  {}, {"name": "

The interegular maintainer confirmed that anchoring is implicit in that dependency, so I've narrowed the problem down to outlines itself.
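The difference between anchored and unanchored matching can be checked with Python's standard `re` module alone. The following is a minimal sketch, not part of the original report: the sample strings are illustrative rather than real model output, and the helper name `check` is hypothetical. `re.fullmatch` behaves like the `^...$` form, while `re.match` only anchors at the start of the string.

```python
import re

# The unanchored pattern from the report.
PATTERN = r"(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)"

# Illustrative sample strings (assumed for this sketch).
samples = [
    "[TOOL_CALLS] {}",
    '[TOOL_CALLS] {}, {"name": "toolname"}',
    "The answer is 14.",
]


def check(text):
    """Return (prefix_match, full_match) for the pattern against `text`."""
    prefix_match = re.match(PATTERN, text) is not None
    full_match = re.fullmatch(PATTERN, text) is not None
    return prefix_match, full_match


results = {text: check(text) for text in samples}
for text, (prefix_match, full_match) in results.items():
    print(f"{text!r}: match={prefix_match}, fullmatch={full_match}")
```

A string that begins with `[TOOL_CALLS] {}` and then continues passes the prefix check but fails the full-string check, which is the behavior expected of a generator that anchors implicitly.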

Note: generate will strip out the special tokens like [TOOL_CALLS], but you can see them if you modify generate or if you use a different inference engine like vLLM (I was able to produce the same issues there as well).

Steps/code to reproduce the bug:

from outlines import models, generate

model = models.transformers("mistralai/Mistral-Nemo-Instruct-2407")

generator = generate.regex(
    model,
    r'(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)',
)

prompt = """[AVAILABLE_TOOLS][{"type": "function", "function": {"name": "add", "description": "add two numbers", "parameters": {"type": "object", "properties": {"a": {"description": "First number", "type": "integer"}, "b": {"description": "Second number", "type": "integer"}}, "required": ["a", "b"]}}}, {"type": "function", "function": {"name": "multiply", "description": "multiply two numbers", "parameters": {"type": "object", "properties": {"a": {"description": "First number", "type": "integer"}, "b": {"description": "Second number", "type": "integer"}}, "required": ["a", "b"]}}}][/AVAILABLE_TOOLS][INST]What is 5 + 9?[/INST]"""
answer = generator(prompt, max_tokens=300)

print(f"{answer=}")

Expected result:

[TOOL_CALLS] {}

Error message:

No response

Outlines/Python version information:

Version information

```
0.0.46
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
```

Context for the issue:

No response

lapp0 commented 1 month ago

The FSM produced by interegular cannot produce any complete strings.

This is likely caused by interegular's incomplete implementation of negative lookarounds:

>>> import interegular
>>> pattern = r"(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)"
>>> fsm = interegular.parse_pattern(pattern).to_fsm()
>>> ["".join(s) for s in fsm.strings(100)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py", line 684, in strings
    raise ValueError(f"Couldn't find an example within {max_iterations} iterations")
ValueError: Couldn't find an example within 100 iterations

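The pattern matches with re (re.match only anchors at the start, so the second result below is a prefix match), but the interegular FSM accepts neither string: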

>>> import re
>>> re.match(pattern, "[TOOL_CALLS] {}")
<re.Match object; span=(0, 15), match='[TOOL_CALLS] {}'>
>>> fsm.accepts("[TOOL_CALLS] {}")
False
>>> re.match(pattern, '[TOOL_CALLS] {}, {"name": "toolname"}')
<re.Match object; span=(0, 15), match='[TOOL_CALLS] {}'>
>>> fsm.accepts('[TOOL_CALLS] {}, {"name": "toolname"}')
False

You might consider a simpler pattern. Please let me know if you have any other questions.
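One possible direction for a simpler pattern is to drop the negative lookahead and instead spell out "does not start with `[TOOL_CALLS]`" as a plain alternation. The sketch below is an assumption-laden illustration, not something proposed in this thread: `not_starting_with` is a hypothetical helper, equivalence is only checked against Python's `re` with `re.fullmatch`, and it has not been verified against interegular or outlines. It also works at the character level, so it says nothing about how a tokenizer treats `[TOOL_CALLS]` as a special token.

```python
import re


def not_starting_with(literal: str) -> str:
    """Build a lookahead-free regex for single-line strings that do not
    start with `literal` (hypothetical helper for illustration)."""
    branches = []
    for i, ch in enumerate(literal):
        prefix = re.escape(literal[:i])
        # A proper prefix of the literal, followed either by nothing or by
        # a character that diverges from the literal and then anything.
        # "\n" is excluded so the class agrees with "." (no newlines).
        branches.append(f"{prefix}(?:[^{re.escape(ch)}\\n].*)?")
    return "(?:" + "|".join(branches) + ")"


LOOKAHEAD = r"(\[TOOL_CALLS\] \{\}|(?!\[TOOL_CALLS\]).*)"
LOOKAHEAD_FREE = r"(?:\[TOOL_CALLS\] \{\}|" + not_starting_with("[TOOL_CALLS]") + ")"

samples = [
    "[TOOL_CALLS] {}",
    '[TOOL_CALLS] {}, {"name": "toolname"}',
    "The answer is 14.",
    "",
    "[TOOL",
    "[TOOL_CALLS]",
    "[TOOL_CALLX] hi",
]

# Compare both patterns under full-string (anchored) matching.
agreement = {
    s: (re.fullmatch(LOOKAHEAD, s) is not None,
        re.fullmatch(LOOKAHEAD_FREE, s) is not None)
    for s in samples
}
for s, (with_lookahead, without_lookahead) in agreement.items():
    print(f"{s!r}: lookahead={with_lookahead}, lookahead_free={without_lookahead}")
```

The generated pattern grows linearly with the length of the excluded literal, so it stays manageable for a short marker like `[TOOL_CALLS]`.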