dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

`AssertionError` in case-insensitive regex containing specific characters (`¤` `ß` `İ` `ʼn` `ǰ` `ΐ` `ΰ`) #773

Open cifkao opened 7 months ago

cifkao commented 7 months ago

Describe the issue as clearly as possible:

Certain characters trigger an `AssertionError` in `make_byte_level_fsm` when included in a case-insensitive regex group (e.g. `(?i:ß)`).

So far, I have found that any of the following characters triggers the error: `¤` `ß` `İ` `ʼn` `ǰ` `ΐ` `ΰ`
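A quick plain-Python diagnostic (independent of outlines) shows that most of these characters have Unicode case mappings that expand to more than one character; `¤` maps to itself under both `str.upper()` and `str.lower()`, so its trigger presumably lies elsewhere in the case-folding logic:

```python
# Check which of the reported characters have Unicode case mappings longer
# than one character (a plain-Python diagnostic, independent of outlines).
chars = ['¤', 'ß', 'İ', 'ʼn', 'ǰ', 'ΐ', 'ΰ']
for ch in chars:
    up, lo = ch.upper(), ch.lower()
    if len(up) > 1 or len(lo) > 1:
        print(f"{ch!r}: upper={up!r} (len {len(up)}), lower={lo!r} (len {len(lo)})")
```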

Steps/code to reproduce the bug:

```python
import outlines

model = outlines.models.transformers("distilgpt2")

outlines.generate.regex(model, r"(?i:ß)")
```

Expected result:

<outlines.generate.api.SequenceGenerator at 0x...>

Error message:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[100], line 1
----> 1 outlines.generate.regex(model, r"(?i:ß)")

File ~/mambaforge/envs/test/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/generate/regex.py:32, in regex(model, regex_str, sampler)
     11 @singledispatch
     12 def regex(model, regex_str: str, sampler: Sampler = multinomial()):
     13     """Generate structured text in the language of a regular expression.
     14 
     15     Parameters
   (...)
     30 
     31     """
---> 32     fsm = RegexGuide(regex_str, model.tokenizer)
     34     device = model.device
     35     generator = SequenceGenerator(fsm, model, sampler, device)

File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/guide.py:146, in RegexGuide.__init__(self, regex_string, tokenizer)
    136         raise ValueError(
    137             "The vocabulary does not allow us to build a sequence that matches the input regex"
    138         )
    140     return states_to_token_maps, empty_token_ids, regex_fsm.finals
    142 (
    143     self.states_to_token_maps,
    144     self.empty_token_ids,
    145     fsm_finals,
--> 146 ) = create_states_mapping(
    147     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    148 )
    149 self.vocabulary = list(tokenizer.vocabulary.values())
    150 self.eos_token_id = tokenizer.eos_token_id

File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/caching.py:74, in cache.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     72 if cache_key in memory:
     73     return memory[cache_key]
---> 74 result = cached_function(*args, **kwargs)
     75 memory[cache_key] = result
     76 return result

File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/guide.py:121, in RegexGuide.__init__.<locals>.create_states_mapping(regex_string, cacheable_vocabulary)
    117 """Create the variables related to the mapping between states and tokens
    118 The parameters of the function are used for caching purpose
    119 """
    120 regex_pattern = interegular.parse_pattern(regex_string)
--> 121 byte_fsm = make_byte_level_fsm(
    122     regex_pattern.to_fsm().reduce(), keep_utf8=True
    123 )
    124 regex_fsm, _ = make_deterministic_fsm(byte_fsm)
    125 states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
    126     regex_fsm, tokenizer
    127 )

File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/regex.py:223, in make_byte_level_fsm(fsm, keep_utf8)
    221 max_key = max(fsm.alphabet.values())
    222 for symbol, transition_key in fsm.alphabet.items():
--> 223     assert symbol == anything_else or len(symbol) == 1
    224     if symbol == anything_else or ord(symbol) < 0x80:
    225         symbol_mapping[symbol] = transition_key

AssertionError:

Outlines/Python version information:

```
0.0.37
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
```

Context for the issue:

No response

lapp0 commented 5 months ago

Fix

```shell
pip uninstall interegular -y
pip install "git+https://github.com/lapp0/interegular@fix-multi-char"
```

The PR is here, but the maintainer seems occupied at the moment, as other PRs haven't been addressed in a few months.

Explanation

Typically, when a character has length 1, `character.upper()` also has length 1.

In this case it does not:

```python
>>> 'ß'.upper()
'SS'
```

The byte-level FSM doesn't expect multi-character symbols, and interegular handles them inconsistently:

```python
>>> regex_pattern = interegular.parse_pattern(r"(?i:ß)")
>>> list(regex_pattern.to_fsm().strings())
[['SS'], ['ß']]
>>> regex_pattern.to_fsm().accepts('ß')
True
>>> regex_pattern.to_fsm().accepts("SS")
False
```
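The inconsistency follows from how FSMs consume input: transitions fire one character at a time, so a two-character alphabet symbol like `'SS'` can never match any single input character. A minimal sketch of the same failure (a toy class, not interegular's actual API):

```python
# Minimal FSM sketch showing why a multi-character alphabet symbol can never
# be matched when input is consumed one character at a time.
class TinyFSM:
    def __init__(self, transitions, start, finals):
        self.transitions = transitions  # {state: {symbol: next_state}}
        self.start = start
        self.finals = finals

    def accepts(self, text):
        state = self.start
        for ch in text:  # one character per step
            if ch not in self.transitions.get(state, {}):
                return False
            state = self.transitions[state][ch]
        return state in self.finals

# A naive case-insensitive alphabet for 'ß' adds its uppercase form as a
# symbol -- but 'ß'.upper() is the two-character string 'SS', which becomes
# a single "symbol" no single input character can ever equal.
fsm = TinyFSM({0: {'ß': 1, 'SS': 1}}, start=0, finals={1})

assert fsm.accepts('ß') is True
assert fsm.accepts('SS') is False  # 'S' != 'SS', so the transition never fires
```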

Therefore, the bug is in interegular.

I'll make an upstream PR to ensure that results of `str.upper()` and `str.lower()` are only used when they span a single character. Pruning `SS` is consistent with `re`'s behavior:

```python
>>> import re
>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None
```
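The fix described above can be sketched as: when computing case-insensitive alternatives for a character, discard any mapping that expands to more than one character. A sketch of the idea (hypothetical helper, not interegular's actual code):

```python
def case_alternatives(ch: str) -> set:
    """Case-insensitive alternatives for a single character, pruning any
    case mapping that expands to multiple characters (a sketch of the fix
    described above, not interegular's actual implementation)."""
    alternatives = {ch}
    for mapped in (ch.upper(), ch.lower()):
        if len(mapped) == 1:  # drop 'SS'-style multi-character expansions
            alternatives.add(mapped)
    return alternatives

print(case_alternatives('a'))   # {'a', 'A'}
print(case_alternatives('ß'))   # only {'ß'}: 'SS' is pruned, matching re
```

With this pruning, every symbol in the FSM alphabet has length 1, so the `len(symbol) == 1` assertion in `make_byte_level_fsm` holds.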