Open cifkao opened 7 months ago
pip uninstall interegular -y
pip install "git+https://github.com/lapp0/interegular@fix-multi-char"
The PR is here, but the maintainer seems occupied at the moment, as other PRs haven't been addressed in a few months.
Typically, when the length of character is 1, character.upper() also has length 1.
In this case:
>>> 'ß'.upper()
'SS'
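To get a feel for how many characters behave this way, here is a minimal sketch that scans every code point for case mappings that are not exactly one character long. The function name multi_char_case_mappings is made up for illustration and is not part of interegular or outlines. The exact set found depends on the Unicode version bundled with your Python build.

```python
import sys


def multi_char_case_mappings():
    """Return {char: (upper, lower)} for every code point whose
    str.upper() or str.lower() result is not exactly one character.

    Hypothetical helper for illustration only.
    """
    found = {}
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        upper, lower = ch.upper(), ch.lower()
        if len(upper) != 1 or len(lower) != 1:
            found[ch] = (upper, lower)
    return found


mappings = multi_char_case_mappings()
print(f"{len(mappings)} code points have a multi-character case mapping")
```

Looking up a single character, e.g. mappings["ß"], returns its (upper, lower) pair, which makes it easy to see which direction of the mapping expands.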
The byte-level FSM doesn't expect multiple characters, and interegular handles multi-character case mappings inconsistently:
>>> regex_pattern = interegular.parse_pattern(r"(?i:ß)")
>>> list(regex_pattern.to_fsm().strings())
[['SS'], ['ß']]
>>> regex_pattern.to_fsm().accepts('ß')
True
>>> regex_pattern.to_fsm().accepts("SS")
False
Therefore, the bug is in interegular.
I'll make an upstream PR to ensure the results of str.upper() and str.lower() only span a single character. Pruning SS is consistent with re's behavior:
>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None
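A rough sketch of the pruning idea, assuming it amounts to dropping any case variant that is longer than one character. This is not the code from the interegular PR, and single_char_case_variants is a hypothetical name. The loop at the end only checks that every variant the helper keeps is also matched by re. It does not claim the helper finds everything re would match.

```python
import re


def single_char_case_variants(ch):
    """Hypothetical helper: return the case variants of a single
    character, keeping only variants that are one character long."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    candidates = {ch, ch.upper(), ch.lower()}
    # Drop expansions such as 'ß'.upper() == 'SS'.
    return {c for c in candidates if len(c) == 1}


# Every variant that survives pruning should match the
# case-insensitive pattern built from the original character.
for ch in ("a", "ß"):
    pattern = f"(?i:{re.escape(ch)})"
    for variant in single_char_case_variants(ch):
        assert re.fullmatch(pattern, variant) is not None
```

With this pruning, 'ß' keeps only itself as a variant, so 'SS' is never generated and never needs to be accepted.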
Describe the issue as clearly as possible:
Specific characters trigger an AssertionError in make_byte_level_fsm if included in a case-insensitive regex group (e.g. (?i:ß)). So far, I have found that each of the following characters triggers the error:
¤
ß
İ
ʼn
ǰ
ΐ
ΰ
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Context for the issue:
No response