dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

`build_regex_from_schema`: Implementation of `pattern` disagrees with JSON schema spec #1083

Open mwootten opened 3 months ago

mwootten commented 3 months ago

Describe the issue as clearly as possible:

The JSON schema specification states, of the pattern keyword:

A string instance is considered valid if the regular expression matches the instance successfully. Recall: regular expressions are not implicitly anchored.

This means that, for instance, `{"type": "string", "pattern": "abcd"}` should match `"before abcd after"`. However, the regular expression Outlines currently generates treats the pattern as implicitly anchored, so that schema is interpreted as if it were `{"type": "string", "pattern": "^abcd$"}`.
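The difference between the two readings can be sketched with Python's standard `re` module. This is only an illustration: JSON Schema patterns follow the ECMA-262 dialect, which Python's `re` approximates for a simple pattern like this one, and the helper names `spec_valid` and `anchored_valid` are made up for the example.

```python
import re


def spec_valid(pattern: str, instance: str) -> bool:
    """JSON Schema reading: the pattern may match anywhere in the string."""
    return re.search(pattern, instance) is not None


def anchored_valid(pattern: str, instance: str) -> bool:
    """Implicitly anchored reading: the pattern must cover the whole string."""
    return re.fullmatch(pattern, instance) is not None


print(spec_valid("abcd", "before abcd after"))      # True
print(anchored_valid("abcd", "before abcd after"))  # False
print(spec_valid("^abcd$", "before abcd after"))    # False
```

Under the spec's reading, a schema author who wants an exact match has to write the anchors explicitly, as in the third call.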

Steps/code to reproduce the bug:

import json
import re

from outlines.generate.json import build_regex_from_schema

# Schema whose pattern has no explicit anchors
sample_schema = json.dumps({'type': 'string', 'pattern': 'abcd'})
sample_regex = build_regex_from_schema(sample_schema)

# re.match returns a Match object or None, so convert to bool before printing
print(bool(re.match(sample_regex, json.dumps('abcd'))))  # => True
print(bool(re.match(sample_regex, json.dumps('before abcd after'))))  # => False

Expected result:

The second example should print True, not False.

Error message:

No response

Outlines/Python version information:

Version information

```
0.0.47.dev37+g26e2934
Python 3.12.4 (main, Jun 7 2024, 00:00:00) [GCC 13.3.1 20240522 (Red Hat 13.3.1-1)]
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
attrs==23.2.0
beartype==0.15.0
certifi==2024.6.2
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
cloudpickle==3.0.0
cmake==3.29.5.1
coverage==7.5.3
datasets==2.20.0
diff_cover==9.0.0
dill==0.3.8
diskcache==5.6.3
distlib==0.3.8
distro==1.9.0
filelock==3.15.1
flake8==7.0.0
frozenlist==1.4.1
fsspec==2024.5.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.23.4
identify==2.5.36
idna==3.7
iniconfig==2.0.0
interegular==0.3.3
Jinja2==3.1.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
lark==1.1.9
llama_cpp_python==0.2.78
llvmlite==0.43.0
MarkupSafe==2.1.5
mccabe==0.7.0
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
nodeenv==1.9.1
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
openai==1.34.0
outlines @ ~/Projects/outlines
packaging==24.1
pandas==2.2.2
platformdirs==4.2.2
pluggy==1.5.0
pre-commit==3.7.1
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycodestyle==2.11.1
pycountry==24.6.1
pydantic==2.7.4
pydantic_core==2.18.4
pyflakes==3.2.0
Pygments==2.18.0
pytest==8.2.2
pytest-benchmark==4.0.0
pytest-cov==5.0.0
pytest-mock==3.14.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
responses==0.25.3
rpds-py==0.18.1
safetensors==0.4.3
setuptools==70.0.0
six==1.16.0
sniffio==1.3.1
sympy==1.12.1
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.2
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.1
virtualenv==20.26.2
wheel==0.43.0
xxhash==3.4.1
yarl==1.9.4
```

Context for the issue:

Bringing Outlines into compliance with the JSON schema spec here would unfortunately be a breaking change, as it's likely that at least some users have been expecting the patterns they wrote to be exact matches rather than partial matches.

aw632 commented 3 months ago

This is probably a side effect of interegular, which has implicit anchoring: https://github.com/MegaIng/interegular/issues/10
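If the underlying engine always anchors, one way to get the spec's semantics would be to rewrite the pattern before handing it over. The sketch below is an assumption about how that could look, not code from Outlines or interegular: `to_full_match` is a hypothetical helper, it only recognizes a leading `^` and a trailing unescaped `$`, and it does not handle anchors nested inside groups or alternations (for example `^a|b`).

```python
import re


def to_full_match(pattern: str) -> str:
    """Rewrite an unanchored pattern as a whole-string pattern (hypothetical helper)."""
    starts_anchored = pattern.startswith("^")
    body = pattern[1:] if starts_anchored else pattern

    # A trailing "$" is an anchor only if it is not escaped, i.e. it is
    # preceded by an even number of backslashes.
    head = body[:-1]
    trailing_backslashes = len(head) - len(head.rstrip("\\"))
    ends_anchored = body.endswith("$") and trailing_backslashes % 2 == 0
    if ends_anchored:
        body = head

    # Pad each side that lacks an anchor so the pattern can float in the string.
    prefix = "" if starts_anchored else ".*"
    suffix = "" if ends_anchored else ".*"
    return f"{prefix}(?:{body}){suffix}"


print(to_full_match("abcd"))    # .*(?:abcd).*
print(to_full_match("^abcd$"))  # (?:abcd)
print(bool(re.fullmatch(to_full_match("abcd"), "before abcd after")))    # True
print(bool(re.fullmatch(to_full_match("^abcd$"), "before abcd after")))  # False
```

Here `re.fullmatch` stands in for an implicitly anchored engine. Note that `.` does not match newlines by default in Python, so a real implementation would need to decide how the padding treats them.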

lapp0 commented 2 months ago

> Bringing Outlines into compliance with the JSON schema spec here would unfortunately be a breaking change, as it's likely that at least some users have been expecting the patterns they wrote to be exact matches rather than partial matches.

Possibly; however, this feature didn't even work until 2 months ago: https://github.com/outlines-dev/outlines/commit/60e89f5706e3d0f9837e271e04a39fb6e81d92df

A good approach might be to implement this and, whenever a pattern is used, emit a warning that describes the new behavior and how to mitigate it.
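A minimal sketch of such a warning, assuming it would be raised wherever the schema's `pattern` is processed. The function name `warn_if_unanchored`, the message text, and the choice of `UserWarning` are all hypothetical, and the `endswith("$")` check is a simple heuristic that does not account for an escaped `\$`.

```python
import warnings


def warn_if_unanchored(pattern: str) -> None:
    """Warn when a schema pattern lacks explicit anchors (hypothetical check)."""
    if not (pattern.startswith("^") and pattern.endswith("$")):
        warnings.warn(
            f"pattern {pattern!r} has no explicit anchors and will match "
            f"anywhere in the string, as the JSON Schema spec requires. "
            f"Use '^{pattern}$' to require an exact match.",
            UserWarning,
            stacklevel=2,
        )


warn_if_unanchored("abcd")    # emits a UserWarning
warn_if_unanchored("^abcd$")  # silent
```

The mitigation for users who relied on the old behavior is then a one-line change to their schema: wrap the pattern in `^` and `$`.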