dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Cannot create any JSON generator with Mistral #710

Closed posionus closed 5 months ago

posionus commented 8 months ago

### Describe the issue as clearly as possible:

I cannot create an Outlines JSON generator with a Mistral model and a Pydantic schema.

### Steps/code to reproduce the bug:

```python
import outlines
from pydantic import BaseModel

class DumbModel(BaseModel):
    dumb: str

ol_model = outlines.models.transformers("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
generator = outlines.generate.json(ol_model, DumbModel)
```

### Expected result:

```shell
Should complete without error.
```

### Error message:

```shell
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-368285070156900>, line 4
      1 import outlines
      3 ol_model = outlines.models.transformers("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
----> 4 generator = outlines.generate.json(ol_model, DumbModel)

File /usr/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6b70ef65-4326-43e1-96be-c40c32b92a8e/lib/python3.10/site-packages/outlines/generate/json.py:49, in json(model, schema_object, sampler, whitespace_pattern)
     47     schema = pyjson.dumps(schema_object.model_json_schema())
     48     regex_str = build_regex_from_schema(schema, whitespace_pattern)
---> 49     generator = regex(model, regex_str, sampler)
     50     generator.format_sequence = lambda x: schema_object.parse_raw(x)
     51 elif callable(schema_object):

File /usr/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6b70ef65-4326-43e1-96be-c40c32b92a8e/lib/python3.10/site-packages/outlines/generate/regex.py:35, in regex(model, regex_str, sampler)
     14 @singledispatch
     15 def regex(model, regex_str: str, sampler: Sampler = multinomial()):
     16     """Generate structured text in the language of a regular expression.
     17 
     18     Parameters
   (...)
     33 
     34     """
---> 35     fsm = RegexFSM(regex_str, model.tokenizer)
     37     device = model.device
     38     generator = SequenceGenerator(fsm, model, sampler, device)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6b70ef65-4326-43e1-96be-c40c32b92a8e/lib/python3.10/site-packages/outlines/fsm/fsm.py:121, in RegexFSM.__init__(self, regex_string, tokenizer)
    115         raise ValueError(
    116             "The vocabulary does not allow us to build a sequence that matches the input regex"
    117         )
    119     return states_to_token_maps, empty_token_ids
--> 121 self.states_to_token_maps, self.empty_token_ids = create_states_mapping(
    122     regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    123 )
    124 self.vocabulary = tokenizer.vocabulary.values()
    125 self.eos_token_id = tokenizer.eos_token_id

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6b70ef65-4326-43e1-96be-c40c32b92a8e/lib/python3.10/site-packages/outlines/caching.py:74, in cache.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     72 if cache_key in memory:
     73     return memory[cache_key]
---> 74 result = cached_function(*args, **kwargs)
     75 memory[cache_key] = result
     76 return result

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-6b70ef65-4326-43e1-96be-c40c32b92a8e/lib/python3.10/site-packages/outlines/fsm/fsm.py:115, in RegexFSM.__init__.<locals>.create_states_mapping(regex_string, cacheable_vocabulary)
    108 # We make sure that it is possible to generate strings in the language
    109 # of the regular expression with the tokens present in the model's
    110 # vocabulary.
    111 if not any(
    112     regex_fsm.finals.intersection(v.values())
    113     for v in states_to_token_maps.values()
    114 ):
--> 115     raise ValueError(
    116         "The vocabulary does not allow us to build a sequence that matches the input regex"
    117     )
    119 return states_to_token_maps, empty_token_ids

ValueError: The vocabulary does not allow us to build a sequence that matches the input regex
```

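The check that raises in the traceback above (`fsm.py`, lines 108 to 117) can be illustrated with a toy example. This is only a sketch, not the Outlines implementation: the hand-written automaton for the regex `ab`, the helper names `walk`, `build_states_to_token_maps`, and `vocabulary_can_match`, and the tiny vocabularies are all assumptions made for illustration. Only the final `any(...intersection(...))` expression mirrors the code shown in the traceback.

```python
from typing import Dict, Optional

# Hypothetical hand-built automaton for the regex "ab":
# state 0 --a--> state 1 --b--> state 2 (final).
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2}
STATES = {0, 1, 2}
FINALS = {2}


def walk(state: int, token: str) -> Optional[int]:
    """Consume a token character by character; return the end state or None."""
    current: Optional[int] = state
    for ch in token:
        current = TRANSITIONS.get((current, ch))
        if current is None:
            return None
    return current


def build_states_to_token_maps(vocabulary: Dict[str, int]) -> Dict[int, Dict[int, int]]:
    """Map each state to {token_id: next_state} for tokens the automaton accepts there."""
    maps: Dict[int, Dict[int, int]] = {}
    for start in STATES:
        for token, token_id in vocabulary.items():
            if not token:
                continue  # empty tokens never advance the automaton
            end = walk(start, token)
            if end is not None:
                maps.setdefault(start, {})[token_id] = end
    return maps


def vocabulary_can_match(vocabulary: Dict[str, int]) -> bool:
    """Same shape as the check in the traceback: does any transition land in a final state?"""
    maps = build_states_to_token_maps(vocabulary)
    return any(FINALS.intersection(v.values()) for v in maps.values())


if __name__ == "__main__":
    print(vocabulary_can_match({"a": 0, "b": 1}))
    print(vocabulary_can_match({"a": 0, "c": 1}))
```

In this toy version a vocabulary that lacks any token ending in `b` fails the check, which is the situation the `ValueError` message describes. Note that the check only asks whether some transition reaches a final state, not whether that transition is reachable from the start state.
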
### Outlines/Python version information:

0.0.33
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
absl-py==1.0.0
accelerate==0.27.2
aiohttp==3.8.5
aiosignal==1.3.1
annotated-types==0.6.0
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.0
autoawq==0.2.2
autoawq_kernels==0.0.6
azure-core==1.29.1
azure-cosmos==4.3.1
azure-storage-blob==12.17.0
azure-storage-file-datalake==12.12.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.11.1
bitsandbytes==0.42.0
black==22.6.0
bleach==4.1.0
blinker==1.4
blis==0.7.10
boto3==1.24.28
botocore==1.27.28
cachetools==4.2.4
catalogue==2.0.9
category-encoders==2.6.1
certifi==2022.9.14
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
click==8.0.4
cloudpickle==2.0.0
cmdstanpy==1.1.0
confection==0.1.1
configparser==5.2.0
convertdate==2.4.0
cryptography==37.0.1
cycler==0.11.0
cymem==2.0.7
Cython==0.29.32
dacite==1.8.1
dataclasses-json==0.5.14
datasets==2.17.1
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.4
diskcache==5.6.1
distlib==0.3.7
distro==1.7.0
distro-info==1.1+ubuntu0.2
docstring-to-markdown==0.12
einops==0.7.0
entrypoints==0.4
ephem==4.1.4
evaluate==0.4.0
executing==1.2.0
facets-overview==1.0.3
fastapi==0.98.0
fastjsonschema==2.18.0
fasttext==0.9.2
filelock==3.13.1
flash-attn==2.5.0
flatbuffers==23.5.26
fonttools==4.25.0
frozenlist==1.4.0
fsspec==2023.10.0
future==0.18.2
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.27
google-api-core==2.8.2
google-auth==1.33.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.3
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.5.0
googleapis-common-protos==1.56.4
greenlet==1.1.1
grpcio==1.48.1
grpcio-status==1.48.1
gunicorn==20.1.0
gviz-api==1.10.0
h11==0.14.0
h5py==3.7.0
holidays==0.27.1
horovod==0.28.1
htmlmin==0.1.12
httplib2==0.20.2
httptools==0.6.0
huggingface-hub==0.20.3
idna==3.3
ImageHash==4.3.1
imbalanced-learn==0.10.1
importlib-metadata==4.11.3
importlib-resources==6.0.1
interegular==0.3.3
ipykernel==6.17.1
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
isodate==0.6.1
itsdangerous==2.0.1
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.3
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonschema==4.16.0
jupyter-client==7.3.4
jupyter_core==4.11.2
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
keras==2.11.0
keyring==23.5.0
kiwisolver==1.4.2
langchain==0.0.217
langchainplus-sdk==0.0.20
langcodes==3.3.0
lark==1.1.9
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.3
libclang==15.0.6.1
librosa==0.10.0
lightgbm==3.3.5
llvmlite==0.38.0
LunarCalendar==0.0.9
Mako==1.2.0
Markdown==3.3.4
MarkupSafe==2.1.5
marshmallow==3.20.1
matplotlib==3.5.2
matplotlib-inline==0.1.6
mccabe==0.7.0
mistune==0.8.4
mleap==0.20.0
mlflow-skinny==2.5.0
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multimethod==1.9.1
multiprocess==0.70.12.2
murmurhash==1.0.9
mypy-extensions==0.4.3
nbclient==0.5.13
nbconvert==6.4.4
nbformat==5.5.0
nest-asyncio==1.5.5
networkx==3.2.1
ninja==1.11.1.1
nltk==3.7
nodeenv==1.8.0
notebook==6.4.12
numba==0.55.1
numexpr==2.8.4
numpy==1.21.5
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.7.0.84
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.19.3
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.0
openai==0.27.8
openapi-schema-pydantic==1.2.4
opt-einsum==3.3.0
outlines==0.0.33
packaging==23.2
pandas==1.4.4
pandocfilters==1.5.0
paramiko==2.9.2
parso==0.8.3
pathspec==0.9.0
pathy==0.10.2
patsy==0.5.2
peft==0.8.2
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==9.2.0
platformdirs==2.5.2
plotly==5.9.0
pluggy==1.0.0
pmdarima==2.0.3
pooch==1.7.0
preshed==3.0.8
prompt-toolkit==3.0.36
prophet==1.1.4
protobuf==3.19.4
psutil==5.9.0
psycopg2==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==15.0.0
pyarrow-hotfix==0.5
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pycparser==2.21
pydantic==2.6.2
pydantic_core==2.16.3
pyflakes==3.0.1
Pygments==2.11.2
PyGObject==3.42.1
PyJWT==2.3.0
PyMeeus==0.5.12
PyNaCl==1.5.0
pyodbc==4.0.32
pyparsing==3.0.9
pyright==1.1.294
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu2
python-dateutil==2.8.2
python-dotenv==1.0.0
python-editor==1.0.4
python-lsp-jsonrpc==1.0.0
python-lsp-server==1.7.1
pytoolconfig==1.2.2
pytz==2022.1
PyWavelets==1.3.0
PyYAML==6.0
pyzmq==23.2.0
referencing==0.33.0
regex==2022.7.9
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rope==1.7.0
rpds-py==0.18.0
rsa==4.9
s3transfer==0.6.0
safetensors==0.4.2
scikit-learn==1.1.1
scipy==1.9.1
seaborn==0.11.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
shap==0.41.0
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
soundfile==0.12.1
soupsieve==2.3.1
soxr==0.3.6
spacy==3.5.3
spacy-legacy==3.0.12
spacy-loggers==1.0.4
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.7
ssh-import-id==5.11
stack-data==0.6.2
starlette==0.27.0
statsmodels==0.13.2
sympy==1.12
tabulate==0.8.10
tangled-up-in-unicode==0.2.0
tenacity==8.1.0
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.11.2
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.1
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.33.0
termcolor==2.3.0
terminado==0.13.1
testpath==0.6.0
thinc==8.1.12
threadpoolctl==2.2.0
tiktoken==0.4.0
tokenize-rt==4.2.1
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchvision==0.14.1+cu117
tornado==6.1
tqdm==4.64.1
traitlets==5.1.1
transformers==4.38.1
triton==2.2.0
typeguard==2.13.3
typer==0.7.0
typing-inspect==0.9.0
typing_extensions==4.10.0
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.11
uvicorn==0.23.2
uvloop==0.17.0
virtualenv==20.16.3
visions==0.7.5
wadllib==1.3.6
wasabi==1.1.2
watchfiles==0.19.0
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==0.58.0
websockets==11.0.3
Werkzeug==2.0.3
whatthepatch==1.0.2
widgetsnbextension==3.6.1
wordcloud==1.9.2
wrapt==1.14.1
xgboost==1.7.6
xxhash==3.3.0
yapf==0.31.0
yarl==1.9.2
ydata-profiling==4.2.0
zipp==3.8.0
zstandard==0.22.0

### Context for the issue:

The library is unusable for me in the current version.

rlouf commented 8 months ago

Is this with a fresh virtual environment? I was unable to reproduce the issue.
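For anyone retrying the reproduction in isolation, a fresh environment can be created with the standard-library `venv` module. This is a minimal sketch: the helper name `create_fresh_env` and the directory name are hypothetical, and which packages to install afterwards (for example `outlines==0.0.33`, `transformers`, and `autoawq`, as listed in the report) is an assumption about what the reproduction needs.

```python
import os
import venv
from pathlib import Path


def create_fresh_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create an empty virtual environment and return the path to its interpreter."""
    # clear=True wipes any previous contents so the environment really is fresh.
    venv.create(env_dir, with_pip=with_pip, clear=True)
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return env_dir / bin_dir / exe


# Example (not run here): create the environment, then install into it with
#   <returned interpreter> -m pip install outlines==0.0.33
# python_path = create_fresh_env(Path("outlines-repro-env"))
```

Running the reproduction script with the returned interpreter keeps it separate from the large pre-installed package set shown in the version listing.
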

lapp0 commented 5 months ago

@posionus could you please confirm whether this error still occurs in the latest outlines version?
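One way to confirm which version is installed before re-running the reproduction is sketched below. The helper names are hypothetical, and the version parsing is deliberately simplistic (it keeps only the leading numeric components), so treat it as an illustration rather than a full version comparison.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

REPORTED_VERSION = "0.0.33"  # version given in the original report


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, e.g. '0.1.0rc1' -> (0, 1, 0)."""
    parts = []
    for part in text.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than_report(current: str, reported: str = REPORTED_VERSION) -> bool:
    """True if `current` is strictly newer than the version in the report."""
    return parse_version(current) > parse_version(reported)


if __name__ == "__main__":
    current = installed_version("outlines")
    if current is None:
        print("outlines is not installed in this environment")
    else:
        print(f"outlines {current}; newer than report: {is_newer_than_report(current)}")
```
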

posionus commented 5 months ago

I'm not having this issue anymore. I think it was fixed a while ago. You can close this.