dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0
9.43k stars 479 forks

Spaces in generated text despite float-format constraint cause convert-to-float failure #1178

Open katossky opened 1 month ago

katossky commented 1 month ago

Describe the issue as clearly as possible:

With google/mt5-large and the float format constraint, the generated text contains spaces, so the final conversion to float fails.

Steps/code to reproduce the bug:

import outlines as ol
from transformers import AutoModelForSeq2SeqLM

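# Load the seq2seq model through the outlines transformers wrapper.
llm = ol.models.transformers(
    "google/mt5-large",
    device="mps",
    model_class=AutoModelForSeq2SeqLM,
)

# Constrain generation to the float format, then sample once.
generator = ol.generate.format(llm, float)
generator("Gimme your favourite number:", seed=3)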

Expected result:

No error

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/katossky/Projets/leximpact-rapport/.venv/lib/python3.12/site-packages/outlines/generate/api.py", line 512, in __call__
    return self._format(completions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/katossky/Projets/leximpact-rapport/.venv/lib/python3.12/site-packages/outlines/generate/api.py", line 488, in _format
    return self.format_sequence(sequences)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/katossky/Projets/leximpact-rapport/.venv/lib/python3.12/site-packages/outlines/fsm/types.py", line 45, in float_format_fn
    return float(sequence)
           ^^^^^^^^^^^^^^^
ValueError: could not convert string to float: '99.3926785111 3 002 16000000000000'
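
For readers following the traceback: Python's built-in `float()` tolerates leading and trailing whitespace but rejects whitespace inside the number, which is why the string above fails. The sketch below reproduces that behaviour with the reported string. `lenient_float` is a hypothetical helper written for illustration, not part of outlines, and stripping whitespace only hides the symptom; it does not explain why the constraint let spaces through.

```python
def lenient_float(text: str) -> float:
    """Hypothetical workaround: drop all whitespace, then convert."""
    # str.split() with no argument splits on any run of whitespace.
    return float("".join(text.split()))


# String taken from the ValueError in the traceback above.
raw = "99.3926785111 3 002 16000000000000"

# Surrounding whitespace is accepted by float() ...
padded_ok = float("  1.5\n") == 1.5

# ... but interior whitespace is not.
try:
    float(raw)
    strict_failed = False
except ValueError:
    strict_failed = True

# The workaround recovers a number, at the cost of masking the bug.
recovered = lenient_float(raw)
```

Whether the recovered value is what the model "meant" is an open question, so this is better suited to diagnosing the output than to production use.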

Outlines/Python version information:

```
0.0.47.dev87+g6035e86
Python 3.12.5 (main, Aug 6 2024, 19:08:49) [Clang 15.0.0 (clang-1500.3.9.4)]
absl-py==2.1.0
accelerate==0.34.2
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
airportsdata==20240806
annotated-types==0.7.0
anyio==4.4.0
appnope==0.1.4
asttokens==2.4.1
astunparse==1.6.3
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.3.2
chex==0.1.86
cloudpickle==3.0.0
comm==0.2.2
datasets==3.0.0
debugpy==1.8.5
decorator==5.1.1
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
etils==1.9.4
executing==2.1.0
filelock==3.16.0
flatbuffers==24.3.25
flax==0.9.0
frozenlist==1.4.1
fsspec==2024.6.1
gast==0.6.0
google-pasta==0.2.0
grpcio==1.66.1
guidance==0.1.16
h11==0.14.0
h5py==3.11.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.24.7
humanize==4.10.0
idna==3.9
importlib_resources==6.4.5
interegular==0.3.3
ipykernel==6.29.5
ipython==8.27.0
jax==0.4.31
jaxlib==0.4.31
jedi==0.19.1
Jinja2==3.1.4
jiter==0.5.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter_client==8.6.3
jupyter_core==5.7.2
keras==3.5.0
lark==1.2.2
libclang==18.1.1
llama_cpp_python==0.2.90
llvmlite==0.43.0
lmql==0.7.3
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
ml-dtypes==0.4.1
mpmath==1.3.0
msgpack==1.1.0
multidict==6.1.0
multiprocess==0.70.16
namex==0.0.8
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
numba==0.60.0
numpy==1.26.4
openai==1.45.0
opt-einsum==3.3.0
optax==0.2.3
optree==0.12.1
orbax-checkpoint==0.6.3
ordered-set==4.1.0
outlines @ git+https://github.com/dottxt-ai/outlines@6035e86ac8089d4f8aeab07ea116093a2ed0e03e
packaging==24.1
pandas==2.2.2
parso==0.8.4
pexpect==4.9.0
platformdirs==4.3.3
prompt_toolkit==3.0.47
protobuf==4.25.4
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyairports==2.1.1
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.9.1
pydantic_core==2.23.3
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
quanto==0.2.0
referencing==0.35.1
regex==2024.9.11
requests==2.32.3
rich==13.8.1
rpds-py==0.20.0
safetensors==0.4.5
scipy==1.14.1
sentencepiece==0.2.0
setuptools==74.1.2
six==1.16.0
sniffio==1.3.1
stack-data==0.6.3
sympy==1.13.2
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorflow==2.17.0
tensorstore==0.1.65
termcolor==2.4.0
tf_keras==2.17.0
tiktoken==0.7.0
tokenizers==0.19.1
toolz==0.12.1
torch==2.4.1
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.2
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.3
wcwidth==0.2.13
Werkzeug==3.0.4
wheel==0.44.0
wrapt==1.16.0
xxhash==3.5.0
yarl==1.11.1
zipp==3.20.2
```

Context for the issue:

It defeats the purpose of using outlines to get outputs that can reliably be coerced to the requested type. I have no particular stake in google/mt5-large (I am comparing multilingual LLMs for an academic project), but the same seems to happen across the whole T5 and mT5 family, which suggests there might be a larger problem (?).

katossky commented 1 month ago

I don't know if it is related, but with meta-llama/Meta-Llama-3.1-8B-Instruct (lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF / Meta-Llama-3.1-8B-Instruct-Q8_0.gguf) I even get whole word tokens in the output.

ValueError: could not convert string to float: '581 Volunteer Volunteer Volunteer Volunteer Volunteer Volunteer Volunteer Volunteer'
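
Both reported strings can be checked against a float pattern independently of outlines, which helps separate "the constraint was not enforced" from "the conversion is too strict". The sketch below is only illustrative: `FLOAT_PATTERN` is an assumed, simplified pattern and not necessarily the regex outlines compiles for `float`, and `satisfies_float_pattern` is a hypothetical helper.

```python
import re

# Assumed, simplified float pattern for illustration only:
# optional sign, then digits with an optional fraction, or a bare fraction.
FLOAT_PATTERN = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def satisfies_float_pattern(text: str) -> bool:
    """Return True only if the whole string matches the assumed pattern."""
    return FLOAT_PATTERN.fullmatch(text) is not None


# The two strings quoted in the ValueError messages in this thread.
reported = [
    "99.3926785111 3 002 16000000000000",
    "581 " + " ".join(["Volunteer"] * 8),
]

# Map each reported output to whether it is a well-formed float string.
results = {text: satisfies_float_pattern(text) for text in reported}
```

Under this pattern neither string is a valid float, while their leading prefixes (`99.3926785111` and `581`) are, which is consistent with generation continuing past a point where a valid float had already been produced.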