dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Enum type with Non-ASCII character not working properly #1092

Open duming opened 3 months ago

duming commented 3 months ago

Describe the issue as clearly as possible:

When using JSON-structured generation, enum types with non-ASCII characters do not work properly: non-ASCII characters (such as Chinese) are forcibly encoded to ASCII escape sequences. This leads to much slower generation and much worse output quality.

The example code in the section below does not raise an error; it is just a minimal example for debugging. The problem is clearer when inspecting the direct output of the LLM model, for example at line #225 in api.py.

In this example, the expected output is '开心', which is 2 token ids; the actual output is "\u5f00\u5fc3", which is 14 token ids. The expected regex_str is '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'. The actual regex_str is '\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'.

Although after format_sequence the output seems to be converted back to the correct characters, this behavior is still incorrect, for two reasons:

  1. The token length increases from 2 to 14, so the time cost increases.
  2. LLM models are not trained to sample strings like "\\u5f00\\u5fc3". Such text is guaranteed to be out of distribution and results in more hallucinations.
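The escaping blow-up behind both points can be reproduced with the standard library alone, independent of outlines:

```python
import json

member = "开心"

# Default json.dumps escapes non-ASCII characters to \uXXXX sequences:
escaped = json.dumps(member)
print(escaped)       # "\u5f00\u5fc3"
print(len(escaped))  # 14

# With ensure_ascii=False the original characters survive intact:
plain = json.dumps(member, ensure_ascii=False)
print(plain)       # "开心"
print(len(plain))  # 4
```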

Quick fix: replacing this line https://github.com/outlines-dev/outlines/blob/5e8f7709e3cecd02943120ed01420f00159cedbc/outlines/fsm/json_schema.py#L275 with choices.append(re.escape(json.dumps(choice, ensure_ascii=False))) fixes the problem, but I don't know whether it will cause any other problems.
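A minimal sketch of what the proposed change does to the enum alternation (build_enum_regex is a hypothetical helper for illustration; the real code lives in the linked to_regex branch):

```python
import json
import re


def build_enum_regex(values, ensure_ascii=True):
    # Hypothetical helper mirroring the enum branch in
    # outlines/fsm/json_schema.py: each member is JSON-encoded,
    # regex-escaped, and joined into an alternation.
    choices = [re.escape(json.dumps(v, ensure_ascii=ensure_ascii)) for v in values]
    return "(" + "|".join(choices) + ")"


values = ["开心", "难过", "普通"]
print(build_enum_regex(values))                      # \uXXXX escape sequences
print(build_enum_regex(values, ensure_ascii=False))  # ("开心"|"难过"|"普通")
```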

Steps/code to reproduce the bug:

import json
from enum import Enum

import outlines
from pydantic import BaseModel


class Emotion(str, Enum):
    happy = "开心"
    upset = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

model = outlines.models.transformers("/root/Phi-3-mini-128k-instruct",device='cuda:0')
query = "How do you feel today?"
print(query)

generator = outlines.generate.json(model, json.dumps(PersonInfo.model_json_schema(), ensure_ascii=False))
ret = generator(query)
print(ret)

Expected result:

'\\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\\}'
['{ "心情" :"普通" }']
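The expected regex and output above are consistent with each other, which can be verified standalone (pattern and sample taken verbatim from the issue):

```python
import re

# Expected regex_str from the issue, written as a raw string
pattern = r'\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'

# The expected generation matches the expected pattern
assert re.fullmatch(pattern, '{ "心情" :"普通" }')
```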

Error message:

No response

Outlines/Python version information:

Version information

``` 0.0.46 Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] aiohappyeyeballs==2.3.5 aiohttp==3.10.2 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 async-timeout==4.0.3 attrs==24.2.0 certifi==2024.7.4 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 cmake==3.30.2 datasets==2.20.0 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 exceptiongroup==1.2.2 fastapi==0.112.0 filelock==3.15.4 frozenlist==1.4.1 fsspec==2024.5.0 h11==0.14.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.24.5 idna==3.7 interegular==0.3.3 Jinja2==3.1.4 jiter==0.5.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 lark==1.1.9 llvmlite==0.43.0 lm-format-enforcer==0.10.3 MarkupSafe==2.1.5 mpmath==1.3.0 msgpack==1.0.8 multidict==6.0.5 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.3 ninja==1.11.1.1 numba==0.60.0 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.555.43 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.6.20 nvidia-nvtx-cu12==12.1.105 openai==1.40.2 outlines==0.0.46 packaging==24.1 pandas==2.2.2 pillow==10.4.0 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.20.0 protobuf==5.27.3 psutil==6.0.0 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==17.0.0 pyarrow-hotfix==0.6 pycountry==24.6.1 pydantic==2.8.2 pydantic_core==2.20.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 pytorch-fast-transformers==0.4.0 pytz==2024.1 PyYAML==6.0.2 pyzmq==26.1.0 ray==2.34.0 referencing==0.35.1 regex==2024.7.24 requests==2.32.3 rpds-py==0.20.0 safetensors==0.4.4 sentencepiece==0.2.0 six==1.16.0 sniffio==1.3.1 starlette==0.37.2 sympy==1.13.1 tiktoken==0.7.0 tokenizers==0.19.1 torch==2.4.0 torchvision==0.19.0 tqdm==4.66.5 transformers==4.44.0 triton==3.0.0 typing_extensions==4.12.2 
tzdata==2024.1 urllib3==2.2.2 uvicorn==0.30.5 uvloop==0.19.0 vllm==0.5.4 vllm-flash-attn==2.6.1 watchfiles==0.23.0 websockets==12.0 xformers==0.0.27.post2 xxhash==3.4.1 yarl==1.9.4 ```

Context for the issue:

Reduced inference speed; reduced LLM output quality.

lapp0 commented 3 months ago

Thanks for the well documented issue!

This appears to be an issue with our enum handling in json_schema.py, specifically the call to json.dumps:

https://github.com/outlines-dev/outlines/blob/60e89f5706e3d0f9837e271e04a39fb6e81d92df/outlines/fsm/json_schema.py#L275

>>> PersonInfo.model_json_schema()
{'$defs': {'Emotion': {'enum': ['开心', '难过', '普通'], 'title': 'Emotion', 'type': 'string'}}, 'properties': {'心情': {'$ref': '#/$defs/Emotion'}}, 'required': ['心情'], 'title': 'PersonInfo', 'type': 'object'}
>>> s = '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
>>> json.dumps(s)
'"\\\\{[ ]?\\"\\u5fc3\\u60c5\\"[ ]?:[ ]?(\\"\\u5f00\\u5fc3\\"|\\"\\u96be\\u8fc7\\"|\\"\\u666e\\u901a\\")[ ]?\\\\}"'
duming commented 3 months ago

Thank you very much for replying so quickly. I was wondering if you have any plans to address it soon?