Open duming opened 3 months ago
Thanks for the well-documented issue!
This appears to be an issue with our enum handling in json_schema.py, specifically the call to json.dumps:
>>> PersonInfo.model_json_schema()
{'$defs': {'Emotion': {'enum': ['开心', '难过', '普通'], 'title': 'Emotion', 'type': 'string'}}, 'properties': {'心情': {'$ref': '#/$defs/Emotion'}}, 'required': ['心情'], 'title': 'PersonInfo', 'type': 'object'}
>>> s = '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
>>> json.dumps(s)
'"\\\\{[ ]?\\"\\u5fc3\\u60c5\\"[ ]?:[ ]?(\\"\\u5f00\\u5fc3\\"|\\"\\u96be\\u8fc7\\"|\\"\\u666e\\u901a\\")[ ]?\\\\}"'
Thank you very much for replying so quickly. I was wondering if you have any plans to address it soon?
Describe the issue as clearly as possible:
When using JSON-structured generation, enum types containing non-ASCII characters (such as Chinese) do not work properly: the characters are forcibly escaped to ASCII sequences. This behavior leads to much slower generation and much worse generation quality.
The example code in the section below does not raise an error; it is only provided for debugging. The problem is clearer when inspecting the direct output of the LLM, for example at line #225 in api.py.
In this example, the expected output is '开心', which takes 2 token IDs, but the actual output is "\u5f00\u5fc3", which takes 14 token IDs. The expected regex_str is '\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}', while the actual regex_str is '\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'. A quick check of what each pattern accepts is shown below.
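To make the difference concrete, here is a minimal check (a sketch, not code from the original report) of what each pattern accepts:

import re

expected = r'\{[ ]?"心情"[ ]?:[ ]?("开心"|"难过"|"普通")[ ]?\}'
actual = r'\{[ ]?"心情"[ ]?:[ ]?("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")[ ]?\}'

# The expected pattern accepts the natural JSON output.
print(bool(re.fullmatch(expected, '{"心情": "开心"}')))        # True
# The escaped pattern rejects it and only accepts the ASCII-escaped form,
# which is why the model is forced to spell out "\u5f00\u5fc3" token by token.
print(bool(re.fullmatch(actual, '{"心情": "开心"}')))           # False
print(bool(re.fullmatch(actual, r'{"心情": "\u5f00\u5fc3"}')))  # True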
Although the output appears to contain the correct characters again after format_sequence, this behavior is still not correct, for two reasons: the escaped form requires far more tokens to generate, which slows inference, and constraining the model to unnatural escape sequences degrades generation quality.
Quick fix: replacing the line at https://github.com/outlines-dev/outlines/blob/5e8f7709e3cecd02943120ed01420f00159cedbc/outlines/fsm/json_schema.py#L275 with choices.append(re.escape(json.dumps(choice, ensure_ascii=False))) fixes the problem, but I don't know whether it would cause any other issues.
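For illustration, here is a simplified standalone sketch of the enum handling (enum_to_regex is a hypothetical helper, not the actual outlines code) showing what the alternation looks like with and without ensure_ascii=False:

import json
import re

def enum_to_regex(values, ensure_ascii=True):
    # JSON-encode each enum value, regex-escape it, and join the results into
    # an alternation, mirroring the line linked above.
    choices = [re.escape(json.dumps(v, ensure_ascii=ensure_ascii)) for v in values]
    return "(" + "|".join(choices) + ")"

print(enum_to_regex(["开心", "难过", "普通"]))
# ("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")
print(enum_to_regex(["开心", "难过", "普通"], ensure_ascii=False))
# ("开心"|"难过"|"普通")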
Steps/code to reproduce the bug:
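The original reproduction code was not captured here; the following is a minimal sketch reconstructed from the schema shown above (the Emotion member names and the build_regex_from_schema entry point are assumptions and may differ from the original code):

import json
from enum import Enum
from pydantic import BaseModel
from outlines.fsm.json_schema import build_regex_from_schema  # assumed entry point

class Emotion(str, Enum):
    happy = "开心"
    sad = "难过"
    normal = "普通"

class PersonInfo(BaseModel):
    心情: Emotion

schema = json.dumps(PersonInfo.model_json_schema())
regex_str = build_regex_from_schema(schema)
print(regex_str)
# The enum alternation comes out as ("\\u5f00\\u5fc3"|"\\u96be\\u8fc7"|"\\u666e\\u901a")
# instead of ("开心"|"难过"|"普通").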
Expected result:
Error message:
No response
Outlines/Python version information:
Context for the issue:
Reduces inference speed.
Reduces LLM performance.