Closed · liqul closed this 3 months ago
The pattern in `main` changed yesterday:

```python
# allow `\"`, `\\`, or any character which isn't a control sequence
STRING_INNER = r'([^"\\\x00-\x1F\x7F-\x9F]|\\["\\])'
STRING = f'"{STRING_INNER}*"'
```
For a valid string inner, compliant with JSON, we need to ensure a character is either not a control sequence (e.g. `" a \n\t a"` isn't valid), an escaped backslash (`\\`), or an escaped quote (`\"`). Would be great if we could simplify it though, however `[\w ]` doesn't work, e.g.

```python
>>> print(re.match(r'[\w ]', "!"))
None
```
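To make the difference concrete, here is a quick comparison (a minimal sketch using only `re`; the sample characters are arbitrary):

```python
import re

# Default pattern from outlines' json_schema.py: one valid JSON string character
STRING_INNER = r'([^"\\\x00-\x1F\x7F-\x9F]|\\["\\])'
# Proposed simplification
SIMPLE = r'[\w ]'

# `[\w ]` rejects ordinary punctuation that is perfectly legal inside a JSON string
for s in ['a', ' ', '!', ',', r'\"']:
    print(f"{s!r}: STRING_INNER={bool(re.fullmatch(STRING_INNER, s))}, "
          f"[\\w ]={bool(re.fullmatch(SIMPLE, s))}")
```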
You can test your new `STRING_INNER` here: https://github.com/outlines-dev/outlines/blob/main/tests/fsm/test_json_schema.py#L117-L132
Thanks for the quick response. I know that my simplification has many limitations. But the issue I observed is that the correctness of string properties in the generated JSON object is consistently low compared to other types, even when the value is quite commonsense. I was not suggesting changing the definitions.
I've run into a similar issue.

I think there are two solutions here: use `constr(regex=r'[\w ]*')` instead of `str` for pydantic fields, or perhaps an `outlines.types.smart_str` which requires the first character to be alphanumeric to improve generation quality? Let me know if this makes sense to you or if you have any other ideas.
I'm comparing different libraries for constrained generation, using the same model and the same prompt+schema with each approach. That's how I found that outlines achieves relatively low performance for `string` types compared to other libraries, and I'm sure that this is not solely a problem of the model.
The interesting part is that different libraries, although based on a similar underlying logits processing technique, have different implementations of the regex for the string type. That's why I believe recommending a different default regex could improve the performance.
> That's why I found that outlines achieves a relatively low performance for string types compared to other libraries, and I'm sure that this is not solely a problem of the model.
That's very interesting. Could you link the libraries which perform best in your experiments? Or is it simply `[\w ]` as you described?
You can take a look at this one https://github.com/noamgat/lm-format-enforcer
Looking at lm-format-enforcer, it seems they allow any token to be produced other than quote. I'm not sure what is making outlines perform worse, but experimenting with better string patterns and `pydantic.constr` is definitely worth doing.
If we guarantee the first character of every string is alphanumeric (but don't constrain it otherwise thereafter), the output is much better.
```python
_any_alphanum = r'[^\W_]'
_any_string_inner = r'([^"\\\x00-\x1F\x7F-\x9F]|\\["\\])'
smart_string = f"({_any_alphanum}{_any_string_inner}*)?"
```

Full pattern: `'([^\\W_]([^"\\\\\\x00-\\x1F\\x7F-\\x9F]|\\\\["\\\\])*)?'`

e.g. `{'name': {'type': 'string', 'description': 'The name of the airport.', 'pattern': '([^\\W_]([^"\\\\\\x00-\\x1F\\x7F-\\x9F]|\\\\["\\\\])*)?'}}`
Output:

```json
{
  "name": "Los Angeles International Airport",
  "IATA": "LAX",
  "ICAO": "KLAX",
  "location": {
    "city": "Los Angeles",
    "country": "United States",
    "coordinates": {
      "latitude": 33.9416,
      "longitude": -118.4085
    }
  },
  "timezone": "America/Los_Angeles"
}
```
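A quick sanity check of the `smart_string` pattern with plain `re` (the sample strings are arbitrary):

```python
import re

_any_alphanum = r'[^\W_]'
_any_string_inner = r'([^"\\\x00-\x1F\x7F-\x9F]|\\["\\])'
smart_string = f"({_any_alphanum}{_any_string_inner}*)?"

# The first character must be alphanumeric; after that, any legal JSON
# string character is allowed. The trailing `?` still permits empty strings.
for s in ["Los Angeles International Airport", "LAX", ", ", ""]:
    print(f"{s!r}: {bool(re.fullmatch(smart_string, s))}")
```

Note that degenerate outputs like `", "` no longer match, which is exactly the failure mode reported above.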
You will likely also get better results if you apply a chat template, per https://github.com/outlines-dev/outlines/issues/987
@liqul instruction-finetuned models tend to be annoyingly template-dependent, and the more they are finetuned, the worse the problem gets.
imo it would also be interesting to measure, through ablation benchmarks, how much applying/not applying the chat template affects model performance
@lapp0 Cool, I didn't realize the chat template is not applied by default in outlines. I believe this can improve the generation quality, though I haven't tried it.
It isn't applied by default yet; the issue still hasn't been implemented. Based on my observations it almost always improves quality, though, and should be applied :)
Describe the issue as clearly as possible:
I'm not sure if I missed anything. Basically, I want to extract information from a provided paragraph based on a JSON schema. When the schema contains properties of string type, the output values are wrong, like ", " or ": ". I looked into the implementation in `json_schema.py` and can see the default regex for strings is defined by the `STRING_INNER` pattern shown above. If I change the definition to something simpler like `r'[\w ]'`, the performance seems to get better, but I didn't test comprehensively. I'm not sure if you have tested this scenario before or what might be causing this issue.
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Outlines 0.0.45
Context for the issue:
The provided example is not extremely bad. Sometimes, many properties of string type are ", ", as shown below: