Open benlipkin opened 9 months ago
+1, getting similar issues with even a subset of json that's flat key/value pairs only.
""" start: dict
dict : "{" [pair ("," pair)*] "}"
pair : ESCAPED_STRING ":" ESCAPED_STRING
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS
"""
This should be high priority in my opinion, for supporting a realistic set of grammars.
As a quick workaround for those interested particularly in JSON, here's what I've been doing.
The unsupported rule from the above grammar is the Lark implementation of ESCAPED_STRING
. I just rewrote this for JSON specifically, which only requires double quotes to be escaped so was a pretty quick rewrite. My new implementation is as follows:
json_grammar = r"""
?start: value
?value: object
| array
| string
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
array : "[" [value ("," value)*] "]"
object : "{" [pair ("," pair)*] "}"
pair : string ":" value
inner: /([^"]|\\\")+/ |
string : "\"" inner "\""
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS
"""
As a summary, string : "\"" inner "\""
ensures any string is captured by quotes, and inner: /([^"]|\\\")+/ |
supports empty strings, non-empty strings without double quotes, or double quotes only if preceded by a backslash.
This does not globally solve the issue, and this broader problem related to lookarounds remains open, but hopefully this can help folks who are running into this in the meantime. For other languages requiring other characters to be escaped for their string implementations, this approach can be quickly adjusted as well.
Yeah, we could at the very least offer replacements for the common.*
definitions that don't unnecessarily use look-arounds and other non-regular features.
I don't see any non-regular examples in common.lark
other than ESCAPED_STRING
and C_COMMENT
. https://github.com/lark-parser/lark/blob/ab11ead8316978d1d36f35adb6fc88bdae43c2c4/lark/grammars/common.lark
interegular
uses a finite state machine which cannot handle non-regular features such as look-arounds.
We could convert non-regular lark terminals to lark rules with multiple regular lark terminals programmatically at runtime, but this seems like overkill. I think we should just document the fact that "Outlines-lark" is a subset of lark which disallows non-regular terminals.
Another problem I've come across experimenting with the python.lark
grammar is that Lark currently compiles some definitions into regexp with lookarounds https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L631-L655
We either need to monkeypatch TerminalTreeToPattern
into a no-op, or submit a patch to Lark.
Concretely, the following lark definition
HEX_NUMBER.2: "0" ("x" | "X") ("_"? ("0".."9" | "a".."f" | "A".."F"))+
is compiled into
0(?:x|X)(?:(?:_)?(?:[0-9]|[a-f]|[A-F]))+
@lapp0 , I don't think that compiling to DFA precludes the handling of lookaround (from a theoretical perspective). Other features of python regex like backreference are indeed not regular, but lookaround is a pragmatic add-on that doesn't change expressive power. That being said, it doesn't appear trivial to "desugar" in practice, and many regex engines actually do implement lookaround with backtracking (even though it's not theoretically necessary).
Some community discussions I've found on this topic:
It also looks like there's some contemporary work on actually implementing non-backtracking regex engines that can support lookaround, but what I've found so far uses a different symbolic derivative-based formalism, rather than the automata-based formalism of interegular
, so not sure how much is trivially translatable.
All this to say, this isn't a theoretical limitation. I just don't know enough to approach this problem in practice though.
I think your suggestion of avoiding lookaround, providing alternatives for common grammars, and attempting to minimize their construction during lark compilation, seems like probably the best value strategy that should get things most of the way there for real use cases.
Thanks to the maintainers for continuing to work on this issue!
Thanks for the info @benlipkin, glad to see I was mistaken and look-arounds are allowed within a regular language.
I've tried to integrate some common lark grammars into Outlines for https://github.com/outlines-dev/outlines/pull/562 however most of them explicitly or implicitly result in terminals with look-arounds. I'll look into how I might integrate lookarounds into interegular
.
What behavior of the library made you think about the improvement?
Outlines currently uses
interegular
to compile regexps to FSMs, but not all features of python regex are supported, in particular lookarounds.I ran into this issue when trying to test the following grammar:
In the
CFGFSM
constructor, terminal patterns are compiled to regexps so that they can later be used to initializeRegexFSM
proposals, e.g.,RegexFSM(terminal.pattern.to_regexp())
. When the supplied regex includes lookarounds, we get an error such as the following:How would you like it to behave?
It would seem there are at least a couple options here:
interegular
to expand support to lookarounds. This is on the TODOs outlined in their README, and perhaps could use a push.