Open RevanthRameshkumar opened 1 year ago
Oh it is because of the lookaheads, this works:
'(\not|not\(|-)'
This library isn't really designed for direct consumption, for that greenery
is a better choice.
What you are seeing is in fact that expected behavior. Having a lookahead makes the regex ->FSM transformer add extra padding at the end that is always required, since that is necessary when trying to figure out the overlap of regex.
Greenery actually doesn't parse lookaheads from when I tried it last, so your library is more useful to me. Kudos on the parse and fsm class...very useful!
Yes, grennery doesn't handle lookaheads. Those are really hard and problematic to include in a FSM and you better understand why before using them. Adding lookahead support in a way that works for the specific case of intersection checking is the reason I basically copyied over greeneries internals. If you want to do more than that, greenery is the better library to use.
Is there more info on why they are hard? When I do 'not(?=\s)' your lib actually produces the exact fsm I need. Actually I don't care about intersection...the fsm generation is what is important to me!
lookaheads that go beyond the end of the string are... weird to deal with, and especially non-fixed-length lookbehinds aren't regular. I am unsure if non-fixed-length lookaheads are regular. My library deals with those, but potentially not correctly.
What do you want to use the FSMs for?
I'm using it to generate a lark grammar compatible string by picking one token at a time (where the tokens are treated as atomic at any length)
Lark only lets you see next atomic terminals, but no sub terminal options. Combining with fsm lets you go sub terminal.
Both of the above fail. I expect the first to succeed and the second to fail, because the final state for the second should end at the first '-'.
In contrast:
both of the above succeed (but the span for the second one is still (0, 1)