MegaIng / interegular

Allows to check regexes for overlaps. Based on greenery by @qntm.
MIT License
40 stars 6 forks source link

[Bug] found a bug with handling choices #3

Open RevanthRameshkumar opened 1 year ago

RevanthRameshkumar commented 1 year ago
fsm = interegular.parse_pattern("(not(?=\s)|not(?=\()|-)").to_fsm()
assert fsm .accepts('-') == True
assert fsm .accepts('--') == False

Both of the above fail. I expect the first to succeed and the second to fail, because the final state for the second should end at the first '-'.

In contrast:

re.match('(not(?=\s)|not(?=\()|-)', '-') 
re.match('(not(?=\s)|not(?=\()|-)', '--') 

both of the above succeed (but the span for the second one is still (0, 1)

RevanthRameshkumar commented 1 year ago

Oh it is because of the lookaheads, this works:

'(\not|not\(|-)'
MegaIng commented 1 year ago

This library isn't really designed for direct consumption, for that greenery is a better choice.

What you are seeing is in fact that expected behavior. Having a lookahead makes the regex ->FSM transformer add extra padding at the end that is always required, since that is necessary when trying to figure out the overlap of regex.

RevanthRameshkumar commented 1 year ago

Greenery actually doesn't parse lookaheads from when I tried it last, so your library is more useful to me. Kudos on the parse and fsm class...very useful!

MegaIng commented 1 year ago

Yes, grennery doesn't handle lookaheads. Those are really hard and problematic to include in a FSM and you better understand why before using them. Adding lookahead support in a way that works for the specific case of intersection checking is the reason I basically copyied over greeneries internals. If you want to do more than that, greenery is the better library to use.

RevanthRameshkumar commented 1 year ago

Is there more info on why they are hard? When I do 'not(?=\s)' your lib actually produces the exact fsm I need. Actually I don't care about intersection...the fsm generation is what is important to me!

MegaIng commented 1 year ago

lookaheads that go beyond the end of the string are... weird to deal with, and especially non-fixed-length lookbehinds aren't regular. I am unsure if non-fixed-length lookaheads are regular. My library deals with those, but potentially not correctly.

What do you want to use the FSMs for?

RevanthRameshkumar commented 1 year ago

I'm using it to generate a lark grammar compatible string by picking one token at a time (where the tokens are treated as atomic at any length)

RevanthRameshkumar commented 1 year ago

Lark only lets you see next atomic terminals, but no sub terminal options. Combining with fsm lets you go sub terminal.