Engelberg / instaparse

Eclipse Public License 1.0

Expressing 'not allowed at the end of' - issue with lookbehind regexes #222

Closed: judepayne closed this issue 1 year ago

judepayne commented 1 year ago

Hi, I want to express that none of a series of strings is allowed at the end of a token. I have a regex that does this using a negative lookbehind. In this case, it prevents 'fill', 'style' or ' ' from appearing at the end of the string.

(The regex also uses lookaheads to check that certain other strings, e.g. `--`, `->`, etc., are not present anywhere in the target string. That works fine.)

So my grammar would look like:

    k = #'^[^ ](?:(?!--)(?!<-)(?!->)(?!<->)[a-z <>-])*(?<!fill|style| )$' colon
    colon = ':'

The regex's character class does not allow ':', so perhaps I misunderstand how instaparse interacts with regexes. I'd assumed instaparse moves through the input stream, adding successive characters to the attempted match, then stopping and backing off by one character when the match stops working.

If I remove the negative lookbehind part from the end of the regex, the same grammar works as expected. (I'm in a Clojure environment, not ClojureScript, so there are no JavaScript limitations on lookbehind.)

Could you help me understand what is going on here please?

If there's another way to express "strings not allowed at the end" natively in instaparse, with much simpler regexes, that would be awesome as well.
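
One possible explanation, sketched here in Java because regexes in JVM instaparse are `java.util.regex` patterns. The sketch assumes that instaparse tries a regex terminal at the current position against the rest of the input, roughly like `Matcher.lookingAt` (an assumption, not something confirmed in this thread). Under that assumption the trailing `$` can only succeed at the end of the input, so it can never be followed by the `colon` rule. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // The regex from rule k, verbatim.
    static final Pattern WITH_ANCHOR = Pattern.compile(
        "^[^ ](?:(?!--)(?!<-)(?!->)(?!<->)[a-z <>-])*(?<!fill|style| )$");

    // The same regex with only the trailing $ removed.
    static final Pattern WITHOUT_ANCHOR = Pattern.compile(
        "^[^ ](?:(?!--)(?!<-)(?!->)(?!<->)[a-z <>-])*(?<!fill|style| )");

    // Returns the prefix of input matched by p, or null when p does not
    // match at the front. ASSUMPTION: this mimics how instaparse applies
    // a regex terminal to the remaining input.
    static String matchAtFront(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // $ demands end of input, but ':' is still ahead, so nothing matches.
        System.out.println(matchAtFront(WITH_ANCHOR, "color:"));    // null
        // Without $, the key is matched and ':' is left for the next rule.
        System.out.println(matchAtFront(WITHOUT_ANCHOR, "color:")); // color
        // The lookbehind rejects "fill", so the regex backtracks to "fil";
        // a following colon rule would then fail on 'l'.
        System.out.println(matchAtFront(WITHOUT_ANCHOR, "fill:"));  // fil
    }
}
```

If this reading is right, dropping `$` (the leading `^` is then redundant as well) while keeping the lookbehind might be enough, since a key ending in 'fill' still cannot be followed directly by ':'.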

judepayne commented 1 year ago

I should add that I got quite close with a 'native' solution that looks like:

    k2 = (!key-stop any)+ colon
    key-stop = colon | ';' | dir | 'fill' | 'style'
    dir = '--' | '->' | '<-' | '<->'
    any = #'.'
    colon = ':'

but that prevents 'fill' and 'style' from appearing anywhere in the string, rather than only at the end. (The other tokens in key-stop shouldn't be allowed anywhere in the string.)
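
A rough sketch of one way to restrict 'fill' and 'style' only at the end: add a second negative lookahead that rejects them only when the colon follows immediately. It is written as a Java regex so it can be run directly. The pattern, class and method names are hypothetical and not from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyLookaheadDemo {
    // Hypothetical pattern. Group 1 is the key, followed by the colon.
    //   (?![:;]|--|->|<-)    bans ':', ';' and the dir tokens anywhere
    //                        ("<->" is covered because it starts with "<-")
    //   (?!(?:fill|style):)  bans 'fill' and 'style' only when ':' comes
    //                        straight after them
    static final Pattern KEY = Pattern.compile(
        "((?:(?![:;]|--|->|<-)(?!(?:fill|style):).)+):");

    // Returns the key at the front of input, or null if no valid key
    // followed by ':' starts there.
    static String parseKey(String input) {
        Matcher m = KEY.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseKey("fill color: red")); // fill color
        System.out.println(parseKey("fill: red"));       // null
        System.out.println(parseKey("backfill: red"));   // null
        System.out.println(parseKey("styled: red"));     // styled
        System.out.println(parseKey("a->b: red"));       // null
    }
}
```

The same idea might carry over to the native grammar by taking 'fill' and 'style' out of key-stop and adding a second lookahead such as `!(('fill' | 'style') colon)` in front of `any`, though that variant is untested here.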

judepayne commented 1 year ago

Hi, I was able to refactor my grammar so that only two requirements remain: first, recognising when a string is contained within another string, and second, recognising when a string is not in a set of strings. The first is easy to do with instaparse lookaheads, e.g.:

    key = (!key-stop-chars any)+
    key-stop-chars = #'[:;]'
    any = #'.'

The second involved a regex with multiple lookaheads, so the overall solution became:

    key = (!key-stop-chars not-in)+
    key-stop-chars = #'[:;]'
    not-in = #'^(?:(?!^font-color[ ]*$)(?!^width[ ]*$)(?!^height[ ]*$)(?!^shape[ ]*$)..etc..)^.'
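
As a possible simplification, the chain of lookaheads can be folded into a single negative lookahead over an alternation. The Java sketch below assumes the key text is checked as a whole string, which may not match how not-in is used inside the full grammar. It lists only the four names visible above, since the rest of the list is elided. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ReservedKeyDemo {
    // One lookahead replaces the chain of (?!^name[ ]*$) groups.
    // Only the four names shown in the thread are included here.
    static final Pattern NOT_RESERVED = Pattern.compile(
        "(?!(?:font-color|width|height|shape) *$)[^:;]+");

    // True when key contains no ':' or ';' and is not exactly one of the
    // reserved names (optionally followed by spaces).
    static boolean isAllowedKey(String key) {
        return NOT_RESERVED.matcher(key).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedKey("width"));    // false
        System.out.println(isAllowedKey("width  "));  // false
        System.out.println(isAllowedKey("widths"));   // true
        System.out.println(isAllowedKey("my width")); // true
        System.out.println(isAllowedKey("a:b"));      // false
    }
}
```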

Maybe someone else will have a similar challenge in the future. If there's a better way of solving this, I'd love to know.

Thank you @Engelberg for this fantastic library. Its quality and longevity are amazing!

Engelberg commented 1 year ago

I'm glad you were able to find a solution!