Engelberg / instaparse

Bugs with end of file detection #191

Closed green-coder closed 5 years ago

green-coder commented 5 years ago

I found some strange behavior with #'\\Z', and I wonder if it is a bug.

(require '[instaparse.core :as insta])

((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = #'\\Z'")
 "abc\ndef\n")

;; The "end of file" is matched before "\n" in the parsed result.
=> 
[:Paragraph
 [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
 [:NonBlankLine "d" "e" "f" [:EOL [:EOF ""]]] ; <-- here
 [:BlankLine [:EOL "\n"]]]                    ; <-- and here

This other approach, which uses negative lookahead, does put the "\n" in the right place in the result, but there is another problem: the BlankLine is missing from the result. That may be a bug in instaparse.

((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = !#'.'")
 "abc\ndef\n")

=>
[:Paragraph [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
            [:NonBlankLine "d" "e" "f" [:EOL "\n"]]]
;; There is no BlankLine in the result anymore, but the parser says it matches.

I am using version 1.4.9 of instaparse.

Engelberg commented 5 years ago

In general, I've never used #"\Z". Offhand, I don't see how it would be useful, since instaparse is always going to try to match against the whole string anyway. But if that's what you want to do, I think you want to make the Z a lower-case z: https://stackoverflow.com/questions/2707870/whats-the-difference-between-z-and-z-in-a-regular-expression-and-when-and-how

The upper-case one matches both before and after the final newline character. That's Java (and therefore Clojure) behavior:

> (re-seq #"\Z" "\n")
("" "")
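To make the contrast concrete, here is a small sketch that uses only clojure.core (no instaparse required). The var names `upper-z-matches` and `lower-z-matches` are made up for illustration. Note that inside an instaparse grammar string the backslash is doubled, as in the original post (#'\\Z').

```clojure
;; Illustrative names; only clojure.core functions are used.

;; \Z matches at the very end of the input and also just before
;; a final line terminator, so there are two empty matches here.
(def upper-z-matches (re-seq #"\Z" "abc\n"))

;; \z matches only at the very end of the input: one empty match.
(def lower-z-matches (re-seq #"\z" "abc\n"))

(count upper-z-matches) ; => 2
(count lower-z-matches) ; => 1
```

With the lower-case form there is only one position where "end of file" can match, which should remove the ambiguity seen in the first parse result.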

Some other options available to you are:

As for the negative lookahead example, there's nothing in your grammar to force that it must end with an EOF, so the parse it produced is perfectly valid. Also, watch out: in Java/Clojure, the . in #"." does not match newline characters by default.
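The newline caveat can be checked directly at the REPL. The sketch below uses only clojure.core; the var names are hypothetical. It suggests that a lookahead such as !#'.' can presumably succeed immediately before a "\n" as well as at the true end of the input, which may not be what an EOF rule intends.

```clojure
;; Illustrative names; only clojure.core functions are used.

;; By default, . does not match a line terminator, so nothing is found.
(def dot-default (re-find #"." "\n"))      ; nil

;; With the DOTALL flag (?s), . matches any character, including "\n".
(def dot-dotall (re-find #"(?s)." "\n"))   ; "\n"

;; A character class such as [\s\S] also matches any character,
;; with no flag needed.
(def any-char (re-find #"[\s\S]" "\n"))    ; "\n"
```

If the goal is "no characters of any kind remain", a lookahead built on one of the last two patterns is likely a closer fit than one built on a bare dot.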