Engelberg / instaparse

Eclipse Public License 1.0
2.74k stars 149 forks source link

Parser looping indefinitely on certain syntax error with pre-condition until OutOfMemoryError #220

Closed st-cheewah closed 6 months ago

st-cheewah commented 1 year ago

When parsing the following protocol buffer text, the parser keeps consuming 100% cpu until OutOfMemoryError.

syntax = "proto3";

message Msg1 {

    int32 a = 1 [packed = "true",
                 deprecated = true,
                 (my_option).field1 = 1234,
                 (my_option).field2 = 5678];
    int32 b = 2 [ msg = { abc: 123.4 },
                 whatever = true];
}

If int32 a = .... ; (total 4 lines) is removed, then the parser returns syntax error for line containing msg = { abc: 123.4 },, which is the expected behavior.

If msg = { abc: 123.4 }, is removed then the parser will succeed, which verifies that int32 a = .... ; is valid syntax.

Hence it appears that the earlier text int32 a = .... ; is a pre-condition that somehow caused the parser to loop indefinitely on a syntax error that can otherwise be detected.

More info: The parser uses the following ebnf:

proto = syntax { import | package | option | message | enum | service | <emptyStatement> }

octalDigit3  = #'[0-7][0-7][0-7]'
hexDigit     = #'[0-9a-fA-F]' 
hexDigit2    = #'[0-9a-fA-F][0-9a-fA-F]'
hexDigits    = #'[0-9a-fA-F][0-9a-fA-F]*'

ident = #'[a-zA-Z_][a-zA-Z0-9_]*'
fullIdent = ident { "." ident }
messageName = ident
enumName = ident
fieldName = ident
oneofName = ident
mapName = ident
serviceName = ident
rpcName = ident
messageType = [ "." ] { ident "." } messageName

(* proto2 only: "required" "optional" *)
label = [ "repeated" ]

intLit     = decimalLit | octalLit | hexLit
decimalLit = #'[1-9][0-9]*'
octalLit   = #'0[0-7]*'
hexLit     = ( '0x' | '0X' ) hexDigits

<floatLit> = ( decimals "." [ decimals ] [ exponent ] | decimals exponent | "."decimals [ exponent ] ) | "inf" | "nan"

<decimals>  = #'[0-9]+'
<exponent>  = ( "e" | "E" ) [ "+" | "-" ] decimals

boolLit = "true" | "false" 

strLit = ( <"'"> { charValue } <"'"> ) |  ( <'"'> { charValue } <'"'> )
charValue = hexEscape | octEscape | charEscape |  #'[^\x00\n\\]'
hexEscape = <('\\x' | '\\X')> hexDigit2
octEscape = <'\\'> octalDigit3
charEscape = <'\\'> ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | '\\' | "'" | '"' )

emptyStatement = <(#'[\s\n\r;]*')>

signedFloatLit = [ "-" | "+" ] floatLit;
signedIntLit = [ "-" | "+" ] intLit;
<constant> = fullIdent | signedIntLit | signedFloatLit | strLit | boolLit 

singleQuote = "'"
doubleQuote = '"'
syntax = <"syntax"> <"="> ( <singleQuote> 'proto3' <singleQuote> |
                            <doubleQuote> 'proto3' <doubleQuote> ) <";">

import = <"import"> [ "weak" | "public" ] strLit <";"> 

package = <"package"> fullIdent <";">

option = <"option"> optionName  <"="> constant <";">
optionName = ( ident | <"("> fullIdent <")"> ) { "." ident }

priType = "double" | "float" | "int32" | "int64" | "uint32" | "uint64"
      | "sint32" | "sint64" | "fixed32" | "fixed64" | "sfixed32" | "sfixed64"
      | "bool" | "string" | "bytes"
type = messageType | priType

field = label type fieldName <"="> fieldNumber [ <"["> fieldOptions <"]"> ] <";">

fieldNumber = intLit;
fieldOptions = fieldOption { <",">  fieldOption }
fieldOption = optionName <"="> constant

oneof = <"oneof"> oneofName <"{"> { oneofField | <emptyStatement> } <"}">
oneofField = type fieldName <"="> fieldNumber [ <"["> fieldOptions <"]"> ] <";">

mapField = <"map"> <"<"> keyType <","> type <">"> mapName <"="> fieldNumber [ <"["> fieldOptions <"]"> ] <";">
keyType = "int32" | "int64" | "uint32" | "uint64" | "sint32" | "sint64" |
          "fixed32" | "fixed64" | "sfixed32" | "sfixed64" | "bool" | "string"

<ranges> = range { <","> range }
range =  intLit [ <"to"> ( intLit | "max" ) ]

reserved-ranges = <"reserved"> ranges <";">
reserved-names = <"reserved"> strFieldNames <";">
<reserved> = reserved-ranges | reserved-names;
<strFieldNames> = strFieldName { <","> strFieldName }
<strFieldName> = <"'"> fieldName <"'"> | <'"'> fieldName <'"'>

enum = <"enum"> enumName enumBody
<enumBody> = <"{"> { option | enumField | <emptyStatement> } <"}">
enumField = ident <"="> intLit [ <"["> enumValueOption { <",">  enumValueOption } <"]"> ]<";">
enumValueOption = optionName <"="> constant

message = <"message"> messageName messageBody
<messageBody> = <"{"> { field | enum | message | option | oneof | mapField |
 reserved | <emptyStatement> } <"}">

(* proto2 only: stream *)
service = <"service"> serviceName <"{"> { option | rpc | <emptyStatement> } <"}">

returnType = messageType
rpc = <"rpc"> rpcName <"("> [ "stream" ] messageType <")"> <"returns"> <"("> [ "stream" ]
returnType <")"> (( <"{"> { option | <emptyStatement> } <"}"> ) | <";"> )

with the following used for :auto-whitespace

(def ^:private void
  (insta/parser
   "void = { #'\\s+' | #'(\\/\\*)[\\s\\S]*?(\\*\\/)' | '//' #'.*' }"))
Engelberg commented 1 year ago

The most likely explanation is that there is something about your grammar that creates an exponential explosion of possible parses. When there's a correct solution, it finds it quickly, but when there is an error, it is forced to backtrack through the enormous number of alternative possible parses to prove there is no valid parse. If you're not seeing multiple parse results on your test cases, then the explosion of possibilities is likely occurring in a part of your grammar that is hidden from the final parse. If that is the case, my guess would be that the problem is with your whitespace parser. Most likely, your whitespace definition allows any region of whitespace to potentially be chopped up in a variety of different ways. And in an error situation, all of those possibilities must be tried. When your text is long enough to have a few different whitespace areas, each with multiple possibilities, you get that exponential explosion. So, my suggestion is to do some more testing on void in isolation, non-hidden, on various combinations of whitespace, and see how many possible parses it generates. If you can find a way to write your whitespace parser as a single regex, that may alleviate the problem. Sorry I don't have more time right now to directly help troubleshoot this for you, but hopefully I'm pointing you in the right direction to solve it. Good luck!

st-cheewah commented 1 year ago

Ah I see it could be a side effect of my inefficient grammar. I will investigate further and update the ticket once I have more info. Thanks for the detailed explanation and pointers on how to troubleshoot, much appreciated! PS: tried parsing whitespace with void = #'\\s+' which didn't help; will be trying other ideas