antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.11k stars 3.69k forks source link

[css3] Grammar very slow due to numerous ambiguities #4256

Open kaby76 opened 5 days ago

kaby76 commented 5 days ago

See https://groups.google.com/g/antlr-discussion/c/8KQC7fwcXpQ

Using the latest ANTLR 4.13.2, with the official example repository's CSS3 grammar (https://github.com/antlr/grammars-v4/tree/master/css3), I generated a C++ parser to parse a simple CSS file of about 200 lines. Without adding any custom listener logic, on an M1 MacBook Pro, it takes 1 second for the first run! I'll consider that as a warm-up, but even after that, it takes 0.14 seconds each time! In contrast, my own homemade approach relying purely on string find only takes 0.002 seconds. Why is there such a performance difference? Is it because I'm using it incorrectly?

As with many of the grammars in the grammars-v4 repo, you never know what to expect.

The problem is that the css3 grammar is ambiguous. To debug the grammar, you can use "trperf" for a quick explanation of the ambiguity or "trparse --ambig" for a slow but detailed explanation of the ambiguity. These tools are from the Trash toolkit. https://github.com/kaby76/Trash. For "trparse --ambig", I had to pare down the input to a few hundred lines, otherwise it would take too long.

To name a few ambiguities, we have Decisions 55 in "ruleSet", 60 in "declaration", 98 in "fontFaceDeclaration", and 112 in "ws".

"ws" is particularly egregious because it is the most time-consuming expression. Using "trparse --ambig", and Bash diff, I found these rules to be a problem.

expr : term (operator_? term)* ;

term : number ws # knownTerm | percentage ws # knownTerm | dimension ws # knownTerm | String ws # knownTerm | UnicodeRange ws # knownTerm | ident ws # knownTerm | var # knownTerm | url ws # knownTerm | hexcolor # knownTerm | calc # knownTerm | function_ # knownTerm | unknownDimension ws # unknownTerm | dxImageTransform # badTerm ;

operator_ : '/' ws # goodOperator | Comma ws # goodOperator | Space ws # goodOperator | '=' ws # badOperator // IE filter and DXImageTransform function ;

For partial input

abbr[title] { border-bottom: 1px dotted; }

we see two parses for "1px",

examples/xxx.css.112: (expr (term (dimension (Dimension "1px")) (ws (Space " "))) (term (ident (Ident "dotted")) (ws))) examples/xxx.css.112: (expr (term (dimension (Dimension "1px")) (ws)) (operator_ (Space " ") (ws)) (term (ident (Ident "dotted")) (ws)))

(dotnet trparse -- --ambig examples/xxx.css | dotnet trxgrep -- ' //expr[.//Dimension[text()="1px"]]' | dotnet trtree -- -a)

A space should not be an operator!

Ken