antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org

Parsing performance orders of magnitude slower with 4.3 vs 3.4 #658

Closed: arifogel closed this issue 10 years ago

arifogel commented 10 years ago

I started with an ANTLR 3 grammar (with several subordinate grammars) for parsing Cisco configuration files. It usually gets through all of my test files instantaneously from a user perspective. In the last couple weeks I ported to ANTLR 4, and am now using 4.3 specifically. I did end up making some structural changes to the grammar(s), but nothing too crazy. Now it takes from .5s to 1s per file for the parsing stage (calling the <parser>.<start_rule>() function). I tried setting prediction mode to SLL, but it makes no difference. I'd be happy to provide more information if requested, but I don't know where to start.
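
For reference, SLL prediction is typically enabled on the parser's interpreter before the start rule is invoked; the minimal sketch below uses the Java runtime, and the lexer/parser class names and start rule are placeholders rather than the real names from my grammar:

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.PredictionMode;

public class SllParseDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder generated classes; the real names come from the grammar.
        CiscoLexer lexer = new CiscoLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CiscoParser parser = new CiscoParser(tokens);

        // Use the faster SLL prediction algorithm instead of the default ALL(*).
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);

        parser.cisco_configuration(); // placeholder start rule
    }
}

(SLL is usually combined with a fallback to full ALL(*) when a syntax error occurs, but the simple form above is enough to check whether prediction is the bottleneck.)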

Please help me to optimize parsing performance for my grammar. The various versions of the grammar are in my github repository (arifogel/batfish). You can look at the .g files in the batfish.grammar.cisco package in the initial commit of the master branch for the v3 version. The .g4 files in that package in the latest commit contain my v4 grammars.

Thanks for any help you can provide.

sharwell commented 10 years ago

I haven't run the performance analysis on it, but I did notice that on the following line, it appears that 404.0 would not be recognized as a FLOAT.

https://github.com/arifogel/batfish/blob/bcdcc7b7f414273ed954e0484d5bb7f738fde9be/projects/batfish/src/batfish/grammar/cisco/CiscoGrammarCommonLexer.g4#L3231

I believe the rule was meant to be structured as (note the new location of the * character):

FLOAT
  : {!inComment}? F_PositiveDigit F_Digit* '.' F_Digit+
  ;

sharwell commented 10 years ago

Performance of the lexer would be improved if the following predicate and character were reversed:

https://github.com/arifogel/batfish/blob/bcdcc7b7f414273ed954e0484d5bb7f738fde9be/projects/batfish/src/batfish/grammar/cisco/CiscoGrammarCommonLexer.g4#L3446-L3448

As written, the predicate will be evaluated for every letter, digit, or symbol appearing in a VARIABLE. If you instead write ':' {!enableIPV6_ADDRESS}?, the predicate will only be evaluated if/when a : character appears in the variable.

The other semantic predicates in the lexer are not as intrusive, but they could still be improved.

For example, {foo}? F_Digit+ is more efficiently written as either F_Digit+ {foo}? or F_Digit {foo}? F_Digit*.

sharwell commented 10 years ago

When performance is a concern, avoid using non-greedy operators, especially in parser rules.

https://github.com/arifogel/batfish/blob/master/projects/batfish/src/batfish/grammar/cisco/CiscoGrammar.g4#L23

arifogel commented 10 years ago

First of all, I wanted to thank you for your quick response, and to apologize for taking so long to get back to you. I implemented your suggested changes (except for the FLOAT thing), and did not notice any improvement in performance. I should point out that lexing is not the bottleneck, so I doubt any changes to the lexer will make much difference, unless they serve to reduce the number of tokens in the resulting stream. At this point I would like to try more complex methods for analyzing the performance issues, e.g. profiling, but I do not know where to start. Can you suggest next steps?
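
One built-in next step is the parser's profiling mode, which records how much time adaptive prediction spends in each decision. A rough sketch with the Java runtime follows; the class names, start rule, and input file are placeholders:

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.DecisionInfo;

public class ProfileParseDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder generated classes; substitute the real lexer/parser names.
        CiscoLexer lexer = new CiscoLexer(new ANTLRFileStream(args[0]));
        CiscoParser parser = new CiscoParser(new CommonTokenStream(lexer));

        parser.setProfile(true);      // collect per-decision statistics
        parser.cisco_configuration(); // placeholder start rule

        // Print the decisions that spent measurable time in prediction.
        for (DecisionInfo d : parser.getParseInfo().getDecisionInfo()) {
            if (d.timeInPrediction == 0) {
                continue;
            }
            int ruleIndex = parser.getATN().decisionToState.get(d.decision).ruleIndex;
            System.out.printf("decision %d in rule %s: %.1f ms prediction, %d SLL lookahead ops%n",
                    d.decision, parser.getRuleNames()[ruleIndex],
                    d.timeInPrediction / 1e6, d.SLL_TotalLook);
        }
    }
}

If most of the elapsed time does not show up in those decision statistics, the bottleneck is probably the lexer rather than parser prediction.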

sharwell commented 10 years ago

I was able to build the project in your GitHub repository. Can you provide a sample method which I can use to run a test which demonstrates the slow behavior?

arifogel commented 10 years ago

The build instructions in the README cover far more than you need just to test the parser. Try the following steps:

  1. Check out commit 5eea3ca5082d199413b638d391a62d6dd0112c9f (HEAD of 'unstable' branch as of this reply).
  2. Source batfish_functions.sh: . /tools/batfish_functions.sh
  3. Build non-logicblox portion of project: batfish_build
  4. Parse the example test rig: batfish -testrig /test_rigs/example -sv -svpath

If you want to use an IDE (e.g. Eclipse) to debug, the 'main' function is in 'Driver.java'. Use the same command line options as in step 4.

Edit: To build in Eclipse you must import the project located in /projects/batfish. If you use ANTLR4IDE, you must uncheck "Tool is activated" in Window=>Preferences=>ANTLR 4=>Tool


arifogel commented 10 years ago

Oh wow, I totally misread that as you saying you were UNABLE to build my project :P


sharwell commented 10 years ago

I ran the test as you described, and observed that the parser is reporting a large number of syntax errors while parsing the files. Is this what you are observing as well?

sharwell commented 10 years ago

Never mind; the syntax errors were on my end. I needed to update the lexer's newline fragment to support Windows line endings:

fragment
F_Newline
:
    '\n'
    | '\r' '\n'?
;

sharwell commented 10 years ago

I recorded a 24.828 sec execution of Batfish.parseVendorConfigurations (I added a loop around the code which parses input files). Of that, 24.406 sec. was spent in Lexer.nextToken. I would say you most certainly have a lexer problem.
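
A quick way to see that split on any input, without attaching a profiler, is to time the lexer separately by filling the token stream before the parser runs. A minimal sketch follows; the class names and start rule are placeholders:

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class LexVsParseTiming {
    public static void main(String[] args) throws Exception {
        // Placeholder generated classes; substitute the real lexer/parser names.
        CiscoLexer lexer = new CiscoLexer(new ANTLRFileStream(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        long start = System.nanoTime();
        tokens.fill(); // drives Lexer.nextToken over the entire file
        long lexNanos = System.nanoTime() - start;

        CiscoParser parser = new CiscoParser(tokens);
        start = System.nanoTime();
        parser.cisco_configuration(); // placeholder start rule
        long parseNanos = System.nanoTime() - start;

        System.out.printf("lexing: %.1f ms, parsing: %.1f ms%n",
                lexNanos / 1e6, parseNanos / 1e6);
    }
}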

sharwell commented 10 years ago

If you get rid of the semantic predicates in your lexer and replace them with lexer modes, your performance problem is completely resolved. Even before doing that, you can get roughly a 10:1 performance improvement by adjusting each lexer rule that contains a semantic predicate at the beginning of the rule so that the predicate appears only after a character has already been matched. For example, the following rule:

FOO
  : {stuff}? DIGIT+
  ;

Would instead be written like this:

FOO
  : DIGIT {stuff}? DIGIT*
  ;

arifogel commented 10 years ago

I followed your suggestions, and the speed improved greatly. Thank you for your help.