antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.16k stars 3.28k forks source link

fragment rules should be inlined in ATN #3685

Open KvanTTT opened 2 years ago

KvanTTT commented 2 years ago

fragment rules similar to inline functions in programming languages and they should be inlined during ATN generation (if possible, i.e. if they don't have loops). It should improve performance of generated lexers.

Grammar

lexer grammar G;
NUMBER: DIGIT+;
fragment DIGIT: [0-9];

Actual ATN

image

Expected ATN

No ref to DIGIT rule:

image

NUMBER token should be treated as NUMBER: [0-9]+;.

KvanTTT commented 2 years ago

@sharwell wrote:

This sometimes produces a higher performing grammar, but sometimes does not.

I would said in most cases inlining produces higher performing grammars (without prove).

I thought fragment rules are only needed for grammar developers because it's convenient to extract common part to get rid of duplication and developers expect such rules are being inlined. There is no access to internal structure of lexer rule, fragment rules are applicable only for lexer rules.

It can also have weird/unexpected side effects like changing the lookahead distance and recovery characteristics for certain syntax errors.

What else syntax errors you mean except of token recognition error at? I haven't seen anything else for lexer. Also, I don't completely understand what weird/unexpected side effect it can have compare to manual inlining. Inlining is only applicable for fragment rules without loops.

mike-lischke commented 2 years ago

Fragment rules also produce weird behavior when used in a parser. I don't remember many details, as I avoid that scenario by habit already for a long time.

However it all depends on what a fragment rule is publicly know as. Is it considered being a normal rule (with some special conditions applied) or really meant as an externalized rule part, just for convenience? In the first case it should also stay a separate rule (and show up as such in an ATN graph) otherwise inlining could help. But keep in mind, what's displayed as a rule in the graph is in reality just another bunch of ATN states (rule start/end states, and two basic states with a set transition).

parrt commented 2 years ago

@mike-lischke I don't think fragment rules are visible to the parser. They are only meant as helper rules for other Lexer rules. I'm not sure what inlining would do but it would likely expand the size of the ATN, like expanding macros. Given that we are creating DFAs, I also don't see what the performance improvement would be since there's no such thing as a fragment in the DFA. It automatically in lines

KvanTTT commented 2 years ago

@mike-lischke I don't think fragment rules are visible to the parser. They are only meant as helper rules for other Lexer rules.

Exactly. BTW, we have a feature request about fragment parser rules: https://github.com/antlr/antlr4/issues/1963. Anyway I think they should work similar to fragments for parser rules if implemented.

I'm not sure what inlining would do but it would likely expand the size of the ATN, like expanding macros.

Not significantly for most part of grammars if expands. Sometime inlining decreases number of states (the sample in the start message). And I think performance is more important than insignifical increasing of ATN.

Given that we are creating DFAs, I also don't see what the performance improvement would be since there's no such thing as a fragment in the DFA. It automatically in lines

It's not only about DFA but also about ATN matching and execution. Lexer makes a lot of calls to closure and getEpsilonTarget methods during recognition. execATN is called every time even if there is an initialized dfa state in decisionToDFA. The more states ATN have the more calls lexer executes, the more object is allocated (LexerATNConfig). For example, if reference to fragment rule exists, the following code is executed in execATN as well:

case Transition.RULE:
    RuleTransition ruleTransition = (RuleTransition)t;
    PredictionContext newContext =
        SingletonPredictionContext.create(config.context, ruleTransition.followState.stateNumber);
    c = new LexerATNConfig(config, t.target, newContext);
    break;

I can try to measure performance using benchmarks if abovementioned arguments are not enough.

parrt commented 2 years ago

I'm not sure I can agree completely with this analysis. Once we have converted an ATN chunk to DFA for a given input sequence, we never do anymore LexerATNConfig for that. We only touch the ATN again for different char sequences. For a fixed sequence like the following:

fragment A : [ab] ;
X : A 'x' ;

we would visit an extra state or two initially but never again.