antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.01k stars 3.26k forks source link

C++: Too many ATNConfig Allocations #4518

Closed dhaumann closed 4 months ago

dhaumann commented 8 months ago

We are using ANTLR to write a full interpreter. The language is a VBA dialect (oh my!), which includes some ambiguities. As such, ANTLR definitely has some work to do in order to parse 10000 lines of scripting code. But debugging shows, that for such large files, ANTLR needs >10 seconds for parsing, and essentially nothing for the interpreter step (includes already many optimizations, like avoiding the ParseTree any-interface).

Debugging shows: For these 10000 line script, ANTLR creates 50.000.000 ATNConfig instances, i.e. 50.000.000 allocations. The rule of thumb in C++ is: (Too many) allocations in performance critical code eventually kills all the performance.

Debugging this further by adding a global int counter that is increased in ATNConfig() and decreased in ~ATNConfig(), we can see that the counter reaches only a maximum number of 700543 ATNConfig instances at the same time (which is still a lot, but that certainly is grammar + input dependent).

But this raises the question: Can we safe the other ~49.000.000 allocations?

One trick we could apply here is a cache to reuse ATNConfigs:

std::list<ATNConfig> atnConfigs;
std::vector<ATNConfig*> freeAtnConfigs;

void release(ATNConfig& atnConfig)
{
    atnConfig.clear(); // call some ATNConfig cleanup code
    freeAtnConfigs.push_back(&atnConfig);
}

ATNConfig& get(... atn config params ...)
{
    ATNConfig* atnConfig;
    if (freeAtnConfigs.empty()) {
        atnConfigs.emplace_back(ATNConfig());
        atnConfig = &atnConfigs.back();
    } else {
        atnConfig = freeAtnConfigs.back();
        freeAtnConfigs.pop_back();
    }
    atnConfig->init(... atn config params ...);
    return *atnConfig;
}

Was there any attempt done in this direction? I believe with better memory management, the C++ ANTLR target could be 10 to 100 times faster (that's our goal).

PS: I'm aware of the usage of std::shared_ptr<>, so this somehow needs to be taken care of as well. PPS: I am entering this as issue, and not as discussion, since the memory allocations definitely lead to a performance issue.

kaby76 commented 7 months ago

The problem isn't Antlr, but the grammar. The vba.g4 grammar is very ambiguous. In order to resolve the parse, the grammar needs to fall back to full parser contexts, especially with expression rules like valueStmt. This causes an explosion of ATN configs. The parser is slow because of this.

I would recommend two possible paths forward:

andreasbuhr commented 7 months ago

Hi @kaby76, thanks for your reply. I am working with @dhaumann on the ANTLR based VBA-dialect interpreter.

The two points you mentioned are very valid, we found them and implemented them. We do not work with the vba.g4 provided in the ANTLR-Repos but wrote our own grammar from scratch. We worked with the VBA Specification provided by Microsoft and transformed its content into an ANTLR grammar. We already spent weeks optimizing it. We did never use whitespace in the parser. Transforming the "expression" into a chain of parser rules gave a good speed improvement. We looked at the ANTLR profiler results for quite some hours.

There are some ambiguities which we cannot avoid. Overall, we are quite happy with the result we have. We have correct results for the full body of test code that we have.

Now we see room for improvement in reducing the number of ATNConfig allocations done by ANTLR and we think this effort would also benefit the ANTLR community. It would be great if we could make this happen.

lppedd commented 7 months ago

Sorry, injecting myself here. If you have a grammar and a sample file we can use for benchmark/profiling, it would be great.
I've set up a benchmark framework in https://github.com/Strumenta/antlr-kotlin, which is also the base for ANTLR 5.

andreasbuhr commented 6 months ago

Hi, I created a small benchmark here: https://github.com/andreasbuhr/antlr4-vba-benchmark While this is not exactly what we do in our closed-source code, its performance characteristics are very similar. In VTune we see that it spends most of the time allocating memory: image

If you need anything more or a different benchmark, let me know. @lppedd

jimidle commented 6 months ago

This usually indicates that your grammar is very amiguous and therefore it is continuously going down parsing paths that it then has to backtrack on. When using your grammar, turn on the tracing which will show you when it drops in to full backtracking mode.

Then try to reduce your ambiguities. While ANTR4 will make a parser for just about any grammar, it does not mean that the grammar should not be thought about in terms of ambiguity.

andreasbuhr commented 6 months ago

@jimidle thanks for your help. We already spent weeks on optimizing the grammar. One problem is the inherent ambiguity in VBA. Unfortunately, VBA allows function calls without parentheses. So, given a function "doubleIt", the following expression is valid:

doubleIt 3 + 4

This results in an ambiguity, because both the intepretations doubleIt(3) + 4 and doubleIt(3 + 4) are possible. The 'right' interpretation is chosen through priorities.

As we already went down the path of optimizing the grammar for quite a while, now we'd like to optimize the ANTLR C++ code.

One thing we could try is to use an allocator. So using std::allocate_shared instead of std::make_shared. To this end, I could see myself create a pull request introducing a new function antlrcpp::make_shared, which just forwards to std::make_shared. Having this, we could experiment with different allocation options. Would this be appreciated?

kaby76 commented 6 months ago

Please verify that this is your grammar here: https://github.com/andreasbuhr/antlr4-vba-benchmark/tree/49319e7069ff85bfdc1cc8896ad1f092aaceb31d/grammar

For it, the tool reports the warning rule lExpression contains an optional block with at least one alternative that can match an empty string. That really is a bad thing. It kills performance. You shouldn't take any kind of closure for something that derives empty.

There is a similar warning for argumentList, and other errors in lexer rules.

andreasbuhr commented 6 months ago

@kaby76 as I already explained above, our grammar is closed-source. But the Rubberduck grammar shows similar performance problems when inspecting it with VTune.

andreasbuhr commented 6 months ago

I also fixed these warnings.

kaby76 commented 6 months ago

For your grammar, did you eliminate the ambiguity for expression on input 'IS NOTHING'? In other words, the "Rubberduck" (aka, the split grammar "VBA" at https://github.com/andreasbuhr/antlr4-vba-benchmark/tree/49319e7069ff85bfdc1cc8896ad1f092aaceb31d/grammar) has two derivations for parsing NOTHING from expression. Does you grammar fix that? Is there anything in common between this grammar and your closed grammar for expression, so that we can try to understand why you state that ambiguity cannot be eliminated?

(I also fixed these warnings. The VBALexer.g4 contains numerous errors as well, but that's not the problem here.)

andreasbuhr commented 6 months ago

Our grammar is very different from the Rubberduck grammar. We spent weeks optimizing. With our grammar, the benchmark code at https://github.com/andreasbuhr/antlr4-vba-benchmark/blob/f8c63bfa4d1a1de97952765180981bbf7abac93d/data/specs.bas parses in approx 500ms. It takes 3 to 5 seconds with the Rubberduck grammar.

Almost all ambiguities come from the fact that VBA allows function calls without parentheses. I already gave one example above. Another example is any function call with one argument: foo(bar) can be interpreted as a functioncall with parentheses on the argument "bar" or as a function call without parentheses on the argument "(bar)". This ambiguity goes away as soon as there are two or more arguments. But function calls with only one argument are a problem, as "(bar)" is a perfectly valid expression.

It gets worse if these occur in expressions: foo1(bar1) + foo2(bar2) could be the function "foo1" called on one argument, which is "(bar1) + foo2(bar2)".

We appreciate your efforts to help us improving our ANTLR grammar, but we came here because we see a lot of room for improvements in the C++ implementation of the ANTLR4 runtime and we hope that we can discuss these here. Improvements to the C++ runtime will benefit all ANTLR C++ users. We suspect that the management of ATNConfig objects within ParserATNSimulator can be implemented more efficiently.

andreasbuhr commented 6 months ago

Another example: foo + bar Can be the addition of "foo" and "bar". Or it is a call to the function "foo" on the argument "+ bar".

kaby76 commented 6 months ago

foo + bar. This could be parsed two ways: expr => ID + ID (binary operator) vs expr => func_call + ID =>+ ID + ID. Do you or not disambiguate via semantic predicates that perform a symbol table lookup? How does your grammar make a choice?

andreasbuhr commented 6 months ago

In VBA, it is possible that a function foo users another function bar where bar is defined below function foo. That's why we don't do symbol table lookup during parsing. We make the choice by giving the function call without parentheses always lowest priority.

jimidle commented 6 months ago

I wrote a VBA parser for ANTLR3, some of those ambiguities you need to solve in context rather than trying to get the lever and parser to do it. I will reiterate my main points:

Lexer: As simple as possible, and never causes error. Don't use modes unless embedded languages or very simple triggers Parser: Accepts anything that is valid syntax (or could be) and does not try to solve ambiguity Walker: Uses context to solve ambiguities (in VBA you end up needing a knowledge base)

Defer reporting errors until as high up the chain as possible. Semntic based errors from the walker are usually much better than parser errors, which are better than lever errors (you should not have any of those).

My VBA parser in ANTLR 3 did not need more than two token looked IIRC.

Weeks optimizing a grammar seems a long time. Are you using the various reporting modes at runtime to see where your grammar is going? Also, have you tested the lever on its own?

Jim

On Mar 27, 2024, at 06:24, Andreas Buhr @.***> wrote:

Another example: foo + bar Can be the addition of "foo" and "bar". Or it is a call to the function "foo" on the argument "+ bar".

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/4518#issuecomment-2022639336, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7TMBVGMIUF4SZCBRRDXLY2KT7DAVCNFSM6AAAAABCL2QGJGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRSGYZTSMZTGY. You are receiving this because you were mentioned.

andreasbuhr commented 6 months ago

@jimidle thanks a lot for your comment.

How does a function call then look like in the grammar if it doesn't resolve the ambiguities in the parser? Is it just "identifier expression" and inside of the expression you have a comma-operator to collect the arguments?

jimidle commented 6 months ago

I would have to check what I did for the ANTLR 3 grammar, but yes, that sounds about right. Then you walk the tree more than once. Once to build a symbol table such as funds, then a second time to resolve what it is. I cannot remember now if I had to pre-supply a known function table or parsed all the sources in a directory etc. I also had to do VBScriipt at the time, which was almost impossible to do as a parser and not an interpreter

Jim

On Mar 27, 2024, at 09:10, Andreas Buhr @.***> wrote:

@jimidle https://github.com/jimidle thanks a lot for your comment.

How does a function call then look like in the grammar if it doesn't resolve the ambiguities in the parser? Is it just "identifier expression" and inside of the expression you have a comma-operator to collect the arguments?

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/4518#issuecomment-2023018985, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7TMEIJ66J6P626QQJHWDY2LHNHAVCNFSM6AAAAABCL2QGJGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRTGAYTQOJYGU. You are receiving this because you were mentioned.

jimidle commented 4 months ago

@parrt I think we can close this