tangxing806 opened this issue 2 years ago
Have you tried running the parse several times? As far as I know, C++ has a very long warmup time (maybe the longest of all targets), but after that it generally works very fast.
Is there any way to avoid warmup time?
I don't think so because ANTLR uses an internal cache for effective parsing. Maybe @mike-lischke could help.
Yes, measuring a warm parser is the best option. The C++ target is orders of magnitude faster than any other target, except on warmup.
BTW, out of curiosity, why is warmup so slow in C++?
C++ warmup currently occurs during static initialization of the program, before main is invoked.
I would be curious to see what gperftools or valgrind show for the C++ runtime and where most of the time is spent.
> C++ warmup occurs during static initialization of the program currently, before main is invoked.
To be more accurate, warmup occurs as the parser encounters text that matches rules not yet encountered. A good performance test is to parse the same text twice, the second pass is guaranteed to be as fast as it can be (since the first pass performed all necessary warmups).
After tweaking the cmake files for my driver generator for 4.9.3 C++ with this absolutely critical tip, I ran profiling for grammars-v4/html/examples/abc.com.html for a side-by-side comparison of Java vs C++.
Attached are the hot-spot profiles from gprof (for the g++ build) and the IntelliJ profiler (for Java): gprof.txt, jprof.txt
Not surprisingly, ParserATNSimulator.closure_() takes up most of the time. Eyeballing the source, I do see some differences: for example, there is `continue` control flow in the C++ version that has no counterpart at the corresponding place in the Java version (the original links to the exact lines were lost here). I could go on, but clearly it's comparing apples and oranges: the code is not doing the same computation.
An accurate way of detecting differences is by counting (with the same grammar and input) the number of calls to closure_ in Java and another target (logged by enabling some debug flags). They should be exactly the same.
Given what Eric wrote: there's not really a full warm up of a parser or lexer. As soon as you feed different input the DFA has to be updated, which is what takes so long (producing the ATN configurations, modifying the synchronized DFA cache etc.).
@KvanTTT Because the DFA cache is shared among all parser instances for a given grammar (and the same holds true for their lexers), access to it must be synchronized to avoid thread races (this is, btw., the only place in the runtime that is thread safe). Additionally, because of the mesh-like structure of ATN configurations, the C++ target needs to use std::shared_ptr. Those are relatively slow to create, and in the warmup phase many of them are created and destroyed. All this together makes the warmup phase pretty slow, and the more complex the first input is, the longer it takes.
@kaby76 The closure function is part of a heavily recursive set of methods, used for the prediction algorithm (which also creates the ATN configs). And it uses shared_ptr a lot. I had figured a long time ago this to be a good candidate for optimization, but it's not easy to avoid memory leaks there and you need to know the ALL(*) algorithm pretty well to avoid breaking the prediction implementation, when you rework this.
FYI, before https://github.com/antlr/antlr4/commit/18f24717c81cc48856d332e2c9492ceaaf17eb1a ParserATNSimulator did some unnecessary std::shared_ptr copying in a for loop, as did some other places. I do not think that commit is in 4.9.3.
@mike-lischke I am definitely not planning on changing any of this code, but I am interested in performance issues more generally, at the grammar level. The ATN code is tricky, as I found out when writing my own backtracking parser that uses ANTLR ATNs to compute a set of lookahead tokens, analogous to your antlr4-c3 code. I am writing a transformational system for parser-generator grammars, including ANTLR's, for analyzing and extending grammars (e.g., automatic removal of mutual left recursion, re-adding the ^/! tree-construction operators in a superset of ANTLR4 syntax) and implementing a layer for specifying semantics.
version: ANTLR 4.9.2
file size: 113 MB
C++ runtime: 156 s
Java runtime: 39 s
I am currently using ANTLR 4.9.2 to develop a parser; the runtime language I chose is C++.
When I switched to the Java platform (parser and runtime), I found that the speed increased by 4x.
Why is the C++ runtime 4x slower than the Java runtime? It seems strange to me.
Finally, I want to know how to improve parsing speed in C++.
Thanks!