antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Concurrent Parsing Bottlenecks #4344

Open gchallen opened 1 year ago

gchallen commented 1 year ago

I'm running into a similar slowdown to the one reported a few years ago here: https://github.com/antlr/antlr4/issues/2454.

Specifically, I'm using ANTLR4 for a variety of syntax analysis tasks in a batch-processing tool that tries to leverage a high core-count machine to improve performance. Unfortunately, as I increase the number of cores, both the shared DFA cache and the prediction context cache can wind up as bottlenecks, frequently leaving N - 1 of the available cores idle; a stack dump shows them all stuck waiting for access to the shared maps. addDFAState is a particularly problematic spot.

So at present I'm essentially passing individual copies of each of these caches to each newly-created parser, like so:

import org.antlr.v4.runtime.Parser
import org.antlr.v4.runtime.atn.ParserATNSimulator
import org.antlr.v4.runtime.atn.PredictionContextCache
import org.antlr.v4.runtime.dfa.DFA

// Replace the interpreter with one backed by a fresh (empty) DFA per decision
// and a private PredictionContextCache, so this parser never touches the
// caches shared with other threads.
fun Parser.makeThreadSafe() {
    interpreter = ParserATNSimulator(
        this,
        this.atn,
        interpreter.decisionToDFA.mapIndexed { i, _ -> DFA(atn.getDecisionState(i), i) }.toTypedArray(),
        PredictionContextCache()
    )
}
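
For context, a hypothetical call site looks like the following; GrammarLexer, GrammarParser, startRule, and sourceText are stand-ins for whatever generated lexer, parser, entry rule, and input you actually use:

import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream

// Hypothetical usage: build a fresh parser per input, then opt out of the
// shared caches before parsing.
val lexer = GrammarLexer(CharStreams.fromString(sourceText))
val parser = GrammarParser(CommonTokenStream(lexer))
parser.makeThreadSafe()
val tree = parser.startRule()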

This is particularly problematic with grammars that have poor performance, such as the official Kotlin G4 grammar, which can initially take multiple seconds to process even small inputs. Normally the caches mentioned above yield an order-of-magnitude speedup once the grammar gets warm. However, keeping all cores active requires disabling those caches, which in my testing results in a roughly 3x slowdown on at least some inputs. This is in a data-processing pipeline that runs for days, so that 3x does end up hurting a bit.

So I'm in a pickle. If I were optimizing primarily for single-core performance I'd use the caches and live with the (infrequent) bottlenecks. But for a concurrent workload I'm stuck choosing between 1 fast(er) parser and N slow(er) parsers.

I took a look at the optimized ANTLR4 fork, which apparently maintains these shared structures lock-free. However, the official optimized fork is still on v4.9, the most recently maintained fork of that fork is only up to v4.11, and both versions cause compatibility problems with other tools used in our project, specifically checkstyle, which expects a different ATN serialization version.

I've had a few other ideas about how to address this but I'm not sure whether they'd work properly. One would be to maintain N copies of the caches and let each thread check out one that's not in use. This would cause them to warm up more slowly, but that might be an OK tradeoff, since at least for the Kotlin grammar the performance improves quite quickly. Unfortunately this is slightly complicated by the fact that we use multiple different parsers in the project, and I assume that the decisionToDFA array is parser-specific. It might work, but I'd be happier if someone has tried something like this and can report that it can in fact be done.
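
To make that idea concrete, here is a minimal sketch of the check-out/check-in pooling I have in mind, keyed by parser class since the decisionToDFA array is ATN-specific; CachePair, CachePool, and withCaches are hypothetical names, not runtime API:

import org.antlr.v4.runtime.Parser
import org.antlr.v4.runtime.atn.ParserATNSimulator
import org.antlr.v4.runtime.atn.PredictionContextCache
import org.antlr.v4.runtime.dfa.DFA
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.ConcurrentLinkedDeque

// One reusable cache pair; all instances of a generated parser class share
// the same ATN, so pooling per class keeps the DFAs compatible.
class CachePair(val decisionToDFA: Array<DFA>, val contextCache: PredictionContextCache)

object CachePool {
    private val pools = ConcurrentHashMap<Class<out Parser>, ConcurrentLinkedDeque<CachePair>>()

    fun <T> withCaches(parser: Parser, block: () -> T): T {
        val pool = pools.computeIfAbsent(parser.javaClass) { ConcurrentLinkedDeque() }
        // Check out a warm pair if one is free; otherwise start a cold one.
        val pair = pool.pollFirst() ?: CachePair(
            Array(parser.atn.numberOfDecisions) { i -> DFA(parser.atn.getDecisionState(i), i) },
            PredictionContextCache()
        )
        parser.interpreter = ParserATNSimulator(parser, parser.atn, pair.decisionToDFA, pair.contextCache)
        try {
            return block()
        } finally {
            pool.addFirst(pair) // return the now-warmer pair for the next thread
        }
    }
}

With at most N threads parsing concurrently, the pool never holds more than N pairs per parser class, and each pair keeps warming as it gets reused.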

Another option would be to allow each parser to grab its own copy of the caches on startup and then let them update them atomically once they're done. This would also cause the caches to warm more slowly, but at least each thread would get the benefits of some of the prior parsing passes. The problem here is that I'm unsure how to copy either shared structure, and they don't seem to be set up to do this even in memory. (I'm not worried about saving state across restarts.)
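
The protocol itself is simple to state; the blocker is precisely the copy step. In this sketch, CacheSnapshot and deepCopy are hypothetical: as far as I can tell, nothing in the Java runtime offers a copy operation for DFA or PredictionContextCache, which is exactly the missing piece:

import org.antlr.v4.runtime.atn.ATN
import org.antlr.v4.runtime.atn.PredictionContextCache
import org.antlr.v4.runtime.dfa.DFA
import java.util.concurrent.atomic.AtomicReference

class CacheSnapshot(val decisionToDFA: Array<DFA>, val contextCache: PredictionContextCache) {
    // Hypothetical: the runtime exposes no way to clone either structure today.
    fun deepCopy(): CacheSnapshot = TODO("unsupported by the current runtime")
}

fun coldSnapshot(atn: ATN) = CacheSnapshot(
    Array(atn.numberOfDecisions) { i -> DFA(atn.getDecisionState(i), i) },
    PredictionContextCache()
)

class SharedCaches(atn: ATN) {
    private val latest = AtomicReference(coldSnapshot(atn))

    fun <T> parseWithPrivateCopy(block: (CacheSnapshot) -> T): T {
        val mine = latest.get().deepCopy() // start from the warmest published state
        val result = block(mine)           // parse without touching shared state
        latest.set(mine)                   // publish atomically; last writer wins
        return result
    }
}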

You could also run a few test inputs during warmup to populate the prediction caches and then at some point just start using copies and discarding future updates. But this would also require being able to copy these structures.

Any help or guidance would be appreciated! Thanks in advance. ANTLR is a great project, and if anything these bottlenecks are a sign of how often we're using it throughout our codebase...

kaby76 commented 1 year ago

I have a few questions, and some observations.

Questions:

  1. Which Kotlin grammar are you using?
  2. Which target are you using, and which optimized fork are you referring to?

Observations:

I removed testing of the grammar in the desc.xml for kotlin/kotlin, so officially it doesn't work. But I just tested it, and it does work, just very slowly. For kotlin/kotlin-formal, while the desc.xml says it works for most targets, its performance is even worse than kotlin/kotlin's.

07/03-06:00:50 ~/issues/g4-3539/kotlin/kotlin/Generated-CSharp
$ ./bin/Debug/net7.0/Test.exe examples/Test.kt
CSharp 0 examples/Test.kt success 1.9808688
Total Time: 2.0724118
07/03-06:00:58 ~/issues/g4-3539/kotlin/kotlin/Generated-CSharp
$ pushd
~/issues/g4-3539/kotlin/kotlin-formal/Generated-CSharp ~/issues/g4-3539/kotlin/kotlin/Generated-CSharp
07/03-06:01:01 ~/issues/g4-3539/kotlin/kotlin-formal/Generated-CSharp
$ ./bin/Debug/net7.0/Test.exe examples/Test.kt
CSharp 0 examples/Test.kt success 5.6004534
Total Time: 5.6918796
07/03-06:01:09 ~/issues/g4-3539/kotlin/kotlin-formal/Generated-CSharp
$

The kotlin/kotlin grammar has a lot of ambiguities. This is where one should first focus in order to speed up the parser.

Decision  Rule                        Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors  Transitions
0         kotlinFile                  4            0.22318   23       5      3         3            0       13
1         kotlinFile                  0            0         0        0      0         0            0       0
2         kotlinFile                  336          0.699777  523      1      187       187          0       200
3         kotlinFile                  0            0         0        0      0         0            0       0
4         kotlinFile                  0            0         0        0      0         0            0       0
5         kotlinFile                  0            0         0        0      0         0            0       0
6         script                      0            0         0        0      0         0            0       0
7         script                      0            0         0        0      0         0            0       0
8         script                      0            0         0        0      0         0            0       0
...

According to the generated .dot file, there's a loop containing anysemi for one transition and epsilon for all others for decision 3 of the kotlinFile production (314 -> 315 -> 317 -> 320 -> 318 -> 316 -> 309 -> 308 -> 310 -> 311 -> 312 -> 314). Additionally, there's a loop involving decision 2 with anysemi (311 -> 309 -> 308 -> 310 -> 311). This is terrible because there is a combinatorial explosion of choices between decisions 2 and 3! (NB to myself: the Trash toolkit should try to find SCCs that involve the same symbol and epsilon transitions. When two cycles share a common path through the graph, it should be flagged as disastrous.) The rule is written poorly.

[Graphviz rendering of the kotlinFile decision states]

The rule is:

kotlinFile
    : NL* preamble anysemi* (topLevelObject (anysemi+ topLevelObject?)*)? EOF
    ;

Here we see that we are taking the closure of a closure: when topLevelObject is omitted, (anysemi+ topLevelObject?)* degenerates to (anysemi+)*, so a run of anysemi tokens can be split across iterations in exponentially many ways.

Changing the rule to

kotlinFile
    : NL* preamble anysemi* (topLevelObject (anysemi+ topLevelObject)*)? EOF
    ;

makes the parser only very modestly faster.

The real problem involves type:

07/03-09:12:35 ~/issues/g4-3539/kotlin/kotlin/Generated-CSharp
$ trperf examples/Test.kt | tail -n +2 | sort -k4 -n -r | column -t
Time to parse: 00:00:02.0384441
193  type                        370  4.643815  3696  2434  8    0   0  1163
52   delegationSpecifier         30   1.569024  721   488   21   8   0  639
268  postfixUnaryExpression      525  1.226679  1347  79    107  22  0  500
211  simpleUserType              390  0.759468  635   25    12   0   0  118
192  type                        370  0.728245  507   32    0    0   0  122
233  statement                   220  0.69018   546   17    37   37  0  260
...

Here, it cannot decide which of the 4 paths to choose at decision 193.

[Graphviz rendering of decision 193 in the type rule]

And here we see why this is taking so long: the max-k is 2434. In other words, the parser has to read ahead 2434 tokens to make a single decision. This is terrible.

Looking over the rules for functionType, parenthesizedType, nullableType, and typeReference, we see why. There are two conflicting derivations for type: type => nullableType => typeReference and type => typeReference. This comes directly from the official grammar for Kotlin.

It has always been my assertion that the reason people claim hand-written parsers are faster than generated parsers is that grammars are scraped without the critical semantic predicates, or legitimate errors creep in when the parser is reverse engineered back into a grammar. I cannot say which is the case here.

There is an official grammar in the kotlin-spec repo: https://github.com/Kotlin/kotlin-spec/tree/release/grammar/src/main/antlr. Unfortunately, it doesn't even parse Test.kt. But it, too, contains the type production problem.

It also looks like the grammar doesn't correspond to the parser code: the comments there describe a type rule that is completely different from both the official grammar and the parser code. You just cannot trust people to faithfully reverse engineer a parser back into a grammar.

gchallen commented 1 year ago

Thanks for the reply.

FWIW, I decided to do per-thread caching of the ANTLR shared caches. That's working nicely, balancing the speedup obtained by warming up the parser against the removal of the sharing bottlenecks we were experiencing previously.
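
In case it's useful to anyone else, here is a minimal sketch of the shape of it; useThreadLocalCaches is a hypothetical name, and the map is keyed by parser class because each generated parser class has its own ATN:

import org.antlr.v4.runtime.Parser
import org.antlr.v4.runtime.atn.ParserATNSimulator
import org.antlr.v4.runtime.atn.PredictionContextCache
import org.antlr.v4.runtime.dfa.DFA

// Each thread keeps its own warm caches per parser class: no cross-thread
// contention, and the caches persist across parses on the same thread.
private val threadCaches = ThreadLocal.withInitial {
    mutableMapOf<Class<out Parser>, Pair<Array<DFA>, PredictionContextCache>>()
}

fun Parser.useThreadLocalCaches() {
    val (dfa, ctx) = threadCaches.get().getOrPut(javaClass) {
        Array(atn.numberOfDecisions) { i -> DFA(atn.getDecisionState(i), i) } to PredictionContextCache()
    }
    interpreter = ParserATNSimulator(this, atn, dfa, ctx)
}

New threads start cold, but after the first parse or two on each thread the warm-up benefit comes back, with no contention.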

Responding to your questions.

  1. We're using the official Kotlin grammar: https://kotlinlang.org/docs/reference/grammar.html. As I mentioned on the issue you opened regarding the Kotlin grammar, we've made a few tweaks to it to correct obvious mistakes. I'll see if it can parse the test file you mention, and use that to run down any other lingering issues.
  2. We're using the Java ANTLR runtime. I was referencing this optimized fork: https://github.com/tunnelvisionlabs/antlr4. IIRC the documentation indicated that it only supported Java and a few other ANTLR targets. It did seem to work, but it caused too much breakage with other libraries that use more recent versions of ANTLR for me to integrate it fully, so I was never able to test it on our production workload and observe its scaling behavior.

Overall I agree that the Kotlin grammar's performance is poor and a contributing cause of the scaling problems we're experiencing. I'd be happy to work on improvements, but in the short term I just want to get our tool, which uses the slow but correct-ish parser, to run a bit faster.