antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Grammar that parses in milliseconds in Java runtime, takes many seconds in python #1219

Open pavelvelikhov opened 8 years ago

pavelvelikhov commented 8 years ago

I have a grammar of Python3 (from the antlr4 grammar repository), extended with query language constructs. The grammar file is here: Grammar file

The whole project is here: PythonQL

This tiny program parses in milliseconds with the Java runtime, but takes about 1.5 seconds in python (after the recent fix, before it was over 2 seconds).

# This example illustrates the window query in PythonQL

from collections import namedtuple
trade = namedtuple('Trade', ['day','ammount', 'stock_id'])

trades = [ trade(1, 15.34, 'APPL'),
           trade(2, 13.45, 'APPL'),
           trade(3, 8.34,  'APPL'),
           trade(4, 9.87,  'APPL'),
           trade(5, 10.99, 'APPL'),
           trade(6, 76.16, 'APPL') ]

# Maximum 3-day sum

res = (select win
        for sliding window win in ( select t.ammount for t in trades )
        start at s when True
        only end at e when (e-s == 2))

print (res)

Here is a profiler trace just in case (I left only the relevant entries):

  ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      21    0.000    0.000    0.094    0.004  PythonQLParser.py:7483(argument)
       8    0.000    0.000    0.195    0.024  PythonQLParser.py:7379(arglist)
       9    0.000    0.000    0.196    0.022  PythonQLParser.py:6836(trailer)
     5/3    0.000    0.000    0.132    0.044  PythonQLParser.py:6765(testlist_comp)
       1    0.000    0.000    0.012    0.012  PythonQLParser.py:6154(window_end_cond)
       1    0.000    0.000    0.057    0.057  PythonQLParser.py:6058(sliding_window)
       1    0.000    0.000    0.057    0.057  PythonQLParser.py:5941(window_clause)
       1    0.000    0.000    0.004    0.004  PythonQLParser.py:5807(for_clause_entry)
       1    0.000    0.000    0.020    0.020  PythonQLParser.py:5752(for_clause)
     2/1    0.000    0.000    0.068    0.068  PythonQLParser.py:5553(query_expression)
   48/10    0.000    0.000    0.133    0.013  PythonQLParser.py:5370(atom)
    48/7    0.000    0.000    0.315    0.045  PythonQLParser.py:5283(power)
    48/7    0.000    0.000    0.315    0.045  PythonQLParser.py:5212(factor)
    48/7    0.000    0.000    0.331    0.047  PythonQLParser.py:5132(term)
    47/7    0.000    0.000    0.346    0.049  PythonQLParser.py:5071(arith_expr)
    47/7    0.000    0.000    0.361    0.052  PythonQLParser.py:5010(shift_expr)
    47/7    0.000    0.000    0.376    0.054  PythonQLParser.py:4962(and_expr)
    47/7    0.000    0.000    0.390    0.056  PythonQLParser.py:4914(xor_expr)
    47/7    0.000    0.000    0.405    0.058  PythonQLParser.py:4866(expr)
    44/7    0.000    0.000    0.405    0.058  PythonQLParser.py:4823(star_expr)
    43/7    0.000    0.000    0.422    0.060  PythonQLParser.py:4615(not_test)
    43/7    0.000    0.000    0.438    0.063  PythonQLParser.py:4563(and_test)
    43/7    0.000    0.000    0.453    0.065  PythonQLParser.py:4509(or_test)
    43/7    0.000    0.000    0.467    0.067  PythonQLParser.py:4293(old_test)
    43/7    0.000    0.000    0.467    0.067  PythonQLParser.py:4179(try_catch_expr)
    43/7    0.000    0.000    0.482    0.069  PythonQLParser.py:3978(test)
       1    0.000    0.000    0.048    0.048  PythonQLParser.py:2793(import_from)
       1    0.000    0.000    0.048    0.048  PythonQLParser.py:2702(import_stmt)
       7    0.000    0.000    1.728    0.247  PythonQLParser.py:2251(testlist_star_expr)
       4    0.000    0.000    1.770    0.443  PythonQLParser.py:2161(expr_stmt)
       5    0.000    0.000    1.822    0.364  PythonQLParser.py:2063(small_stmt)
       5    0.000    0.000    1.855    0.371  PythonQLParser.py:1980(simple_stmt)
       5    0.000    0.000    1.859    0.372  PythonQLParser.py:1930(stmt)
       1    0.000    0.000    1.898    1.898  PythonQLParser.py:1085(file_input)
     176    0.002    0.000    0.993    0.006  Lexer.py:127(nextToken)
     420    0.000    0.000    0.535    0.001  ParserATNSimulator.py:1120(closure)
     705    0.003    0.000    1.642    0.002  ParserATNSimulator.py:315(adaptivePredict)

I have attached a file that parses for 7 seconds on my Macbook Pro as well.

I'd be happy to reduce this case to a minimal case for debugging, but don't really know where to start.

The grammar doesn't seem to have any problems like ambiguity, etc.
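For reference, a trace like the one above comes from the standard Python profiler; a generic harness such as the following (a stdlib-only sketch — the profiled lambda is a stand-in for whatever constructs the ANTLR parser and invokes the start rule, not an actual ANTLR API) reproduces this kind of report:

```python
import cProfile
import io
import pstats

def profile_call(fn, top=10):
    """Profile fn() and return the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()

# Stand-in workload; for ANTLR this would be something like
# lambda: PythonQLParser(token_stream).file_input()
report = profile_call(lambda: sum(i * i for i in range(100_000)))
print(report)
```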

SimonSntPeter commented 5 years ago

@sharwell I didn't know that, and thanks, but the grammar here is an SQLite grammar, kindly bundled for me by @ricardojuanpalmaduran (see https://github.com/antlr/antlr4/issues/1219#issuecomment-452044969 above), and the only non-ANTLR stuff it has is some error reporting, not any predicates AFAICS. I just commented that error bit out and it runs the same. Playing with the grammar, I can make it run faster by deleting bits, but I just can't see why. It doesn't behave the way a handwritten RD parser would be expected to, and its behaviour is opaque to me, so I'm throwing in the towel at this level and will try to understand the code it outputs. I'd appreciate some help understanding bits of the ALL(*) paper if that's possible? Basic stuff. FYI, per the Python profiler, for a run as mentioned above: of the full 3.6 secs, adaptivePredict is called 5,991 times and takes about 3 seconds. That's after my 'patch'; before it, the run was 14.9 secs overall and adaptivePredict had 5,992 calls totalling 14.4 secs for that func alone.

sharwell commented 5 years ago

@SimonSntPeter Thanks, that makes sense. The number of calls to adaptivePredict isn't cause for concern. This is an extremely heavily-used method, even in DFA scenarios. Can you break down the 14.4 seconds by which method(s) adaptivePredict calls?

SimonSntPeter commented 5 years ago

Sure. For the original 14.9 sec run, where cumtime is the cumulative time spent in that function and all the functions it calls, per the Python profiler, the calls seem to be:

adaptivePredict (5,992 calls, cumtime 14.4 secs)
-- which calls -- execATN (5,992 calls, cumtime 14.3 secs)
-- which calls -- execATNWithFullContext (1,431 calls, cumtime 12.6 secs)
-- which calls -- closure (24,248 calls, cumtime 12.1 secs)
-- which calls -- closureCheckingStopState (1,445,631 calls (24,248 of these recursive), cumtime 12.04 secs)
-- and also -- closure_ (1,378,886 calls (24,248 recursive), cumtime 12.01 secs)

This sequence of calls is what I found by tracing through manually.

There's also the following, but I don't know where they fit in:

computeReachSet (7,459 calls, cumtime 11.8 secs)
getEpsilonTarget (1,817,881 calls, cumtime 4.2 secs)
computeStartState (468 calls (364 recursive), cumtime 1.1 secs)
ruleTransition (199,361 calls, cumtime 1.1 secs)
hashCodeForConfigSet (483,798 calls, cumtime 0.8 secs)
PredictionContext.merge (90,318 calls (82,783 recursive), cumtime 0.74 secs)
PredictionContext.create (209,286 calls, cumtime 0.7 secs)
PredictionMode.getConflictingAltSubsets (7,422 calls, cumtime 0.55 secs)

After that what's not grammar rules is mostly lexer overheads.

I'm not familiar with the Python profiler, so if anything looks not credible, shout and I'll double-check it. I'd trust the call counts a bit more than the reported times.

On UK time so will pick up tomorrow.

SimonSntPeter commented 5 years ago

NB if you look at @ricardojuanpalmaduran's bundled code, the SQL being parsed is basically two things, a create table statement and a create view statement, both copied and pasted a dozen times, so once you've parsed one of each the rest should be trivial regarding the adaptive stuff (= lookups straight from the DFA cache?)

amykyta3 commented 4 years ago

Hey everyone. Posting here since it may be useful to people that find this issue thread.

@parrt and @thisiscam's comments about using Cython inspired me to put together an accelerator module for ANTLR. It isn't implemented using Cython; instead it wraps the ANTLR C++ target as a Python extension. Lexing & parsing are done exclusively in C++, and then an auto-generated visitor is used to re-build the resulting parse tree in Python. Initial tests show a 5x-25x speedup depending on the grammar and input, and I have a few ideas on how to improve it further.

Here is the code-generator tool: https://github.com/amykyta3/speedy-antlr-tool And here is a fully-functional example: https://github.com/amykyta3/speedy-antlr-example

It is still a work in progress, and I hope to fill in more documentation details this weekend.

JameelNabbo commented 4 years ago

Is this issue sorted?

parrt commented 4 years ago

@JameelNabbo nope, not as far as i know.

data-harmonization commented 3 years ago

I see similar behaviour with a query of mine. Unfortunately, I'm not allowed to share my g4 file, but I hope the following explanation will help pinpoint where the issue likely lies.

Behaviour: My input file has one statement of 1,246 lines and 9,642 tokens that I use as a test case. When I paste the command into the ANTLR Preview in PyCharm (ANTLR plugin), I get results in less than 10 seconds (I assume the Java parser is used here). When I read the same command using the ANTLR BufferedTokenStream and use the Python3 ANTLR parser, the following RecursionError is raised after roughly 5 minutes.

    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1187: in closure_
    self.closureCheckingStopState(c, configs, closureBusy, continueCollecting, fullCtx, newDepth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1136: in closureCheckingStopState
    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1187: in closure_
    self.closureCheckingStopState(c, configs, closureBusy, continueCollecting, fullCtx, newDepth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1136: in closureCheckingStopState
    self.closure_(config, configs, closureBusy, collectPredicates, fullCtx, depth, treatEofAsEpsilon)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ParserATNSimulator.py:1143: in closure_
    configs.add(config, self.mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/atn/ATNConfigSet.py:91: in add
    merged = merge(existing.context, config.context, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:263: in merge
    return mergeSingletons(a, b, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:324: in mergeSingletons
    parent = merge(a.parentCtx, b.parentCtx, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:278: in merge
    return mergeArrays(a, b, rootIsWildcard, mergeCache)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:512: in mergeArrays
    merged = ArrayPredictionContext(mergedParents, mergedReturnStates)
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:184: in __init__
    super().__init__(calculateListsHashCode(parents, returnStates))
/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:77: in calculateListsHashCode
    h = hash((h, calculateHashCode(parent, returnState)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

parent = <antlr4.PredictionContext.ArrayPredictionContext object at 0x132697af0>
returnState = 1124

    def calculateHashCode(parent:PredictionContext, returnState:int):
>       return hash("") if parent is None else hash((hash(parent), returnState))
E       RecursionError: maximum recursion depth exceeded

/Users/p.stapersma/.pyenv/versions/3.8.7/lib/python3.8/site-packages/antlr4/PredictionContext.py:72: RecursionError

Source of the issue: While there are only 9,642 tokens, the Python parser reads a total of 341,253 tokens. A visualization of this pattern is shown in this graph:

https://acheron.cloud/wp-content/uploads/2021/07/antlr_python_graph.png

In particular, the function execATN in ParserATNSimulator.py was going over the tokens again and again.

Can this information help explain why the ANTLR Python parser is slower than its Java counterpart?

parrt commented 3 years ago

Hi. Yep, unfortunately Python is not particularly efficient with recursion and does not allow a great deal of depth. Maybe increase it? https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
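A minimal sketch of that workaround (stdlib only; the 100,000 limit and 64 MiB stack here are arbitrary illustrations, not recommendations — deeply recursive ALL(*) prediction can overflow the OS thread stack even after the interpreter limit is raised, so the two usually go together):

```python
import sys
import threading

def run_with_deep_recursion(fn, limit=100_000, stack_bytes=64 * 1024 * 1024):
    """Run fn() in a worker thread with a raised recursion limit.

    Raising sys.setrecursionlimit alone is often not enough: the OS
    thread stack can still overflow, and threading.stack_size only
    affects threads started after it is set.
    """
    result = {}

    def worker():
        old = sys.getrecursionlimit()
        sys.setrecursionlimit(limit)
        try:
            result["value"] = fn()
        finally:
            sys.setrecursionlimit(old)

    threading.stack_size(stack_bytes)  # must be set before Thread() starts
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]
```

Here `fn` would be the call that triggers the parse (e.g. a lambda invoking the generated parser's start rule).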

data-harmonization commented 3 years ago

Thank you for the suggestion! I will certainly have a look at this.

I'm still a bit flabbergasted by the number of tokens ANTLR visits in the execATN function. I'm wondering why, and whether the Java build requires the same number of tokens to be visited. If I have some time over the coming weekend, I'll have a look at this.

parrt commented 3 years ago

Yep, lookahead can be HUUUUGE. it's the nature of ALL(*).

data-harmonization commented 3 years ago

Thank you for the clarification! That explains the high number. Makes a lot more sense now!

KvanTTT commented 2 years ago

Performance can be improved for Python by using constant folding during generation, especially for large grammars; see https://github.com/antlr/antlr4/issues/3698

SimonStPeter commented 2 years ago

Any idea how much it would improve a real grammar if implemented?

BTW, perhaps easier than folding constant exprs would be just to hoist them into vars evaluated once:

def no_folding_const_refs_test():
    return (1 << C1) | (1 << C2) | (1 << C3) | (1 << C4) | (1 << C5) | (1 << C6) | \
           (1 << C7) | (1 << C8) | (1 << C9) | (1 << C10) | (1 << C11) | (1 << C12) | \
           (1 << C13) | (1 << C14) | (1 << C15) | (1 << C16)

instead

_evaluated_713 = (1 << C1) | (1 << C2) | (1 << C3) | (1 << C4) | (1 << C5) | (1 << C6) | \
                 (1 << C7) | (1 << C8) | (1 << C9) | (1 << C10) | (1 << C11) | (1 << C12) | \
                 (1 << C13) | (1 << C14) | (1 << C15) | (1 << C16)
...
def no_folding_const_refs_test():
    return _evaluated_713

folding is easy but maybe a bit messy, hoisting should be much easier with the same effect, I guess. Would it help here?

jan

On 07/05/2022 14:22, Ivan Kochurkin wrote:

Performance can be improved for Python especially for large grammars, see #3698 https://github.com/antlr/antlr4/issues/3698


KvanTTT commented 2 years ago

Any idea how much it would improve a real grammar if implemented?

I don't know how much, but it should help anyway. It looks like the code is located on quite a hot path. The simplest case:

grammar Test;
e: T1 | T2 | T3;
T1: 't1';
T2: 't2';
T3: 't3';

The more alternatives a rule has, the more operations are required. For instance, SQL grammars contain a lot of alternatives.
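The shape of the proposed saving can be sketched as follows (hypothetical code, not actual ANTLR output; the token constants are invented for illustration):

```python
# Token type constants as a generated parser would define them (invented values).
T1, T2, T3 = 1, 2, 3

def in_set_unfolded(la):
    # Shape of the current generated code: the bitmask expression is
    # rebuilt from shifts and ors on every single call.
    return ((1 << la) & ((1 << T1) | (1 << T2) | (1 << T3))) != 0

# Folded/hoisted variant: the mask is computed once, at import time.
_E_MASK = (1 << T1) | (1 << T2) | (1 << T3)

def in_set_folded(la):
    return ((1 << la) & _E_MASK) != 0
```

Both predicates are equivalent; the second simply moves the constant arithmetic out of the per-token hot path.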

BTW perhaps easier than folding constant exprs it would be just to hoist them into vars evaluated once:

The vars would also have to be stored somewhere. Also, vars are not as efficient as const literals, because Python has no idea about constants.

folding is easy but maybe a bit messy, hoisting should be much easier with the same effect, I guess. Would it help here?

Either option looks unclear to the end user because of the bitwise operations, so the most effective implementation should be chosen (folded constants).

SimonStPeter commented 2 years ago

as below

On 07/05/2022 15:17, Ivan Kochurkin wrote:

Any idea how much it would improve a real grammar if implemented?

I wrote in detail in the reference issue. The simplest case:

grammar Test;
e: T1 | T2 | T3;
T1: 't1';
T2: 't2';
T3: 't3';

The more alternatives a rule has, the more operations are required.

I mean, unless I missed it, might it be a 5% speedup or a 20x speedup in a real grammar?

BTW perhaps easier than folding constant exprs it would be just to
hoist them into vars evaluated once:

The vars would also have to be stored somewhere. Also, vars are not as efficient as const literals, because Python has no idea about constants.

Global vars. Stick them at the top. The efficiency is slightly less I agree but hoisting const exprs into vars is very easy to do, so very easy to try out I hope.

Nice work though.

jan

folding is easy but maybe a bit messy, hoisting should be much
easier with the same effect, I guess. Would it help here?

Either option looks unclear to the end user because of the bitwise operations, so the most effective implementation should be chosen (folded constants).


KvanTTT commented 2 years ago

I mean, unless I missed it, might it be a 5% speedup or a 20x speedup in a real grammar?

I'll check on a real grammar after implementation.

Global vars. Stick them at the top. The efficiency is slightly less I agree but hoisting const exprs into vars is very easy to do, so very easy to try out I hope.

Actually, the implementation with constant folding is also very easy, maybe even easier than vars.

parrt commented 2 years ago

Doesn't the python compiler do any kind of optimization?

parrt commented 2 years ago

oh, what if we started generating cython?

KvanTTT commented 2 years ago

Doesn't the python compiler do any kind of optimization?

I haven't checked, but not all users use CPython. Also, I'm not sure CPython optimizes such cases, because Python as a language doesn't have the concept of a constant; the interpreter or compiler simply has no idea whether a var will be changed. If CPython had constants, then the optimization would be possible.

oh, what if we started generating cython?

I don't know how many changes are required for compatibility with Cython, or whether such a library could be used from ordinary Python. Actually, I know almost nothing about Cython.

pinaraf commented 2 years ago

Doesn't the python compiler do any kind of optimization?

Where we go, there is no compiler… :)

The only "compiler" step is the conversion of source code to bytecode, and the bytecode keeps everything; there is almost no optimization possible. In the previously mentioned no_folding_const_literals_test function, CPython 3.10 will optimize it to return 131070. Otherwise, since the interpreter doesn't know what is a constant and what is a variable, it cannot optimize away the variable dereferences, and thus no_folding_const_refs_test has a whopping 128-byte body (instead of 4 bytes) and does, for each part, LOAD_CONST, LOAD_GLOBAL, BINARY_LSHIFT. LOAD_GLOBAL has to go through the global dictionary, thus an "expensive" call. Maybe the "faster cpython" project will help later; cf. https://lwn.net/SubscriberLink/893686/c047ba1f808dd27d/ for the latest report on this. While waiting for that, well… folding, replacing const literals, or using Cython are some of the only options.
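That bytecode-level difference can be observed directly with the stdlib dis module (a small self-contained sketch; the constants are invented):

```python
import dis

C1, C2 = 3, 5  # module-level "constants" -- just ordinary globals to CPython

def const_refs():
    # References to globals: each call does LOAD_GLOBAL plus shift/or ops.
    return (1 << C1) | (1 << C2)

def const_literals():
    # Pure literals: the peephole optimizer folds this to a single constant (40).
    return (1 << 3) | (1 << 5)

ref_ops = [i.opname for i in dis.get_instructions(const_refs)]
lit_ops = [i.opname for i in dis.get_instructions(const_literals)]
print("LOAD_GLOBAL" in ref_ops)  # True: globals are looked up at run time
print("LOAD_GLOBAL" in lit_ops)  # False: folded at compile time
```

The exact opcode names vary between CPython versions (e.g. BINARY_LSHIFT became BINARY_OP in 3.11), but the LOAD_GLOBAL lookups in the unfolded version remain in every version.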

KvanTTT commented 2 years ago

Yet another potential performance improvement: https://github.com/antlr/antlr4/issues/3703

KvanTTT commented 2 years ago

I see a lot of

if token in [PythonQLParser.NOT, PythonQLParser.NONE, PythonQLParser.TRUE, PythonQLParser.FALSE, PythonQLParser.NAME, PythonQLParser.STRING_LITERAL, PythonQLParser.BYTES_LITERAL, PythonQLParser.DECIMAL_INTEGER, PythonQLParser.OCT_INTEGER, PythonQLParser.HEX_INTEGER, PythonQLParser.BIN_INTEGER, PythonQLParser.FLOAT_NUMBER, PythonQLParser.IMAG_NUMBER, PythonQLParser.ELLIPSIS, PythonQLParser.STAR, PythonQLParser.OPEN_PAREN, PythonQLParser.OPEN_BRACK, PythonQLParser.ADD, PythonQLParser.MINUS, PythonQLParser.NOT_OP, PythonQLParser.OPEN_BRACE]:

and

if ((((_la - 41)) & ~0x3f) == 0 and ((1 << (_la - 41)) & ((1 << (PythonQLParser.LAMBDA - 41)) | (1 << (PythonQLParser.NOT - 41)) | (1 << (PythonQLParser.NONE - 41)) | (1 << (PythonQLParser.TRUE - 41)) | (1 << (PythonQLParser.FALSE - 41)) | (1 << (PythonQLParser.NAME - 41)) | (1 << (PythonQLParser.STRING_LITERAL - 41)) | (1 << (PythonQLParser.BYTES_LITERAL - 41)) | (1 << (PythonQLParser.DECIMAL_INTEGER - 41)) | (1 << (PythonQLParser.OCT_INTEGER - 41)) | (1 << (PythonQLParser.HEX_INTEGER - 41)) | (1 << (PythonQLParser.BIN_INTEGER - 41)) | (1 << (PythonQLParser.FLOAT_NUMBER - 41)) | (1 << (PythonQLParser.IMAG_NUMBER - 41)) | (1 << (PythonQLParser.ELLIPSIS - 41)) | (1 << (PythonQLParser.STAR - 41)) | (1 << (PythonQLParser.OPEN_PAREN - 41)) | (1 << (PythonQLParser.OPEN_BRACK - 41)) | (1 << (PythonQLParser.ADD - 41)) | (1 << (PythonQLParser.MINUS - 41)) | (1 << (PythonQLParser.NOT_OP - 41)) | (1 << (PythonQLParser.OPEN_BRACE - 41)))) != 0):

in the PythonQLParser.py generated by @pavelvelikhov. I believe the suggested improvements might improve parser performance (though it looks like the author no longer uses ANTLR).

amykyta3 commented 2 years ago

@KvanTTT I've put together a post-processing script that attempts to optimize the constant shift operations you pointed out in #3698. See: https://gist.github.com/amykyta3/8285559e95074c4431c2836d78b36530 This script patches a *Parser.py file and replaces these constant expressions with literals.

I tried it on an existing grammar in one of my projects and saw about a 2% speed improvement, but that grammar also doesn't branch tokens all that deeply. I wonder if other grammars would benefit from this more.

EDIT: Added the optimization described in #3703 to the above gist as well

amykyta3 commented 2 years ago

Here's the diff the patch script produces: https://github.com/SystemRDL/systemrdl-compiler/commit/c4d425e8216f86c2d2572ed61803b46c48261045

Edited the above comment - I had initially botched my benchmark. I'm now seeing a ~2% improvement in overall parse speed after implementing both optimizations.

KvanTTT commented 2 years ago

I wonder if other grammars would benefit from this more.

Not sure, because they have constants that can be effectively folded during compilation. They also have switch-case, unlike Python.

Edited above comment - I had initially botched my benchmark. Now am seeing ~2% improvement in overall parse speed after implementing both optimizations

Not impressive. Maybe it's better for other grammars or input files. Anyway, I consider it a useful optimization, since it also reduces the size of the generated code and eliminates the very long lines, which are confusing.

KvanTTT commented 2 years ago

Here's the diff the patch script produces: https://github.com/SystemRDL/systemrdl-compiler/commit/c4d425e8216f86c2d2572ed61803b46c48261045

Have you tried an array of literals, [0, 1, 2, 3], instead of (1 << token) & 0x20000000f00000003ffffff000000000 != 0? Not sure that's the best implementation, since it looks like big numbers (>64-bit) are used.
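(For anyone wanting to measure: a rough stdlib sketch comparing the three candidate membership tests with invented token values — not a rigorous benchmark, and relative results will vary by Python version and input distribution.)

```python
import timeit

TOKENS = frozenset(range(41, 63))          # set-membership variant
MASK = sum(1 << t for t in range(41, 63))  # big-int mask variant (>64-bit is fine)
AS_LIST = list(range(41, 63))              # the `token in [...]` variant

def by_mask(la):
    return (1 << la) & MASK != 0

def by_set(la):
    return la in TOKENS

def by_list(la):
    return la in AS_LIST

# Sanity check: all three variants agree on every candidate token value.
assert all(by_mask(t) == by_set(t) == by_list(t) for t in range(128))

for name, fn in [("mask", by_mask), ("set", by_set), ("list", by_list)]:
    elapsed = timeit.timeit(lambda: fn(62), number=100_000)
    print(f"{name}: {elapsed:.3f}s")
```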