antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Grammar that parses in milliseconds in Java runtime, takes many seconds in python #1219

Open pavelvelikhov opened 8 years ago

pavelvelikhov commented 8 years ago

I have a grammar of Python3 (from the antlr4 grammar repository), extended with query language constructs. The grammar file is here: Grammar file

The whole project is here: PythonQL

This tiny program parses in milliseconds with the Java runtime, but takes about 1.5 seconds in Python (after the recent fix; before it was over 2 seconds).

# This example illustrates the window query in PythonQL

from collections import namedtuple
trade = namedtuple('Trade', ['day','ammount', 'stock_id'])

trades = [ trade(1, 15.34, 'APPL'),
           trade(2, 13.45, 'APPL'),
           trade(3, 8.34,  'APPL'),
           trade(4, 9.87,  'APPL'),
           trade(5, 10.99, 'APPL'),
           trade(6, 76.16, 'APPL') ]

# Maximum 3-day sum

res = (select win
        for sliding window win in ( select t.ammount for t in trades )
        start at s when True
        only end at e when (e-s == 2))

print (res)

Here is a profiler trace just in case (I left only the relevant entries):

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21    0.000    0.000    0.094    0.004 PythonQLParser.py:7483(argument)
    8    0.000    0.000    0.195    0.024 PythonQLParser.py:7379(arglist)  
    9    0.000    0.000    0.196    0.022 PythonQLParser.py:6836(trailer)  
    5/3    0.000    0.000    0.132   0.044 PythonQLParser.py:6765(testlist_comp)
    1    0.000    0.000    0.012    0.012 PythonQLParser.py:6154(window_end_cond)
    1    0.000    0.000    0.057    0.057 PythonQLParser.py:6058(sliding_window)
    1    0.000    0.000    0.057    0.057 PythonQLParser.py:5941(window_clause)
    1   0.000    0.000    0.004   0.004 PythonQLParser.py:5807(for_clause_entry)
    1    0.000    0.000    0.020 0.020 PythonQLParser.py:5752(for_clause) 
    2/1   0.000   0.000   0.068   0.068 PythonQLParser.py:5553(query_expression)    
   48/10    0.000    0.000    0.133    0.013 PythonQLParser.py:5370(atom) 
    48/7    0.000    0.000    0.315    0.045 PythonQLParser.py:5283(power) 
    48/7    0.000    0.000    0.315    0.045 PythonQLParser.py:5212(factor) 
    48/7    0.000    0.000    0.331    0.047 PythonQLParser.py:5132(term) 
    47/7    0.000    0.000    0.346    0.049 PythonQLParser.py:5071(arith_expr) 
    47/7    0.000    0.000    0.361    0.052 PythonQLParser.py:5010(shift_expr) 
    47/7    0.000    0.000    0.376    0.054 PythonQLParser.py:4962(and_expr) 
    47/7    0.000    0.000    0.390    0.056 PythonQLParser.py:4914(xor_expr) 
    47/7    0.000    0.000    0.405    0.058 PythonQLParser.py:4866(expr) 
    44/7    0.000    0.000    0.405    0.058 PythonQLParser.py:4823(star_expr) 
    43/7    0.000    0.000    0.422    0.060 PythonQLParser.py:4615(not_test) 
    43/7    0.000    0.000    0.438    0.063 PythonQLParser.py:4563(and_test) 
    43/7    0.000    0.000    0.453    0.065 PythonQLParser.py:4509(or_test) 
    43/7    0.000    0.000    0.467    0.067 PythonQLParser.py:4293(old_test) 
    43/7    0.000    0.000    0.467    0.067 PythonQLParser.py:4179(try_catch_expr)
    43/7    0.000    0.000    0.482    0.069 PythonQLParser.py:3978(test) 
    1    0.000    0.000    0.048    0.048 PythonQLParser.py:2793(import_from) 
    1    0.000    0.000    0.048    0.048 PythonQLParser.py:2702(import_stmt) 
    7    0.000    0.000    1.728    0.247 PythonQLParser.py:2251(testlist_star_expr) 
    4    0.000    0.000    1.770    0.443 PythonQLParser.py:2161(expr_stmt) 
    5    0.000    0.000    1.822    0.364 PythonQLParser.py:2063(small_stmt) 
    5    0.000    0.000    1.855    0.371 PythonQLParser.py:1980(simple_stmt) 
    5    0.000    0.000    1.859    0.372 PythonQLParser.py:1930(stmt) 
    1    0.000    0.000    1.898    1.898 PythonQLParser.py:1085(file_input)
    176    0.002    0.000    0.993    0.006 Lexer.py:127(nextToken)
    420    0.000   0.000   0.535   0.001 ParserATNSimulator.py:1120(closure)
   705    0.003    0.000    1.642    0.002 ParserATNSimulator.py:315(adaptivePredict)

I have attached a file that parses for 7 seconds on my MacBook Pro as well.

I'd be happy to reduce this case to a minimal case for debugging, but don't really know where to start.

The grammar doesn't seem to have any problems like ambiguity, etc.

pavelvelikhov commented 8 years ago

This program takes 5 seconds to parse


import csv
import sys
from collections import namedtuple
import json
import numpy as np
from dateutil.parser import parse
from pythonql.PQTuple import pq_wrap,pq_flatten

fields = ['', 'event_date', 'event_name', 'client_id', 'lead_id', 'visitor_id', 'partner_id', 'address', 'amount', 'birthdate', 'card_type', 'cardholder_name', 'city', 'credit_limit', 'current_duration', 'domain', 'duration', 'effective_amount', 'email', 'first_name', 'gender', 'interest_rate', 'last_name', 'lead_source', 'middle_name', 'mobile_head', 'mobile_number', 'new_amount', 'new_duration', 'new_interest_rate', 'pan_tail', 'query', 'referer', 'region', 'repaid_amount', 'repaid_interest', 'user_agent', 'utm_campaign', 'webmaster_id', 'reason', 'dt']

schema = {f:i for (i,f) in enumerate(fields)}

out_fields = ['event','date']
out_schema = {f:i for (i,f) in enumerate(out_fields)}

csv.field_size_limit(sys.maxsize)

data = []
f = open('sm_cust_journey.csv')
rd = csv.reader(f)
for (i, (cust_id,cj_data)) in enumerate(rd):
  steps = json.loads(cj_data)
  data.append( {'id':cust_id, 'cj': steps } )
  if i==1000:
    break

# compute number of new users, registrations, scorecard checks, accepts, issued, repaid, npl 1, 30, 60, 90
events =( select [
           select 'reg' as event, parse(e.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="NAME_SAVED"]
           count c1
           where c1 == 0,

           select 'repaid' as event, parse(e2.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="LOAN_IS_REPAID"]
           count c1
           where c1 == 0
           for e2 in [item for item in cj_data if item.event_name=="LOAN_ISSUED"]
           count c2
           where c2 == 0,

           select 'np1' as event, parse(e.event_date) as dt, channel
           for e in [item for item in cj_data if item.event_name=="LOAN_ISSUED"]
           count c1
           where c1 == 0
           let issue_date = parse(e.event_date)
           where not (
                        select item
                        for item in cj_data
            where item.event_name=="LOAN_IS_REPAID" and (parse(item.event_date) - issue_date ).days < 31 )
   ]
   for cj in data
   let cj_data = list( pq_wrap(cj["cj"], schema) )
   let channel = try {[item for item in cj_data if item.event_name=="NEW_USER_DETECTED" ][0].reason} except {None}
   where not channel is None )

res = (select month, year, channel, event, len(e) as cnt
       for e in pq_flatten(events)
       let event = e.event
       group by e.dt.month as month, e.dt.year as year, e.channel as channel, event
       order by year,month)

print(res)
ericvergnaud commented 8 years ago

Hi, please check #1218, it could be the same issue. Eric

pavelvelikhov commented 8 years ago

Hi Eric,

I did look at #1218 (I mentioned it as the recent fix) and checked out the build where it was fixed. It did affect the performance a bit (20-30%), but the very long parse times compared to Java are still there. The numbers given are from the build of Jun 23 08:54:27 2016 -0700.

ericvergnaud commented 8 years ago

Hi, not sure I can help here. A couple of comments:

Eric

pavelvelikhov commented 8 years ago

Hi Eric,

Yes, the goal is to parse in Python only. Could you recommend some techniques to optimize the grammar for the Python runtime? Yeah, isolating a slow part would be helpful, I know. It's just a lot of work and I don't see a clear strategy for how to do it.

Thanks! Pavel

SimonStPeter commented 8 years ago

I don't buy it. ANTLR is supposed to produce RD parsers comparable in efficiency to hand-written ones. Python is slow but it's not that flipping slow. Something is not right here. I know I could manually write one that would beat this hands down.

I don't think it's the parser; I suspect Python is malfunctioning. I happen to have generated code left over from looking at this question on Stack Overflow. My first suspicion is of the fat and mostly constant expressions like this:

if (((_la) & ~0x3f) == 0 and ((1 << _la) & ((1 << MSSQLParser.T__2) | (1 << MSSQLParser.RETURN) | (1 << MSSQLParser.TRY) | (1 << MSSQLParser.LAMBDA) | (1 << MSSQLParser.NOT) | (1 << MSSQLParser.NONE) | (1 << MSSQLParser.TRUE) | (1 << MSSQLParser.FALSE) | (1 << MSSQLParser.NAME) | (1 << MSSQLParser.STRING_LITERAL) | (1 << MSSQLParser.BYTES_LITERAL) | (1 << MSSQLParser.DECIMAL_INTEGER) | (1 << MSSQLParser.OCT_INTEGER) | (1 << MSSQLParser.HEX_INTEGER) | (1 << MSSQLParser.BIN_INTEGER))) != 0) or ((((_la - 64)) & ~0x3f) == 0 and ((1 << (_la - 64)) & ((1 << (MSSQLParser.FLOAT_NUMBER - 64)) | (1 << (MSSQLParser.IMAG_NUMBER - 64)) | (1 << (MSSQLParser.ELLIPSIS - 64)) | (1 << (MSSQLParser.STAR - 64)) | (1 << (MSSQLParser.OPEN_PAREN - 64)) | (1 << (MSSQLParser.OPEN_BRACK - 64)) | (1 << (MSSQLParser.ADD - 64)) | (1 << (MSSQLParser.MINUS - 64)) | (1 << (MSSQLParser.NOT_OP - 64)) | (1 << (MSSQLParser.OPEN_BRACE - 64)))) != 0):

AFAICS the only variable in there is _la. Machine-generated stuff is famous for breaking compilers/interpreters, and I think this may be a possibility.

2nd thought is garbage collection. I'd like to know how much GC is being done.

My suggestion is yes, this may be an antlr issue indirectly but my primary concern would be python. I suggest you take this up in a python newsgroup and ask for guys with python profiling and python internals experience to have a look.

Sorry I can't suggest anything else, very busy ATM.

jimidle commented 8 years ago

If you need speed, then you should consider changing targets. You are flogging a dead horse trying to get performance from Python. If you just need it to be a little faster, then by all means pursue it.

Jim

SimonStPeter commented 8 years ago

@jimidle: to say that is to say that the Python target in ANTLR is functionally unusable, in which case it should be deprecated ASAP to stop other people from going down that blind alley. However, I repeat, I don't believe it. Python is slow but not that slow. Nowhere near. Something else is wrong. I'd love to dig into it but I'm starting a new job tomorrow and I'm going to be snowed under for the foreseeable future. @pavelvelikhov: please see my suggestion about taking it up with Python experts.

ericvergnaud commented 8 years ago

Hi,

Many people use Python as a wrapper around C++ libraries, so they get the feeling that it's not that slow. Pure Python is 20-30 times slower than Java; not my say, but what benchmarks say.

As we've seen recently, there is a possibility that the Python runtimes don't behave like the Java runtime in terms of performance, due to bugs which prevent algorithmic optimizations. Add to that the fact that the current optimizations were tuned for Java and C#, and there is also a possibility that they hit slow parts of Python. It will just take time to identify specific patterns that need further Python-specific implementations.

That said, there are plenty of good use cases for a not-so-fast Python parser. People who choose Python in the first place are not looking at performance as a key driver. So no reason to deprecate it.

Eric

Envoyé de mon iPhone

Le 26 juin 2016 à 17:51, SimonStPeter notifications@github.com a écrit :

@jimidle: to say that is to say that the python target in ANTLR is functionally unusable, in which case it should be deprecated ASAP to stop other people from going down that blind alley. However, I repeat, I don't believe it. Python is slow but not that slow. Nowhere near. Something else is wrong. I'd love to dig into it but I'm starting a new job tomorrow and I'm going to be snowed under for the foreseeable future. @pavelvelikhov: please see my suggestion about taking it up with pthon experts.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

jimidle commented 8 years ago

As Eric says, getting a target to work perfectly does take time. But even if it is now perfect, it will not perform like the other targets. Lots of good uses for Python, and many tasks are not driven by performance, though I feel that Python will get better in this regard as time goes on.

pinaraf commented 8 years ago

Hi

Do you have a simple archive to download that would contain everything needed to reproduce the issue? In order to discard the «python is slow» argument, I suggest you also try with PyPy. After a few (~100) iterations, once the JIT triggers, if performance is not significantly better, then there may be something wrong in the antlr runtime or generated code. But I agree with SimonStPeter: Python slowness alone should not explain that performance problem, otherwise a bug report should be filed against the CPython interpreter.

pavelvelikhov commented 8 years ago

Hi!

You can grab the whole thing from the git repository: https://github.com/pavelvelikhov/pythonql Don't have an archive, but git is pretty friendly.

I'll try a few things over the weekend (e.g. removing all the grammar added to the original Python3 grammar) and I'll add a launcher with a profiler.

pavelvelikhov commented 8 years ago

Hi!

I've taken out all my productions that were added on top of the Python3 grammar, which is in the official ANTLR4 grammar repository. The speed is about the same, so the original Python3 grammar parses very slowly as well.

SimonStPeter commented 8 years ago

Regarding my prior thought about performance being caused perhaps by giant expressions, I stripped down one and iterated it 1000 times, updating _la in the loop to prevent result caching, and it took under a second so it's not that. I'm even more suspicious now about garbage & GC.
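
For reference, here is a minimal sketch of that kind of microbenchmark. The token constants are made-up stand-ins for the generated MSSQLParser.* values, and the bit masks are hoisted into variables for readability (the generated code rebuilds them inline):

import time

# Hypothetical token type constants standing in for the generated MSSQLParser.* values.
T__2, RETURN, TRY, LAMBDA, NOT, NONE, TRUE, FALSE = range(1, 9)
STAR, OPEN_PAREN, OPEN_BRACK, ADD, MINUS = range(64, 69)

low_mask = ((1 << T__2) | (1 << RETURN) | (1 << TRY) | (1 << LAMBDA) |
            (1 << NOT) | (1 << NONE) | (1 << TRUE) | (1 << FALSE))
high_mask = ((1 << (STAR - 64)) | (1 << (OPEN_PAREN - 64)) |
             (1 << (OPEN_BRACK - 64)) | (1 << (ADD - 64)) | (1 << (MINUS - 64)))

t0 = time.time()
hits = 0
for i in range(1000):
    _la = i % 70  # vary the lookahead token so nothing can be cached
    if ((_la & ~0x3f) == 0 and ((1 << _la) & low_mask) != 0) or \
       (((_la - 64) & ~0x3f) == 0 and ((1 << (_la - 64)) & high_mask) != 0):
        hits += 1
print('hits:', hits, 'elapsed secs:', time.time() - t0)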

thisiscam commented 8 years ago

I'm encountering performance issues with ANTLR's Python target as well, and I've compared my grammar on CPython and PyPy -- both were at least a thousand times slower than their Java counterparts.

This drives me to the conclusion that it's probably still due to inefficient translated code or the Python runtime code.

As one of the previous comments suggested, people in the Python community who are interested in speed usually use C/C++ extensions. Considering that we already have a C++ ANTLR target, an ANTLR Python target based on a C++ extension is very practical and would be a lot more useful than the current Python target! I can get hands-on with contributing when I get some time later, but for now this might be a good idea for anyone who might be interested.

parrt commented 7 years ago

@pavelvelikhov hi! heh, can you check again due to recent fix? https://github.com/antlr/antlr4/pull/1441

pavelvelikhov commented 7 years ago

Hi Terence,

Will try this again soon!

Best regards, Pavel

millergarym commented 7 years ago

@pavelvelikhov take a look at the last comment in #1540. I would be interested to know if it's a similar issue.

parrt commented 7 years ago

@pavelvelikhov still an issue? I'd like to know if I can close.

pavelvelikhov commented 7 years ago

Hi Terence, I checked out an older version of pythonql; a simple test still takes ~7 sec to parse, while the same test parses in about 100 ms in PLY.

parrt commented 7 years ago

dang. ok, i'm leaving this open.

ricardojuanpalmaduran commented 6 years ago

Is there any progress on this? I'm currently using ANTLR4 to parse some SQLite files (my SQLite grammar comes directly from the ANTLR4 website), and it takes a few seconds to parse the simplest of files.

ricardojuanpalmaduran commented 5 years ago

@ericvergnaud is there any plan to improve the performance of the Python parser? I'm curious why PLY would take 100 ms on an example while ANTLR4 would take 7 s, as pointed out by @pavelvelikhov. Such a big difference is difficult to attribute to Python only.

You recommended maybe tuning grammars to be more Python-friendly. Is there any advice on how to do that? I'm currently using the SQLite grammar that you guys provide on the website, and parsing a simple SQLite file with a few tables takes seconds.

Thanks.

ericvergnaud commented 5 years ago

Hi, it's easy to attribute the slowness to Python alone, because the exact same algorithm behaves much better with Java, C#, and even JavaScript! This is unfortunately proven by profilers, which show the exact same # of calls to core parts of the algorithm for a given grammar and input.

Re PLY, a fundamental difference between ANTLR and other parser generators is support for left recursion, which lets developers express grammars in a natural way, as opposed to traditional parser generators. This comes with a cost: there is an infinite number of valid token sequences, which therefore cannot be precomputed (and the corresponding code pre-generated). Instead they are discovered and registered at runtime. This may seem illogical but actually works, because every input is finite. Beyond the natural slowness of Python, there is the cost of this unique algorithm, where simplicity of use comes with a performance penalty.

I can think of many ways to improve Python runtime performance, such as optionally dropping left recursion and/or using more optimal data structures (no idea which ones...), but any of these would come at the cost of dropping cross-target compatibility, which is the very reason I provided those runtimes in the first place, so I'm not keen to do this myself. More broadly, I would tend to argue that anybody looking for performance should not use Python in the first place, but I appreciate this is very defensive, since I have not been able to improve performance to a level which competes with the Java version or PLY.

Re how to improve a given grammar, I would simply profile the parser and count the # of calls to ParserATNSimulator.closure_, which is where parsing decisions are made. Reducing the # of alternatives for a given rule is the way to improve performance, which you can measure by counting the above calls (FYI, on all grammars and inputs I've tested, the # of such calls is exactly the same for Java and all targets I have implemented).
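
A minimal sketch of that kind of measurement, assuming a generated MyLexer/MyParser pair with a file_input start rule (all of these names are placeholders for whatever your grammar actually generates):

import cProfile
import pstats
from antlr4 import CommonTokenStream, InputStream

from MyLexer import MyLexer    # hypothetical generated lexer
from MyParser import MyParser  # hypothetical generated parser

def parse(text):
    lexer = MyLexer(InputStream(text))
    parser = MyParser(CommonTokenStream(lexer))
    return parser.file_input()  # hypothetical start rule

profiler = cProfile.Profile()
profiler.enable()
tree = parse(open('sample.txt').read())
profiler.disable()

# Print only the ATN simulator entries; closure_ is where prediction work happens.
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats('ParserATNSimulator')

Comparing the ncalls column for closure_ before and after a grammar change gives the number Eric suggests tracking.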

ericvergnaud commented 5 years ago

Some simple potential optimisations for SQLite would be as follows:

ricardojuanpalmaduran commented 5 years ago

I've removed all the fragments and hardcoded the values of the keywords and it seems that it does not have a significant impact on the number of calls to closure_. To give some numbers, with the change the number of calls as reported by cProfile is '955807/13579' while without the change the number of calls is '1016393/13928' (that's a ~6% reduction).

Any other advice off the top of your head that might improve the performance of this grammar in Python?

As you pointed out, the # of calls to core methods of the parsing algorithm seems to be the same in both Java and Python. It seems that unless there's a significant improvement in the underlying parsing algorithm (in which case all runtimes would benefit, but I assume that is unlikely to happen), there isn't much the Python runtime can do.

parrt commented 5 years ago

I wonder if a cython version of the runtime would help much?

ricardojuanpalmaduran commented 5 years ago

I wonder if a cython version of the runtime would help much?

I was wondering the same. I'm not familiar with Cython (I just learned about it two days ago). If I understood the principle behind it correctly, it works well as long as it's capable of compiling things down to C, so the question is how much of the runtime could actually end up being compiled to C.

From what I read, it seems that providing type information (with Cython syntax) plays a big role there, but how much of that can help with things like managing Python objects? For instance, if part of the slowness of the runtime comes from the fact that we're dealing with Python classes, can Cython somehow do some magic that turns those into C code? From what I understood from http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html, that is not the case unless we have Cython syntax in the runtime.

SimonSntPeter commented 5 years ago

I'm at a bit of a loose end ATM, so will try to take a look in the next few days. My suspicion is, as before, that it's the GC (I've no justification for that, but I'm on the internet so that's standard). Python uses, at first swipe, a refcounting GC, which is utterly different from the generational GC of Java, the latter also being tuned for small, short-lived objects (eden space). My Python is 2+ years rusty and I never got to use any kind of profiler - can any Python experts here give me any pointers? Anyone have any thoughts at all, including disagreements with my GC hypothesis?

parrt commented 5 years ago

The problem with Cython is that it takes really big changes to get massive speed improvements. You've gotta un-Python it and make it C. I got a 42x speed bump this week on something, but the code is unreadable ;)

thisiscam commented 5 years ago

One solution I've always wanted is to reuse the C++ ANTLR interface and generate code for a binding interface such as pybind: https://github.com/pybind/pybind11

KvanTTT commented 5 years ago

What do you think about safe and fast Rust language? :)

ericvergnaud commented 5 years ago

@ricardojuanpalmaduran Maybe you could try calling gc.disable() before parsing and gc.enable() after? This might help determine whether GC is indeed the root cause, and whether it can therefore be tuned (the default 700, 10, 10 thresholds sound completely inappropriate for large x-ref memory data sets such as the ones built by ANTLR).
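
Roughly like this, as a sketch (the parser object and entry rule are placeholders):

import gc

gc.disable()                # suspend the cyclic collector during the parse
try:
    tree = parser.parse()   # placeholder entry rule
finally:
    gc.enable()

# Alternative: keep GC enabled but raise the generation-0 threshold well above
# the default 700 so collections run far less often:
# gc.set_threshold(100000, 10, 10)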

SimonSntPeter commented 5 years ago

@ericvergnaud AIUI (reading up now), in Python the refcount GC falls back to a generational GC (with mark-sweep) for cyclic refs. If ANTLR doesn't produce many or any cycles, then I wouldn't really expect the mark-sweep GC to be invoked - but then your 700 for param1 of gc.set_threshold(), yes, I'd like to see what that's doing, assuming I'm reading the docs right. gc.get_count() apparently tells you these counts. The gc module has a decent lot of dynamic info. At worst we may be able to rule out garbage-related shenanigans quickly. Anybody got a small sample example (grammar size irrelevant, but a small amount of text being parsed) that demonstrates this clearly? Sorry, I've not touched ANTLR for about 2.5 years, so I've completely forgotten everything.

ricardojuanpalmaduran commented 5 years ago

@ericvergnaud disabling GC doesn't seem to have an impact on the parsing time.

SimonSntPeter commented 5 years ago

@ricardojuanpalmaduran Can you send me enough details to reproduce the issue, please? Or link to relevant repo if public. I'll try to have a look monday.

ricardojuanpalmaduran commented 5 years ago

@SimonSntPeter I've attached a self-contained Python project that shows this behavior. Just go to the root of the zip file and run the "main.py" script. This script will parse the SQLite schema defined in the _SCHEMA variable, which is just a dumb schema with a few tables and views. The current code disables GC at the beginning of main(), but as I said before this doesn't seem to have a significant impact on the run time.

The parser has been generated through the IntelliJ plugin, using the SQLite grammar found in the grammars-v4 repository (https://raw.githubusercontent.com/antlr/grammars-v4/master/sqlite/SQLite.g4).

antlr4test.zip

SimonSntPeter commented 5 years ago

Quick update. I've not been able to do much, as I've hit a few speed bumps recently that have got in the way. Hooking into the GC gave times per collection; adding them up gave ~0.6 seconds for GC. If that's accurate, and if I interpret it correctly, this blows my GC-is-the-problem hypothesis away. I've found other interesting stuff, but I'll mention it here when I understand it better. I'll keep poking.
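
For anyone who wants to reproduce that measurement, a rough sketch of what "hooking into the GC" can look like with gc.callbacks (Python 3.3+); the parser call is a placeholder:

import gc
import time

gc_time = 0.0
_start = None

def _gc_callback(phase, info):
    # CPython calls this at the start and stop of every collection.
    global gc_time, _start
    if phase == 'start':
        _start = time.perf_counter()
    elif phase == 'stop' and _start is not None:
        gc_time += time.perf_counter() - _start

gc.callbacks.append(_gc_callback)
tree = parser.parse()        # placeholder entry rule
gc.callbacks.remove(_gc_callback)
print('total GC secs:', gc_time, 'per-generation stats:', gc.get_stats())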

SimonSntPeter commented 5 years ago

Update: still looking at ANTLR performance, but I found something weird which substantially affects the speed and consistency of Python execution. It's emphatically not an ANTLR issue, though it may help a bit for now. I'd like some opinions if there are any pythonistas here (and OS/low-level guys especially), including where better to post this.

I'm using @ricardojuanpalmaduran's code, running on Windows. I've seen execution times vary a lot, from 12 secs to 27 secs, for no obvious reason; it usually runs ~20 secs. I left my machine for a couple of hours, came back, and suddenly it was running at about 8 seconds, and consistently so. Weird. I poked around and noticed a Windows process had gone rogue and was eating up an entire core (I have 2 cores; it's a low-end machine), leaving one core for Python. OK, that made sense: the OS now scheduled the Python process on the remaining core. If I killed the rogue process, so freeing up the core, it reverted to its old, slow behaviour. So I forced the Python process to live on one core, roughly:

start "pointless_but_necessary_title_needed_to_avoid_bug_in_start_command" /affinity 1 "\python.exe" "\antlr4test\main.py"

It now runs consistently in ~8 secs; about 2.5 times faster on average. What's most likely happening is that Windows is moving the process between cores (when not pinned to one core, Python is visibly sharing CPU time over 2 cores if you look at the task manager). The kernel moving the process between cores causes mass reloads of the unshared caches (L1 and L2) from the shared cache (L3) and from main memory (actually L1 and L2 are likely to be updated directly between cores via the cache-coherency mechanism as well, which is better but still bad), and that will cause other processes' cache lines to be evicted. This is known to be expensive, and I understand MS Windows previously fixed a performance bug with a similar underlying cause, but it looks like here we are again, if I'm right. 250% is frankly shocking.

Can anyone try the 'start ...' command above and confirm? Report this to the python community? Any other thoughts?

(to repeat, I'm still looking at ANTLR, this just popped up as well)
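
For anyone wanting to try the same experiment on Linux, a rough equivalent of the /affinity trick is os.sched_setaffinity (Linux-only; macOS has no direct counterpart), or launching with taskset -c 0 python main.py:

import os

# Pin the current process (pid 0 = self) to CPU core 0 before parsing. Linux only.
os.sched_setaffinity(0, {0})

tree = parser.parse()   # placeholder entry rule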

parrt commented 5 years ago

Wow. that is some of the best detective work I've seen!

SimonSntPeter commented 5 years ago

Again, would someone please download @ricardojuanpalmaduran's code and try running it with and without the /affinity command and let me know. I need to know it's not my particular setup.

@parrt - I got lucky. A minor process went postal and used up exactly one of my exactly 2 cores. Very lucky. However I've noticed 30%-50% variation in non-disk-IO bound processes with scala (on JVM), and similar on MSSQL, which is my main job. I've never known why, perhaps I'm about to find out.

parrt commented 5 years ago

I'm trying it now on a Mac using VMware.

parrt commented 5 years ago

well, it ran but didn't dump any info. what am i looking for?

ricardojuanpalmaduran commented 5 years ago

well, it ran but didn't dump any info. what am i looking for?

If I read Simon's comment correctly, you're looking for a significant reduction in execution time when you run the parser with affinity enabled. No dumps or anything, just manual check.

SimonSntPeter commented 5 years ago

@parrt - my mistake, I was unclear. This is on Windows. I don't know how to set the affinity on Mac/Linux, and if there is a problem then it's likely OS-specific anyway, but I'd still be interested if you get the same on non-Windows. To avoid using a stopwatch:

import time
...
t0 = time.time()
...
tree = parser.parse()
t1 = time.time()
print('elapsed secs', t1 - t0)

Again Mr. Parr, sorry for being unclear.

parrt commented 5 years ago

i get same-same

(base) C:\Users\Terence Parr\antlr4test>python main.py
elapsed secs 6.812229156494141

(base) C:\Users\Terence Parr\antlr4test>start "foo" /affinity 1 python.exe main.py

I saw 6.7s before window disappeared.

This is running in VMware on a fast iMac, lightly loaded.

SimonSntPeter commented 5 years ago

@ricardojuanpalmaduran: hi, I've been battering away at this where I could, mostly trying to understand ANTLR's behaviour. Anyway, try this: I got it 3 to 4 times faster by hoisting one grammar rule into another to avoid nesting * or + (these are closures, right?), which seem to be a problem. Also, if you can, please set affinity, as it's a) seemingly just faster and b) very much more consistent in performance, which makes benchmarking changes a lot easier. Don't worry if you can't though. So:

++++++++++ original +++++++++++

parse
 : ( sql_stmt_list | error )* EOF
 ;

error
 : UNEXPECTED_CHAR 
   { 
raise RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text); 
   }
 ;

sql_stmt_list
 : ';'* sql_stmt ( ';'+ sql_stmt )* ';'*
 ;

++++++++++ amended +++++++++++

parse
 : ';' *
   sql_stmt ?
   ( ( ';'+ sql_stmt ) | error )*
   ';' *
   EOF
 ;

error
 : UNEXPECTED_CHAR 
   { 
raise RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text); 
   }
 ;

// sql_stmt_list
//  :
//  //';' *
//  sql_stmt
//  ( ';'+ sql_stmt ) *
//  ';' *
//  ;

I think the amended version is grammatically the same as the original in terms of what it accepts, except that it doesn't honour the error rule, but for now see if this gets you a speedup.

sharwell commented 5 years ago

Is anyone able to break down the profiled performance of adaptivePredict further? ANTLR 4 has a bunch of subtleties in the implementation that are easy to translate incorrectly from one language (Java) to another. These have impacted various targets in the past, and the more specific we can get with the profiler data, the easier it will be to figure out.
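
One way to get that breakdown from a saved cProfile run (the profile file name here is just an example):

import pstats

# Profile saved earlier, e.g. with: python -m cProfile -o parse.prof main.py
stats = pstats.Stats('parse.prof').sort_stats('cumulative')

# What adaptivePredict spends its time calling into...
stats.print_callees('adaptivePredict')

# ...and who drives the closure routines hardest.
stats.print_callers('closure')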

SimonSntPeter commented 5 years ago

@sharwell: I've been looking at the code and the ALL(*) paper and I'm badly out of my depth, so I've just been experimenting to get an intuitive feel, hence the above. But yes, I've been suspicious of that for a while - closureCheckingStopState and closure_ are getting a mad number of calls, so I've been wondering if the DFA cache is working (for a 26k input file, sample run: 6.9 million calls in 2.6 secs, roughly 300,000 each for closureCheckingStopState and closure_, and they seem particularly expensive, if I'm reading this right). But not knowing the code, the ALL(*) algorithm, or how they relate together, and parsing not being my area, it's uphill. I'll try to get some further improvement for the users by tweaking the grammar, then I'll start seriously prodding the underlying code to understand it.

sharwell commented 5 years ago

@SimonSntPeter Any semantic predicate at the beginning of any lexer rule will disable the lexer DFA with overwhelming performance impact. The grammar indicated at the beginning of this topic contains such a predicate here:

https://github.com/pythonql/pythonql/blob/8f77df73ba25f5db2f15c7d430d6c75a9822a8d3/PythonQL.g4#L759-L764

It's unlikely anything will make much of a difference without relocating that predicate (to somewhere after the first character) or eliminating it (via modes).