cenotelie / hime

Apache License 2.0

[.NET] Really slow parsing #63

Open woutersl opened 5 years ago

woutersl commented 5 years ago

Original report by David Aramant (Bitbucket: 557058:90c5a420-b317-4f31-bffe-8f5c9effcd9f, GitHub: davidaramant).


I've been trying out various parsing solutions for a hobby project and I'm having some trouble with Hime. I've got a parser working, but it's incredibly slow compared to other frameworks (>16s for Hime and <1s for ANTLR on a large input file). I'm not sure if there's a problem with the grammar that's causing this (I'm fairly new to parser frameworks) or some other issue.

The Hime grammar file is here. For reference, the spec for the format I'm parsing helpfully includes a grammar.

woutersl commented 5 years ago

Original comment by Laurent Wouters (Bitbucket: 557058:675792b6-d731-4823-9f7d-c6dfcb2df2b5).


I've taken a look at the grammar, and one possible explanation for the slowness on large input files is the use of the * and + operators in syntactic rules (as opposed to terminals). As explained in the Repetition Operators section of https://cenotelie.fr/projects/hime/reference/lang-rules.html, the parser generator transforms rules that use these operators, injecting generated variables to obtain a canonical form compatible with the parsing algorithms. At runtime, those generated variables are recognized and the parser has logic to reconstruct the AST the user would expect from the original grammar.

For example

a -> x y+ ;

is re-written as

_genvar -> y ;
_genvar -> _genvar y ;
a -> x _genvar ;

However, the produced AST will still look like a root labelled 'a' with a single child node labelled 'x' followed by one or more nodes labelled 'y'. This form is reconstructed at runtime; although the logic performs well in small cases, it can be expensive for huge inputs where the number of 'y' children is very large.
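To make the reconstruction concrete, here is a minimal Python sketch of what that runtime step has to produce (this is an illustrative model, not Hime's actual code; representing nodes as (label, children) tuples is an assumption for the example):

```python
# Hypothetical sketch (not Hime's implementation): expand a generated
# _genvar chain back into the flat child list promised by `a -> x y+ ;`.

def flatten_genvar(node):
    """Recursively expand _genvar nodes into the repeated children."""
    if node[0] != "_genvar":
        return [node]
    # `_genvar -> y` or `_genvar -> _genvar y`: expand left-to-right.
    children = []
    for child in node[1]:
        children.extend(flatten_genvar(child))
    return children

# Internal tree for input "xyyy": a -> x _genvar, left-recursive chain.
internal = ("a", [
    ("x", []),
    ("_genvar", [("_genvar", [("_genvar", [("y", [])]), ("y", [])]), ("y", [])]),
])

flat_children = []
for child in internal[1]:
    flat_children.extend(flatten_genvar(child))
user_ast = ("a", flat_children)
print(user_ast)  # ('a', [('x', []), ('y', []), ('y', []), ('y', [])])
```

The user-facing tree hides the generated chain entirely, which is exactly why the runtime has to rebuild the child list.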

If this is indeed your case, one solution is to manually rewrite the grammar in the same manner to remove the use of the offending + (or *) operator. You will avoid the pathological case described above, but the downside is that the AST will take another form, perhaps slightly less readable.

In your grammar, perhaps the best option would be to try replacing

translation_unit -> global_expr+;

with a recursive definition like

translation_unit -> global_exprs;
global_exprs -> global_expr global_exprs;
global_exprs -> global_expr;

The AST will then no longer be flat, with all the global_expr nodes directly at the root as before, but parsing should be much faster for a huge file with all its expressions at the file's root (grammatically speaking).
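If the nested shape is inconvenient for the consumer, a short walk can recover the flat list. A minimal Python sketch (an illustration, not Hime's API; it assumes the rewrite includes a base rule `global_exprs -> global_expr;` and models nodes as (label, children) tuples):

```python
# Hypothetical node shape: (label, children). The rewritten grammar
# produces a right-recursive global_exprs chain; this walks it flat.

def collect_global_exprs(node):
    """Yield global_expr nodes from a right-recursive global_exprs chain."""
    while node is not None:
        first = node[1][0]
        rest = node[1][1] if len(node[1]) > 1 else None
        yield first
        node = rest

# Chain for three expressions e1, e2, e3:
nested = ("global_exprs", [
    ("global_expr", ["e1"]),
    ("global_exprs", [
        ("global_expr", ["e2"]),
        ("global_exprs", [("global_expr", ["e3"])]),
    ]),
])

print([e[1][0] for e in collect_global_exprs(nested)])  # ['e1', 'e2', 'e3']
```

The walk is iterative rather than recursive, so it stays cheap even for very long chains.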

woutersl commented 5 years ago

Original comment by David Aramant (Bitbucket: 557058:90c5a420-b317-4f31-bffe-8f5c9effcd9f, GitHub: davidaramant).


Looks like you're right. Getting rid of the + for global expressions reduced the time to 7s. Interestingly when I removed the * for assignment expressions it increased to 8s, but I'll play with it some more.

igor-krawczuk commented 1 year ago

I just encountered the same issue when playing with the library for a simple Rust-based tree parser for the ListOps dataset. Applying the fix dropped the parsing time for the training set (90e3 lines) from 30 seconds to basically instant. However, the recursive form isn't as elegant, as you noted. If you have time, could you link to the sections of the code that do the _genvar transformation/reconstruction? If I had to guess, I would expect the slowdown to come from the insertion and/or the reconstruction involving copies/clones; I would like to check it out and see if I can find a fix at some point.

Thank you for open sourcing this very nice tool btw!

woutersl commented 1 year ago

Hi, yes, this is a general problem. Here are links to the relevant locations in the implementation:

Basically, when constructing an AST for a rule that uses a + or * operator, the reconstruction process happens bottom-up, each time the parent rule is reduced. For example, with the grammar

a -> x y+ ;

we have the final rules:

_genvar -> y ;
_genvar -> _genvar y ;
a -> x _genvar ;

Then, at parse time, given the input xyyy, the following happens: having read xy, we have the following partial ASTs:

x 
_genvar(0) -> y

Then we read the second y and recopy the nodes for _genvar(0) to obtain

x 
_genvar(1) -> y y

Then we read the third y and, once again, recopy the nodes for _genvar(1) to obtain

x 
_genvar(2) -> y y y

Finally, we reduce a and recopy all the nodes for _genvar(2) once more to obtain

a -> x y y y

This process is indeed inefficient when the + or * operator matches a large number of repeating elements, because the repeated nodes are recopied again and again. Optimizing this to avoid the copies is not a trivial task, though, because at parse time, while in the repeating sequence, we do not know how many repeating elements there will be.
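The trace above can be turned into a tiny cost model. This is an illustrative Python sketch (an assumption about the copying pattern described, not Hime's actual data structures) counting node copies when each reduction recopies the accumulated children versus appending in place:

```python
# Illustrative cost model: recopying the accumulated children on every
# reduction of `_genvar -> _genvar y` costs O(n^2) node copies for n
# repetitions, whereas appending in place costs O(n).

def recopy_cost(n):
    """Copies performed when each reduction recopies the accumulated list."""
    copies, children = 0, []
    for _ in range(n):
        children = children + ["y"]   # fresh list: recopies every node so far
        copies += len(children)
    return copies

def append_cost(n):
    """Copies performed when each reduction appends one node in place."""
    copies, children = 0, []
    for _ in range(n):
        children.append("y")
        copies += 1
    return copies

print(recopy_cost(1000))  # 500500 -> quadratic growth
print(append_cost(1000))  # 1000   -> linear growth
```

The gap between the two curves is why the manual grammar rewrite makes such a dramatic difference on large, repetition-heavy inputs.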