amykyta3 / speedy-antlr-tool

Generate an accelerator extension that makes your Antlr parser in Python super-fast!
BSD 3-Clause "New" or "Revised" License
29 stars 7 forks source link

some rule ctx missing start/stop tokens #3

Closed macintoshpie closed 2 years ago

macintoshpie commented 4 years ago

Expected behavior

Every rule context has a .start and .stop attribute which contains a token

Actual behavior

Some rules are missing one or both of these tokens

Steps to reproduce

  1. clone this example repo https://github.com/macintoshpie/speedy_antlr_modelica_tokens
  2. Run python -m pip install .
  3. Run python example.py This will print out the start and stop token at every rule context using both parser implementations. Notice that for the python implementation there are not tokens that are None

Thanks for creating this package! Hopefully I'm not doing something obviously wrong, I'm a total antlr and C++ beginner.

amykyta3 commented 4 years ago

I ran through your example and added some debug statements to figure out whats going on. In the nodes where start/stop is missing, the parse tree shows that these nodes do not contain any children, or contain s node with no children.

I'm marking this as a bug, since I definitely don't carry over the start/stop attributes properly if a grammar rule context matched on nothing.

That said, I'd recommend trying to clean up the grammar you're using from grammars-v4. Looking at that grammar definition, one thing that stood out is that it contains several rules that will actually match on no tokens (namely the comment rules). This isn't really good practice since it will generate a bunch of meaningless empty parse tree nodes (and hurt performance on the way). I suspect at some point, a different author tacked on these comment nodes, and made kind of a mess of it.

For example, this sequence of rules:

import_clause
   : 'import' (IDENT '=' name | name '.*' | name '.{' import_list '}' | name ) comment
   ;

comment
   : string_comment (annotation)?
   ;

string_comment
   : (STRING ('+' STRING)*)?
   ;

Defines a grammar where the comment and string_comment rules will match regardless if they consume tokens or not. If the intent was to make comments optional, this would have been more appropriate:

import_clause
   : 'import' (IDENT '=' name | name '.*' | name '.{' import_list '}' | name ) comment? annotation?
   ;

comment
   : STRING ('+' STRING)*
   ;

I totally might have missed something since I'm not familiar with the modelica language, but I recommend you take a closer look at the grammar. You might find that you can make some significant improvements if you fix that one up, or even write your own based on the original language spec https://www.modelica.org/documents/ModelicaSpec34.pdf (see page 265)

macintoshpie commented 4 years ago

Great, thank you for the feedback and suggestions! I'll look into tweaking the grammar