Open Himanshu-portfolio opened 3 months ago
def test ()
print hello
def another ()
print hello 2
What is the input? The above input text is illegal Python 2 source code. (Try it in https://onecompiler.com/python2/42j94gnxp.) `def test ()` does not end in a colon, and `print hello` is not indented within the definition of `test()`. Further, we cannot tell whether you are using `\n`, `\r\n`, or `\n\r` newline character sequences. It's only possible to know which if you attach a .txt file. In lieu of that, please edit the above comment with the input nested in a triple-backtick quoted block. See https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code.
Thank you kaby76 for suggesting to put the input in a triple-backtick quoted block.
Adding on to Himanshu's issue: in the code snippet below, the function test() starts at line 1 and ends at line 3 with the print statement, but the ANTLR Python 2.7.18 grammar reports the end line of test() as line 5, which is the start of the next function, greet().
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
The DEDENT token is placed on line 5 because that is where it is detected.
Also try Python's tokenizer:

python -m tokenize test.py -e

It also places the DEDENT token on line 5.
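The same check can be scripted against CPython's tokenize module (a sketch; tokenize only tokenizes, it does not parse, so the Python 2 print statements do not trip it up):

```python
import io
import tokenize

# The snippet from above, as a string.
src = ("def test():\n"
       "    xxx=1\n"
       "    print xxx\n"
       "\n"
       "def greet():\n"
       "  print 'Hello World'\n"
       "\n"
       "greet();\n")

toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# CPython also reports the first DEDENT on line 5, where 'def greet' begins.
first_dedent = next(t for t in toks if t.type == tokenize.DEDENT)
print(first_dedent.start)  # (5, 0)
```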
I agree, I'm not sure what the problem is here.
Input:
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
Or in file: xxx.txt.
The parse tree is:
( file_input
( stmt
( compound_stmt
( funcdef
( DEF
( text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'test' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( parameters
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( COLON
( text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( suite
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( INDENT
( text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( stmt
( simple_stmt
( small_stmt
( expr_stmt
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NAME
( text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( EQUAL
( text:'=' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NUMBER
( text:'1' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( stmt
( simple_stmt
( small_stmt
( print_stmt
( Attribute WS Value ' ' chnl:HIDDEN
)
( PRINT
( text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
)
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( DEDENT
( text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) )
( stmt
( compound_stmt
( funcdef
( DEF
( text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( parameters
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( COLON
( text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( suite
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( INDENT
( text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( stmt
( simple_stmt
( small_stmt
( print_stmt
( PRINT
( text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( Attribute WS Value ' ' chnl:HIDDEN
)
( STRING
( text:''Hello World'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
)
( Attribute WS Value ' ' chnl:HIDDEN
)
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( DEDENT
( text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) )
( stmt
( simple_stmt
( small_stmt
( expr_stmt
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NAME
( text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( trailer
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( SEMI
( text:';' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( NEWLINE
( text:'<NEWLINE>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( EOF
( text:'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
The tokens are:
[@0,0:2='def',<9>,1:0]
[@1,3:3=' ',<84>,channel=1,1:3]
[@2,4:7='test',<79>,1:4]
[@3,8:8='(',<34>,1:8]
[@4,9:9=')',<37>,1:9]
[@5,10:10=':',<40>,1:10]
[@6,11:12='\r\n',<82>,1:11]
[@7,13:16=' ',<84>,channel=1,2:0]
[@8,17:16='<INDENT>',<1>,2:4]
[@9,17:19='xxx',<79>,2:4]
[@10,20:20='=',<51>,2:7]
[@11,21:21='1',<80>,2:8]
[@12,22:23='\r\n',<82>,2:9]
[@13,24:27=' ',<84>,channel=1,3:0]
[@14,28:32='print',<27>,3:4]
[@15,33:33=' ',<84>,channel=1,3:9]
[@16,34:36='xxx',<79>,3:10]
[@17,37:38='\r\n',<82>,channel=1,3:13]
[@18,39:40='\r\n',<82>,4:0]
[@19,41:40='<DEDENT>',<2>,5:0]
[@20,41:43='def',<9>,5:0]
[@21,44:44=' ',<84>,channel=1,5:3]
[@22,45:49='greet',<79>,5:4]
[@23,50:50='(',<34>,5:9]
[@24,51:51=')',<37>,5:10]
[@25,52:52=':',<40>,5:11]
[@26,53:54='\r\n',<82>,5:12]
[@27,55:56=' ',<84>,channel=1,6:0]
[@28,57:56='<INDENT>',<1>,6:2]
[@29,57:61='print',<27>,6:2]
[@30,62:62=' ',<84>,channel=1,6:7]
[@31,63:75=''Hello World'',<81>,6:8]
[@32,76:77='\r\n',<82>,channel=1,6:21]
[@33,78:79=' ',<84>,channel=1,7:0]
[@34,80:81='\r\n',<82>,7:2]
[@35,82:81='<DEDENT>',<2>,8:0]
[@36,82:86='greet',<79>,8:0]
[@37,87:87='(',<34>,8:5]
[@38,88:88=')',<37>,8:6]
[@39,89:89=';',<42>,8:7]
[@40,90:89='<NEWLINE>',<82>,8:8]
[@41,90:89='<EOF>',<-1>,8:8]
According to the official Python 2 grammar, https://docs.python.org/2.7/reference/grammar.html, a funcdef is `funcdef: 'def' NAME parameters ':' suite`. It extends from the first character 'd' of `def` all the way to the last character of DEDENT, since `suite` is defined as `suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT`.
If you want the interval for the statements within function test(), then you have to get the last char of the 2nd `stmt`. The query below says there are two statements in function test():
$ trparse xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt' | trtext -c
CSharp 0 xxx.txt success 0.0428541
2
07/05-12:35:54 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[1]' | trcaret
CSharp 0 xxx.txt success 0.0425021
L2: xxx=1
^
07/05-12:36:00 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[2]' | trcaret
CSharp 0 xxx.txt success 0.0426761
L3: print xxx
^
The only thing that would be nice to change is the text for the INDENT and DEDENT tokens, which is `<INDENT>` and `<DEDENT>` respectively. That text is inconsistent with the computed length of the token, which is end index - start index + 1 = 0. For the first INDENT token, `[@8,17:16='<INDENT>',<1>,2:4]`, the start index is 17 and the end index is 16, so the computed length is 16 - 17 + 1 = 0, yet the text `<INDENT>` has length 8.
The "problem" is on the trtext side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree, so I see `<INDENT>` and `<DEDENT>` sprinkled in the reconstructed text. I can easily remove these from the tree using `trquery delete`.
Thanks for bringing this to my attention.
I really forgot about that.
In other words, the token stream must ensure that the original source code can be restored, and this is not possible with `<INDENT>` and `<DEDENT>` token text. I will fix it in all PythonLexerBase ports.
Thanks kaby76 and RobEin for checking on this issue. Waiting for your update on whether it is fixed in PythonLexerBase for Java.
On second thought, no repair is needed after all. The rule is very simple to restore the original source code by the token stream. You just have to take out the INDENT and DEDENT tokens. Python's tokenizer works differently. The INDENT and DEDENT tokens must be inserted there to restore the original code. I'm still wondering if there's any advantage to this, but probably not.
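CPython's tokenize module illustrates the difference (a sketch): its INDENT token carries the indentation whitespace as its text, DEDENT carries empty text, and `tokenize.untokenize` uses the token texts to restore the source.

```python
import io
import tokenize

src = "def f():\n    x = 1\n    return x\n"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# Python's INDENT token stores the indentation itself as its text...
indent = next(t for t in toks if t.type == tokenize.INDENT)
print(repr(indent.string))  # '    '

# ...while DEDENT tokens have empty text.
dedent = next(t for t in toks if t.type == tokenize.DEDENT)
print(repr(dedent.string))  # ''

# With the full token tuples, the round trip restores the source exactly.
print(tokenize.untokenize(toks) == src)  # True
```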
The rule is very simple to restore the original source code by the token stream[:] You just have to take out the INDENT and DEDENT tokens. ... The INDENT and DEDENT tokens must be inserted there to restore the original code.
I don't understand. These two statements are inconsistent. The first statement says that the INDENT and DEDENT tokens need to be deleted from the parse tree in order to reconstruct the source. The second statement says that they cannot be deleted because they are essential to reconstruct the source.
Currently, I have to delete the INDENT and DEDENT tokens to reconstruct the text, because if I don't, I get `<INDENT>` and `<DEDENT>` strings sprinkled in the reconstructed text, e.g., this:
07/07-08:05:20 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trtext
CSharp 0 ../examples/atexit.py success 0.0601923
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.
One public function, register, is defined.
"""
__all__ = ["register"]
import sys
_exithandlers = []
def _run_exitfuncs():
<INDENT>"""run any registered exit functions
_exithandlers is traversed in reverse order so functions are executed
last in, first out.
"""
exc_info = None
while _exithandlers:
<INDENT>func, targs, kargs = _exithandlers.pop()
try:
<INDENT>func(*targs, **kargs)
<DEDENT>except SystemExit:
<INDENT>exc_info = sys.exc_info()
<DEDENT>except:
<INDENT>import traceback
print >> sys.stderr, "Error in atexit._run_exitfuncs:"
traceback.print_exc()
exc_info = sys.exc_info()
<DEDENT><DEDENT>if exc_info is not None:
<INDENT>raise exc_info[0], exc_info[1], exc_info[2]
<DEDENT><DEDENT>def register(func, *targs, **kargs):
<INDENT>"""register a function to be executed upon normal program termination
func - function to be called at exit
targs - optional arguments to pass to func
kargs - optional keyword arguments to pass to func
func is returned to facilitate usage as a decorator.
"""
_exithandlers.append((func, targs, kargs))
return func
<DEDENT>if hasattr(sys, "exitfunc"):
# Assume it's another registered exit function - append it to our list
<INDENT>register(sys.exitfunc)
<DEDENT>sys.exitfunc = _run_exitfuncs
if __name__ == "__main__":
<INDENT>def x1():
<INDENT>print "running x1"
<DEDENT>def x2(n):
<INDENT>print "running x2(%r)" % (n,)
<DEDENT>def x3(n, kwd=None):
<INDENT>print "running x3(%r, kwd=%r)" % (n, kwd)
<DEDENT>register(x1)
register(x2, 12)
register(x3, 5, "bar")
register(x3, "no kwd args")
<DEDENT>
07/07-08:05:40 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$
Text reconstruction in Trash follows the basic concept that has existed in CS since the 1960s: the input text is simply the concatenation of the text of the frontier of the parse tree. The text for the INDENT and DEDENT tokens is `<INDENT>` and `<DEDENT>`. This is why I need to either erase the text (which I currently cannot do with Trash) or delete the tokens from the parse tree, e.g.:
07/07-07:59:04 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse !$ | trquery 'delete //(DEDENT | INDENT)' | trtext
trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext
CSharp 0 ../examples/atexit.py success 0.0612294
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.
One public function, register, is defined.
"""
__all__ = ["register"]
import sys
_exithandlers = []
def _run_exitfuncs():
"""run any registered exit functions
_exithandlers is traversed in reverse order so functions are executed
last in, first out.
"""
exc_info = None
while _exithandlers:
func, targs, kargs = _exithandlers.pop()
try:
func(*targs, **kargs)
except SystemExit:
exc_info = sys.exc_info()
except:
import traceback
print >> sys.stderr, "Error in atexit._run_exitfuncs:"
traceback.print_exc()
exc_info = sys.exc_info()
if exc_info is not None:
raise exc_info[0], exc_info[1], exc_info[2]
def register(func, *targs, **kargs):
"""register a function to be executed upon normal program termination
func - function to be called at exit
targs - optional arguments to pass to func
kargs - optional keyword arguments to pass to func
func is returned to facilitate usage as a decorator.
"""
_exithandlers.append((func, targs, kargs))
return func
if hasattr(sys, "exitfunc"):
# Assume it's another registered exit function - append it to our list
register(sys.exitfunc)
sys.exitfunc = _run_exitfuncs
if __name__ == "__main__":
def x1():
print "running x1"
def x2(n):
print "running x2(%r)" % (n,)
def x3(n, kwd=None):
print "running x3(%r, kwd=%r)" % (n, kwd)
register(x1)
register(x2, 12)
register(x3, 5, "bar")
register(x3, "no kwd args")
07/07-07:59:38 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext > save
CSharp 0 ../examples/atexit.py success 0.0600218
07/07-07:59:48 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ diff save ../examples/atexit.py
66d65
<
07/07-07:59:57 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
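As an aside, the frontier-concatenation idea itself is easy to state in code. The following is a toy sketch (not Trash internals), with a hand-built tree: interior nodes pair a rule name with children, and leaves are just token text.

```python
# Toy parse tree: a node is (rule_name, [children]); a leaf is its token text.
tree = ("funcdef", [
    "def", " ", "test",
    ("parameters", ["(", ")"]),
    ":", "\n",
])

def frontier(node):
    """Concatenate the text of the leaves, left to right."""
    if isinstance(node, str):
        return node
    _, children = node
    return "".join(frontier(c) for c in children)

# If a leaf carried the placeholder text '<INDENT>', it would appear verbatim
# in the reconstruction -- exactly the artifact shown in the transcript above.
print(frontier(tree))
```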
NB: trtext outputs an extra newline character because it calls `Console.WriteLine()` instead of `Console.Write()`. It has to do this because dotnet programs don't work perfectly with a Cygwin/MSYS shell. Instead, one should use trsponge to perform the reconstruction and output.
The second statement says that they cannot be deleted because they are essential to reconstruct the source.
The second statement was about the original Python tokenizer.
... Trash follows the basic concept that existed in CS since the 1960's: the input text is simply the concatenation of the text of the frontier of the parse tree ...
Now I understand what the problem is. I didn't know this recommendation.
I can imagine two alternatives in this case:
Solution 1: The text of the INDENT/DEDENT tokens would contain the indentation similar to Python's tokenizer. Currently, the indentation text is stored in the WS tokens before the INDENT/DEDENT tokens. This is problematic because it may cause compatibility problems with older applications that use the PythonLexerBase class.
Solution 2: The INDENT/DEDENT tokens would store an empty string. This is simpler and less likely to cause compatibility issues. Currently, the text property of INDENT tokens is consistently `"<INDENT>"`, and similarly that of DEDENT tokens is `"<DEDENT>"`. If these were empty strings, then restoring the original source code would only require concatenating the text properties of the tokens. This would be similar to deleting the INDENT/DEDENT tokens. I recommend the second solution.
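A toy sketch of solution 2 (hypothetical token tuples, not the actual PythonLexerBase types): if the inserted tokens carry empty text, plain concatenation of the token texts restores the source.

```python
# (type, text) pairs for: "def f():\n    x=1\n",
# with INDENT/DEDENT/EOF inserted carrying empty text.
tokens = [
    ("DEF", "def"), ("WS", " "), ("NAME", "f"), ("LPAR", "("), ("RPAR", ")"),
    ("COLON", ":"), ("NEWLINE", "\n"),
    ("WS", "    "), ("INDENT", ""),   # inserted token, empty text
    ("NAME", "x"), ("EQUAL", "="), ("NUMBER", "1"), ("NEWLINE", "\n"),
    ("DEDENT", ""),                   # inserted token, empty text
    ("EOF", ""),                      # would also need empty text
]

# Restoring the source is now a simple concatenation.
restored = "".join(text for _, text in tokens)
print(restored)
```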
I didn't understand. Can you explain what has to be changed? Do I need to change any grammar files?
We are trying to parse a Python 2.x file using Java. When I tried to print `FuncdefContext.suite.getText()` of the test() function for this example,
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
Output:
<INDENT>xxx=1
printxxx
<DEDENT>
and the end line for this test() function is 5.
Can you tell me what should be done here to get the correct end line?
`tree.getText()` doesn't reconstruct the text of the input. It never does, for virtually every Antlr grammar! This is because Antlr parse trees don't contain all the tokens of the input, such as comments and whitespace, nor do they contain strings that are "skipped". Grammars that define lexer rules with `-> skip` or `-> channel(HIDDEN)` cause input strings to not be tokenized, or to be tokenized with the channel property set to 1; the leaves in the parse tree don't contain these tokens. For python2_7_18, the DEDENT and INDENT tokens contain the strings `<DEDENT>` and `<INDENT>` as text, and these tokens are part of the Antlr parse tree. This is why you see `tree.getText()` contain strings for the DEDENT and INDENT tokens. The "approved" way to get the text from an Antlr parse tree is to query the input char stream directly, using the parse tree to get the bounds of the indices of the text. See https://stackoverflow.com/a/55852474/4779853 or https://github.com/antlr/antlr4/issues/1302
Trash doesn't represent the parse tree like Antlr does. It incorporates the entire input, including whitespace and comments. It's done this way so that it's fully serializable, with no loss of text, and fully editable. The way Antlr splits the parse tree from the token stream and char stream is unnatural, and difficult/slow to serialize and edit.
Hi, thanks for your response. I understand that you have suggested how to get the text from the ANTLR parse tree. Our use case is to parse an input Python file and identify the start line and end line for each class, function, statement, comment, etc. in the file, and while doing so we are facing an issue fetching the end line from the function and statement contexts (for and while loops, ...).
The easiest solution would be to just delete the INDENT and DEDENT leaves, then just get the Interval for the sub-tree. But, the Antlr runtime doesn't have tree editing.
Instead, do this:
1) Get the Interval of the node for the funcdef or stmt. The Interval is the start and end indices of the tokens for that sub-tree (i.e., not the start and end of the character buffer).
2) Write a loop that starts at the ending token index. Working backwards, skip all INDENT and DEDENT tokens until you find something that is not an INDENT or DEDENT. Do not back up further than the starting token index. We now have the end token index of the funcdef or stmt.
3) Get the end token from its end token index.
4) Get the end character index from the end token.
5) Write a loop that starts at the end character index and scans the character buffer backwards. Stop looping when you find a character that is not a newline; this is the character index of the last non-newline for the funcdef or stmt.
6) You can now return 1 + the character index of the last non-newline for the funcdef or stmt.
In C#:
var funcdefs = new Antlr4.Runtime.Tree.Xpath.XPath(parser, "//funcdef").Evaluate(tree);
var funcdef = funcdefs.FirstOrDefault();
var token_interval = funcdef.SourceInterval;
int end_token_index = token_interval.b;
for (; end_token_index >= token_interval.a; --end_token_index)
{
    if (tokens.Get(end_token_index).Type != PythonParser.INDENT
        && tokens.Get(end_token_index).Type != PythonParser.DEDENT
        && tokens.Get(end_token_index).Type != PythonParser.WS
        && tokens.Get(end_token_index).Type != PythonParser.NEWLINE
        && tokens.Get(end_token_index).Channel == 0)
    {
        break;
    }
}
var start_token = tokens.Get(token_interval.a);
var end_token = tokens.Get(end_token_index);
var start_char_index = start_token.StartIndex;
var end_char_index = end_token.StopIndex;
System.Console.WriteLine("funcdef text:");
System.Console.WriteLine(str.GetText(new Interval(start_char_index, end_char_index)));
[D]oes ANTLR python 2.7.18 grammars support python 2.6 version too?
I would think so, but don't quote me.
Hi, thanks for your response. We will check the suggestion you have provided, as we have built in Java. Also, in our case we are using a custom listener class to identify the end lines for each class, function, statement, etc. by overriding the base listener enter and exit methods.
For Example:
@Override
public void enterFuncdef(FuncdefContext ctx) {
    int start = ctx.getStart().getLine();
    int stop = ctx.getStop().getLine();
}
Would you like to suggest if we can handle the endlines correctly here?
[W]e are using custom listener class to identify the endlines for each class, function, statement, etc. by overriding the base listener enter and exit methods.
For Example:
@Override public void enterFuncdef(FuncdefContext ctx) { int start = ctx.getStart().getLine(); int stop = ctx.getStop().getLine(); }
Would you like to suggest if we can handle the endlines correctly here?
Not quite. Try this.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.*;

public class MyListener extends PythonParserBaseListener {
    CommonTokenStream tokens_;
    CharStream str_;

    public MyListener(CommonTokenStream tokens, CharStream str)
    {
        tokens_ = tokens;
        str_ = str;
    }

    @Override public void enterFuncdef(PythonParser.FuncdefContext ctx)
    {
        var start = ctx.getStart().getLine();
        var token_interval = ctx.getSourceInterval();
        var end_token_index = token_interval.b;
        var tokens = this.tokens_;
        var str = this.str_;
        for (; end_token_index >= token_interval.a; --end_token_index)
        {
            if (tokens.get(end_token_index).getType() != PythonParser.INDENT
                && tokens.get(end_token_index).getType() != PythonParser.DEDENT
                && tokens.get(end_token_index).getType() != PythonParser.WS
                && tokens.get(end_token_index).getType() != PythonParser.NEWLINE
                && tokens.get(end_token_index).getChannel() == 0)
            {
                break;
            }
        }
        var start_token = tokens.get(token_interval.a);
        var end_token = tokens.get(end_token_index);
        var start_char_index = start_token.getStartIndex();
        var end_char_index = end_token.getStopIndex();
        var stop_line_number = end_token.getLine();
        System.out.println("stop = " + stop_line_number);
        System.out.println("funcdef text:");
        System.out.println(str.getText(new Interval(start_char_index, end_char_index)));
    }
}
The "problem" is on the trtext-side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree. So, I see `<INDENT>` and `<DEDENT>` sprinkled in the reconstructed text.
It looks like the text property of the inserted tokens is still necessary. In the left tree view, the text property of the inserted INDENT, DEDENT and NEWLINE tokens is as follows:
<INDENT>
<DEDENT>
<NEWLINE>
In the tree view on the right, there are empty strings for the inserted tokens, so that the original source code can be reconstructed with a simple concatenation. However, because of this, the tree view on the right becomes unreadable.
Furthermore, the text property of the EOF token generated by ANTLR is also not an empty string, but `<EOF>`. To recover the original source code with a simple concatenation, this should also be an empty string.
Python example (there is no newline after the `continue` statement):

if True:
    continue

To show the tree view:

grun Python file_input -gui example.py
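For comparison, recent CPython versions handle both cases with empty-text tokens: a NEWLINE token with empty text is inserted when the input lacks a trailing newline, and the end-of-input token ENDMARKER (the analogue of ANTLR's EOF) also has empty text. A sketch:

```python
import io
import tokenize

# No trailing newline, as in the example above. (tokenize does not care
# that 'continue' appears outside a loop -- it only tokenizes.)
src = "if True:\n    continue"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# The tokenizer inserts a NEWLINE token with empty text at end of input...
inserted_nl = [t for t in toks if t.type == tokenize.NEWLINE and t.string == ""]
print(len(inserted_nl))  # 1

# ...and the end-of-input token has empty text as well.
print(toks[-1].type == tokenize.ENDMARKER, repr(toks[-1].string))
```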
Furthermore, the text property of the EOF token generated by ANTLR is also not an empty string, but <EOF>
It might be a good idea to add to the readme.md comments on text recovery for the Python grammars. E.g., for Trash: `trparse x.txt | trquery delete ' //(INDENT | DEDENT | NEWLINE[text()="<NEWLINE>"])' | trsponge`.
It might be a good idea to add to the readme.md comments on text recovery for the python grammars.
Good idea, I support it.
However, wouldn't such a function belong in the CommonTokenStream class? E.g., with a method name like `GetRecoveredInputText()`, which could be overridden in special cases in an inherited class; in our case, e.g., with a class named `PythonCommonTokenStream`.
Also, I'm thinking that inserted tokens (including EOF) could get a separate token channel during tokenization, e.g. `INSERTED_CHANNEL`. Thus, recovering the original source code would be just a simple concatenation, omitting the tokens on the `INSERTED_CHANNEL`.
I don't know if Trash can filter by channel instead of deleting.
I don't know if the Trash can filter by channel instead of delete.
I don't think so at the moment. Token text is addressed using the function `text()`; e.g., `ID/text()` would be the name of an ID. Off-channel tokens are addressed using an `@` sign, e.g., `@COMMENT`. You can even say `@COMMENT/text()` and make queries about comments containing certain text. But I don't think I defined functions for channel, mainly because the engine is XPath 2.
ANTLR 4 does not recognize the end lines correctly. Below is the link to the grammar we used, with an example.
Link: https://github.com/antlr/grammars-v4/tree/master/python/python2_7_18
The start line for test is 1 and the expected end line is 3, but the reported end line is the start line of the next function, which is 4.