Open Himanshu-portfolio opened 3 months ago
def test ()
print hello
def another ()
print hello 2
What is the input? The above input text is illegal Python 2 source code. (Try it in https://onecompiler.com/python2/42j94gnxp.) `def test ()` does not end in a colon, and `print hello` is not indented within the definition of `test()`. Further, we cannot tell whether you are using `\n`, `\r\n`, or `\n\r` newline character sequences. It's only possible to know which if you attach a .txt file. In lieu of that, please edit the above comment with the input nested in a triple-backtick quoted block. See https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#quoting-code.
Thank you kaby76 for suggesting to put the input in a triple-backtick quoted block.
Adding on to Himanshu's issue: in the code snippet below, the function test() starts at line 1 and ends at line 3 with the print statement, but the ANTLR Python 2.7.18 grammar reports the end line of test() as line 5, which is the start of the next function, greet().
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
The DEDENT token is placed on line 5 because that is where it is detected.
Also try Python's tokenizer:

python -m tokenize test.py -e

It also places the DEDENT token on line 5.
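The same check can be scripted against CPython's tokenize module (a sketch; tokenize only tokenizes, it does not parse, so the Python 2 print statements do not trip it up):

```python
import io
import tokenize

# The snippet from above, as a string.
src = ("def test():\n"
       "    xxx=1\n"
       "    print xxx\n"
       "\n"
       "def greet():\n"
       "  print 'Hello World'\n"
       "\n"
       "greet();\n")

toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# CPython also reports the first DEDENT on line 5, where 'def greet' begins.
first_dedent = next(t for t in toks if t.type == tokenize.DEDENT)
print(first_dedent.start)  # (5, 0)
```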
I agree, I'm not sure what the problem is here.
Input:
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
Or in file: xxx.txt.
The parse tree is:
( file_input
( stmt
( compound_stmt
( funcdef
( DEF
( text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'test' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( parameters
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( COLON
( text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( suite
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( INDENT
( text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( stmt
( simple_stmt
( small_stmt
( expr_stmt
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NAME
( text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( EQUAL
( text:'=' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NUMBER
( text:'1' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( stmt
( simple_stmt
( small_stmt
( print_stmt
( Attribute WS Value ' ' chnl:HIDDEN
)
( PRINT
( text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'xxx' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
)
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( DEDENT
( text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) )
( stmt
( compound_stmt
( funcdef
( DEF
( text:'def' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( NAME
( text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( parameters
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( COLON
( text:':' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( suite
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( Attribute WS Value ' ' chnl:HIDDEN
)
( INDENT
( text:'<INDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( stmt
( simple_stmt
( small_stmt
( print_stmt
( PRINT
( text:'print' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( Attribute WS Value ' ' chnl:HIDDEN
)
( STRING
( text:''Hello World'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( Attribute NEWLINE Value '\r\n' chnl:HIDDEN
)
( Attribute WS Value ' ' chnl:HIDDEN
)
( NEWLINE
( text:'\r\n' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( DEDENT
( text:'<DEDENT>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) )
( stmt
( simple_stmt
( small_stmt
( expr_stmt
( testlist
( test
( or_test
( and_test
( not_test
( comparison
( expr
( xor_expr
( and_expr
( shift_expr
( arith_expr
( term
( factor
( power
( atom
( NAME
( text:'greet' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
( trailer
( LPAR
( text:'(' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( RPAR
( text:')' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )
( SEMI
( text:';' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) )
( NEWLINE
( text:'<NEWLINE>' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) ) )
( EOF
( text:'' tt:0 chnl:DEFAULT_TOKEN_CHANNEL
) ) )
The tokens are:
[@0,0:2='def',<9>,1:0]
[@1,3:3=' ',<84>,channel=1,1:3]
[@2,4:7='test',<79>,1:4]
[@3,8:8='(',<34>,1:8]
[@4,9:9=')',<37>,1:9]
[@5,10:10=':',<40>,1:10]
[@6,11:12='\r\n',<82>,1:11]
[@7,13:16=' ',<84>,channel=1,2:0]
[@8,17:16='<INDENT>',<1>,2:4]
[@9,17:19='xxx',<79>,2:4]
[@10,20:20='=',<51>,2:7]
[@11,21:21='1',<80>,2:8]
[@12,22:23='\r\n',<82>,2:9]
[@13,24:27=' ',<84>,channel=1,3:0]
[@14,28:32='print',<27>,3:4]
[@15,33:33=' ',<84>,channel=1,3:9]
[@16,34:36='xxx',<79>,3:10]
[@17,37:38='\r\n',<82>,channel=1,3:13]
[@18,39:40='\r\n',<82>,4:0]
[@19,41:40='<DEDENT>',<2>,5:0]
[@20,41:43='def',<9>,5:0]
[@21,44:44=' ',<84>,channel=1,5:3]
[@22,45:49='greet',<79>,5:4]
[@23,50:50='(',<34>,5:9]
[@24,51:51=')',<37>,5:10]
[@25,52:52=':',<40>,5:11]
[@26,53:54='\r\n',<82>,5:12]
[@27,55:56=' ',<84>,channel=1,6:0]
[@28,57:56='<INDENT>',<1>,6:2]
[@29,57:61='print',<27>,6:2]
[@30,62:62=' ',<84>,channel=1,6:7]
[@31,63:75=''Hello World'',<81>,6:8]
[@32,76:77='\r\n',<82>,channel=1,6:21]
[@33,78:79=' ',<84>,channel=1,7:0]
[@34,80:81='\r\n',<82>,7:2]
[@35,82:81='<DEDENT>',<2>,8:0]
[@36,82:86='greet',<79>,8:0]
[@37,87:87='(',<34>,8:5]
[@38,88:88=')',<37>,8:6]
[@39,89:89=';',<42>,8:7]
[@40,90:89='<NEWLINE>',<82>,8:8]
[@41,90:89='<EOF>',<-1>,8:8]
According to the official Python 2 grammar, https://docs.python.org/2.7/reference/grammar.html, a funcdef is `funcdef: 'def' NAME parameters ':' suite`. It extends from the first character 'd' of `def` all the way to the last character of DEDENT, since `suite` is defined as `suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT`.
If you want the interval for the statements within function test(), then you have to get the last char of the 2nd `stmt`. The query below says there are two statements in function test():
$ trparse xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt' | trtext -c
CSharp 0 xxx.txt success 0.0428541
2
07/05-12:35:54 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[1]' | trcaret
CSharp 0 xxx.txt success 0.0425021
L2: xxx=1
^
07/05-12:36:00 ~/issues/g4-new-csharp/python/python2_7_18/Generated-CSharp-0
$ trparse -l xxx.txt | trquery grep ' //stmt/compound_stmt/funcdef[NAME/text() = "test"]/suite/stmt[2]' | trcaret
CSharp 0 xxx.txt success 0.0426761
L3: print xxx
^
The only thing that would be nice to change is the text for the INDENT and DEDENT tokens, which is `<INDENT>` and `<DEDENT>` respectively. That text is inconsistent with the computed length of the token, which is end index - start index + 1 = 0. For the first INDENT token, `[@8,17:16='<INDENT>',<1>,2:4]`, the start index is 17 and the end index is 16, so the computed length is 16 - 17 + 1 = 0, yet the text `<INDENT>` has length 8.
The "problem" is on the trtext side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree, so I see `<INDENT>` and `<DEDENT>` sprinkled in the reconstructed text. I can easily remove these from the tree using `trquery delete`.
Thanks for bringing this to my attention.
I really forgot about that.
In other words, the token stream must ensure that the original source code can be restored, and this is not possible with `<INDENT>` and `<DEDENT>` token text. I will fix it in all PythonLexerBase ports.
Thanks kaby76 and RobEin for checking on this issue. Waiting for your update on whether it is fixed in PythonLexerBase for Java.
On second thought, no repair is needed after all. The rule is very simple to restore the original source code by the token stream. You just have to take out the INDENT and DEDENT tokens. Python's tokenizer works differently. The INDENT and DEDENT tokens must be inserted there to restore the original code. I'm still wondering if there's any advantage to this, but probably not.
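CPython's tokenize module illustrates the difference (a sketch): its INDENT token carries the indentation whitespace as its text, DEDENT carries empty text, and `tokenize.untokenize` uses the token texts to restore the source.

```python
import io
import tokenize

src = "def f():\n    x = 1\n    return x\n"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# Python's INDENT token stores the indentation itself as its text...
indent = next(t for t in toks if t.type == tokenize.INDENT)
print(repr(indent.string))  # '    '

# ...while DEDENT tokens have empty text.
dedent = next(t for t in toks if t.type == tokenize.DEDENT)
print(repr(dedent.string))  # ''

# With the full token tuples, the round trip restores the source exactly.
print(tokenize.untokenize(toks) == src)  # True
```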
The rule is very simple to restore the original source code by the token stream[:] You just have to take out the INDENT and DEDENT tokens. ... The INDENT and DEDENT tokens must be inserted there to restore the original code.
I don't understand. These two statements are inconsistent. The first statement says that the INDENT and DEDENT tokens need to be deleted from the parse tree in order to reconstruct the source. The second statement says that they cannot be deleted because they are essential to reconstruct the source.
Currently, I have to delete the INDENT and DEDENT tokens to reconstruct the text, because if I don't, I get `<INDENT>` and `<DEDENT>` strings sprinkled in the reconstructed text, e.g., this:
07/07-08:05:20 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trtext
CSharp 0 ../examples/atexit.py success 0.0601923
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.
One public function, register, is defined.
"""
__all__ = ["register"]
import sys
_exithandlers = []
def _run_exitfuncs():
<INDENT>"""run any registered exit functions
_exithandlers is traversed in reverse order so functions are executed
last in, first out.
"""
exc_info = None
while _exithandlers:
<INDENT>func, targs, kargs = _exithandlers.pop()
try:
<INDENT>func(*targs, **kargs)
<DEDENT>except SystemExit:
<INDENT>exc_info = sys.exc_info()
<DEDENT>except:
<INDENT>import traceback
print >> sys.stderr, "Error in atexit._run_exitfuncs:"
traceback.print_exc()
exc_info = sys.exc_info()
<DEDENT><DEDENT>if exc_info is not None:
<INDENT>raise exc_info[0], exc_info[1], exc_info[2]
<DEDENT><DEDENT>def register(func, *targs, **kargs):
<INDENT>"""register a function to be executed upon normal program termination
func - function to be called at exit
targs - optional arguments to pass to func
kargs - optional keyword arguments to pass to func
func is returned to facilitate usage as a decorator.
"""
_exithandlers.append((func, targs, kargs))
return func
<DEDENT>if hasattr(sys, "exitfunc"):
# Assume it's another registered exit function - append it to our list
<INDENT>register(sys.exitfunc)
<DEDENT>sys.exitfunc = _run_exitfuncs
if __name__ == "__main__":
<INDENT>def x1():
<INDENT>print "running x1"
<DEDENT>def x2(n):
<INDENT>print "running x2(%r)" % (n,)
<DEDENT>def x3(n, kwd=None):
<INDENT>print "running x3(%r, kwd=%r)" % (n, kwd)
<DEDENT>register(x1)
register(x2, 12)
register(x3, 5, "bar")
register(x3, "no kwd args")
<DEDENT>
07/07-08:05:40 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$
Text reconstruction in Trash follows the basic concept that has existed in CS since the 1960s: the input text is simply the concatenation of the text of the frontier of the parse tree. The text for the INDENT and DEDENT tokens is `<INDENT>` and `<DEDENT>`. This is why I need to either erase the text (which I currently cannot do with Trash) or delete the tokens from the parse tree, e.g.:
07/07-07:59:04 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse !$ | trquery 'delete //(DEDENT | INDENT)' | trtext
trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext
CSharp 0 ../examples/atexit.py success 0.0612294
"""
atexit.py - allow programmer to define multiple exit functions to be executed
upon normal program termination.
One public function, register, is defined.
"""
__all__ = ["register"]
import sys
_exithandlers = []
def _run_exitfuncs():
"""run any registered exit functions
_exithandlers is traversed in reverse order so functions are executed
last in, first out.
"""
exc_info = None
while _exithandlers:
func, targs, kargs = _exithandlers.pop()
try:
func(*targs, **kargs)
except SystemExit:
exc_info = sys.exc_info()
except:
import traceback
print >> sys.stderr, "Error in atexit._run_exitfuncs:"
traceback.print_exc()
exc_info = sys.exc_info()
if exc_info is not None:
raise exc_info[0], exc_info[1], exc_info[2]
def register(func, *targs, **kargs):
"""register a function to be executed upon normal program termination
func - function to be called at exit
targs - optional arguments to pass to func
kargs - optional keyword arguments to pass to func
func is returned to facilitate usage as a decorator.
"""
_exithandlers.append((func, targs, kargs))
return func
if hasattr(sys, "exitfunc"):
# Assume it's another registered exit function - append it to our list
register(sys.exitfunc)
sys.exitfunc = _run_exitfuncs
if __name__ == "__main__":
def x1():
print "running x1"
def x2(n):
print "running x2(%r)" % (n,)
def x3(n, kwd=None):
print "running x3(%r, kwd=%r)" % (n, kwd)
register(x1)
register(x2, 12)
register(x3, 5, "bar")
register(x3, "no kwd args")
07/07-07:59:38 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ trparse ../examples/atexit.py | trquery 'delete //(DEDENT | INDENT)' | trtext > save
CSharp 0 ../examples/atexit.py success 0.0600218
07/07-07:59:48 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
$ diff save ../examples/atexit.py
66d65
<
07/07-07:59:57 ~/issues/g4-current/python/python2_7_18/Generated-CSharp-0
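As an aside, the frontier-concatenation idea itself is easy to state in code. The following is a toy sketch (not Trash internals), with a hand-built tree: interior nodes pair a rule name with children, and leaves are just token text.

```python
# Toy parse tree: a node is (rule_name, [children]); a leaf is its token text.
tree = ("funcdef", [
    "def", " ", "test",
    ("parameters", ["(", ")"]),
    ":", "\n",
])

def frontier(node):
    """Concatenate the text of the leaves, left to right."""
    if isinstance(node, str):
        return node
    _, children = node
    return "".join(frontier(c) for c in children)

# If a leaf carried the placeholder text '<INDENT>', it would appear verbatim
# in the reconstruction -- exactly the artifact shown in the transcript above.
print(frontier(tree))
```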
NB: trtext outputs an extra newline character because it calls `Console.WriteLine()` instead of `Console.Write()`. It has to do this because dotnet programs don't work perfectly with a Cygwin/MSYS shell. Instead, one should use trsponge to perform the reconstruction and output.
The second statement says that they cannot be deleted because they are essential to reconstruct the source.
The second statement was about the original Python tokenizer.
... Trash follows the basic concept that existed in CS since the 1960's: the input text is simply the concatenation of the text of the frontier of the parse tree ...
Now I understand what the problem is. I didn't know this recommendation.
I can imagine two alternatives in this case:
Solution 1: The text of the INDENT/DEDENT tokens would contain the indentation similar to Python's tokenizer. Currently, the indentation text is stored in the WS tokens before the INDENT/DEDENT tokens. This is problematic because it may cause compatibility problems with older applications that use the PythonLexerBase class.
Solution 2: The INDENT/DEDENT tokens would store an empty string. This is simpler and less likely to cause compatibility issues. Currently, the text property of INDENT tokens is consistently `"<INDENT>"`, and similarly that of DEDENT tokens is `"<DEDENT>"`. If these were empty strings, then restoring the original source code would only require concatenating the text properties of the tokens. This would be similar to deleting the INDENT/DEDENT tokens. I recommend the second solution.
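A toy sketch of solution 2 (hypothetical token tuples, not the actual PythonLexerBase types): if the inserted tokens carry empty text, plain concatenation of the token texts restores the source.

```python
# (type, text) pairs for: "def f():\n    x=1\n",
# with INDENT/DEDENT/EOF inserted carrying empty text.
tokens = [
    ("DEF", "def"), ("WS", " "), ("NAME", "f"), ("LPAR", "("), ("RPAR", ")"),
    ("COLON", ":"), ("NEWLINE", "\n"),
    ("WS", "    "), ("INDENT", ""),   # inserted token, empty text
    ("NAME", "x"), ("EQUAL", "="), ("NUMBER", "1"), ("NEWLINE", "\n"),
    ("DEDENT", ""),                   # inserted token, empty text
    ("EOF", ""),                      # would also need empty text
]

# Restoring the source is now a simple concatenation.
restored = "".join(text for _, text in tokens)
print(restored)
```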
I didn't understand. Can you explain what has to be changed? Do I need to change any grammar files?
We are trying to parse a Python 2.x file using Java. When I tried to print `FuncdefContext.suite.getText()` of the test() function for this example,
def test():
    xxx=1
    print xxx

def greet():
  print 'Hello World'

greet();
Output:
<INDENT>xxx=1
printxxx
<DEDENT>
and the end line for this test() function is 5.
Can you tell me what should be done here to get the correct end line?
`tree.getText()` doesn't reconstruct the text of the input. It never does, for virtually every Antlr grammar! This is because Antlr parse trees don't contain all the tokens of the input, such as comments and whitespace, nor do they contain strings that are "skipped". Grammars that define lexer rules with `-> skip` or `-> channel(HIDDEN)` cause input strings to not be tokenized, or to be tokenized with the channel property set to 1; the leaves in the parse tree don't contain these tokens. For python2_7_18, the DEDENT and INDENT tokens contain the strings `<DEDENT>` and `<INDENT>` as text, and these tokens are part of the Antlr parse tree. This is why you see `tree.getText()` contain strings for the DEDENT and INDENT tokens. The "approved" way to get the text from an Antlr parse tree is to query the input char stream directly, using the parse tree to get the bounds of the indices of the text. See https://stackoverflow.com/a/55852474/4779853 or https://github.com/antlr/antlr4/issues/1302
Trash doesn't represent the parse tree like Antlr does. It incorporates the entire input, including whitespace and comments. It's done this way so that it's fully serializable, with no loss of text, and fully editable. The way Antlr splits the parse tree from the token stream and char stream is unnatural, and difficult/slow to serialize and edit.
Hi, thanks for your response. I understand that you have suggested how to get the text from the ANTLR parse tree. Our use case is to parse an input Python file and identify the start line and end line for each class, function, statement, comment, etc. in the file, and while doing so we are facing an issue fetching the end line from the function and statement contexts (for and while loops, ...).
The easiest solution would be to just delete the INDENT and DEDENT leaves, then just get the Interval for the sub-tree. But, the Antlr runtime doesn't have tree editing.
Instead, do this:
1) Get the Interval of the node for the funcdef or stmt. The Interval is the start and end indices of the tokens for that sub-tree (i.e., not the start and end of the character buffer).
2) Write a loop that starts at the ending token index. Working backwards, skip all INDENT and DEDENT tokens until you find something that is not an INDENT or DEDENT. Do not back up further than the starting token index. We now have the end token index of the funcdef or stmt.
3) Get the end token from its end token index.
4) Get the end character index from the end token.
5) Write a loop that starts at the end character index and scans the character buffer backwards. Stop looping when you find a character that is not a newline; this is the character index of the last non-newline for the funcdef or stmt.
6) You can now return 1 + the character index of the last non-newline for the funcdef or stmt.
In C#:
var funcdefs = new Antlr4.Runtime.Tree.Xpath.XPath(parser, "//funcdef").Evaluate(tree);
var funcdef = funcdefs.FirstOrDefault();
var token_interval = funcdef.SourceInterval;
int end_token_index = token_interval.b;
for (; end_token_index >= token_interval.a; --end_token_index)
{
    if (tokens.Get(end_token_index).Type != PythonParser.INDENT
        && tokens.Get(end_token_index).Type != PythonParser.DEDENT
        && tokens.Get(end_token_index).Type != PythonParser.WS
        && tokens.Get(end_token_index).Type != PythonParser.NEWLINE
        && tokens.Get(end_token_index).Channel == 0)
    {
        break;
    }
}
var start_token = tokens.Get(token_interval.a);
var end_token = tokens.Get(end_token_index);
var start_char_index = start_token.StartIndex;
var end_char_index = end_token.StopIndex;
System.Console.WriteLine("funcdef text:");
System.Console.WriteLine(str.GetText(new Interval(start_char_index, end_char_index)));
[D]oes ANTLR python 2.7.18 grammars support python 2.6 version too?
I would think so, but don't quote me.
Hi, thanks for your response. We will check the suggestion you have provided, as we have built in Java. Also, in our case we are using a custom listener class to identify the end lines for each class, function, statement, etc. by overriding the base listener enter and exit methods.
For Example:
@Override
public void enterFuncdef(FuncdefContext ctx) {
    int start = ctx.getStart().getLine();
    int stop = ctx.getStop().getLine();
}
Would you like to suggest if we can handle the endlines correctly here?
[W]e are using custom listener class to identify the endlines for each class, function, statement, etc. by overriding the base listener enter and exit methods.
For Example:
@Override public void enterFuncdef(FuncdefContext ctx) { int start = ctx.getStart().getLine(); int stop = ctx.getStop().getLine(); }
Would you like to suggest if we can handle the endlines correctly here?
Not quite. Try this.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.*;

public class MyListener extends PythonParserBaseListener {
    CommonTokenStream tokens_;
    CharStream str_;

    public MyListener(CommonTokenStream tokens, CharStream str)
    {
        tokens_ = tokens;
        str_ = str;
    }

    @Override public void enterFuncdef(PythonParser.FuncdefContext ctx)
    {
        var start = ctx.getStart().getLine();
        var token_interval = ctx.getSourceInterval();
        var end_token_index = token_interval.b;
        var tokens = this.tokens_;
        var str = this.str_;
        for (; end_token_index >= token_interval.a; --end_token_index)
        {
            if (tokens.get(end_token_index).getType() != PythonParser.INDENT
                && tokens.get(end_token_index).getType() != PythonParser.DEDENT
                && tokens.get(end_token_index).getType() != PythonParser.WS
                && tokens.get(end_token_index).getType() != PythonParser.NEWLINE
                && tokens.get(end_token_index).getChannel() == 0)
            {
                break;
            }
        }
        var start_token = tokens.get(token_interval.a);
        var end_token = tokens.get(end_token_index);
        var start_char_index = start_token.getStartIndex();
        var end_char_index = end_token.getStopIndex();
        var stop_line_number = end_token.getLine();
        System.out.println("stop = " + stop_line_number);
        System.out.println("funcdef text:");
        System.out.println(str.getText(new Interval(start_char_index, end_char_index)));
    }
}
The "problem" is on the trtext-side of things. trtext reconstructs the text of the input by concatenating the text of the leaves of the parse tree. So, I see `<INDENT>` and `<DEDENT>` sprinkled in the reconstructed text.
It looks like the text property of the inserted tokens is still necessary. In the left tree view, the text property of the inserted INDENT, DEDENT and NEWLINE tokens is as follows:
<INDENT>
<DEDENT>
<NEWLINE>
In the tree view on the right, there are empty strings for the inserted tokens, so that the original source code can be reconstructed with a simple concatenation. However, because of this, the tree view on the right becomes unreadable.
Furthermore, the text property of the EOF token generated by ANTLR is also not an empty string, but `<EOF>`. To recover the original source code with a simple concatenation, this should also be an empty string.
Python example (there is no newline after the `continue` statement):

if True:
    continue

To show the tree view:

grun Python file_input -gui example.py
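For comparison, recent CPython versions handle both cases with empty-text tokens: a NEWLINE token with empty text is inserted when the input lacks a trailing newline, and the end-of-input token ENDMARKER (the analogue of ANTLR's EOF) also has empty text. A sketch:

```python
import io
import tokenize

# No trailing newline, as in the example above. (tokenize does not care
# that 'continue' appears outside a loop -- it only tokenizes.)
src = "if True:\n    continue"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))

# The tokenizer inserts a NEWLINE token with empty text at end of input...
inserted_nl = [t for t in toks if t.type == tokenize.NEWLINE and t.string == ""]
print(len(inserted_nl))  # 1

# ...and the end-of-input token has empty text as well.
print(toks[-1].type == tokenize.ENDMARKER, repr(toks[-1].string))
```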
Furthermore, the text property of the EOF token generated by ANTLR is also not an empty string, but <EOF>
It might be a good idea to add to the readme.md comments on text recovery for the Python grammars. E.g., for Trash: `trparse x.txt | trquery delete ' //(INDENT | DEDENT | NEWLINE[text()="<NEWLINE>"])' | trsponge`.
It might be a good idea to add to the readme.md comments on text recovery for the python grammars.
Good idea, I support it.
However, wouldn't such a function belong in the CommonTokenStream class? E.g., with a method name like `GetRecoveredInputText()`, which could be overridden in special cases in an inherited class; in our case, e.g., with a class named `PythonCommonTokenStream`.
Also, I'm thinking that inserted tokens (including EOF) could get a separate token channel during tokenization, e.g. `INSERTED_CHANNEL`. Thus, recovering the original source code would be just a simple concatenation, omitting the tokens on the `INSERTED_CHANNEL`.
I don't know if Trash can filter by channel instead of deleting.
I don't know if the Trash can filter by channel instead of delete.
I don't think so at the moment. Token text is addressed using the function `text()`; e.g., `ID/text()` would be the name of an ID. Off-channel tokens are addressed using an `@` sign, e.g., `@COMMENT`. You can even say `@COMMENT/text()` and make queries about comments containing certain text. But I don't think I defined functions for channel, mainly because the engine is XPath 2.
ANTLR 4 does not recognize the end lines correctly. Below is the link to the grammar we used, with an example.
Link: https://github.com/antlr/grammars-v4/tree/master/python/python2_7_18
The start line for test is 1 and the expected end line is 3, but the reported end line is the start line of the next function, which is 4.