antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

An error occurs when antlr4-parse decodes the output. #4282

Open TheVeryDarkness opened 1 year ago

TheVeryDarkness commented 1 year ago

When I use antlr4-parse to parse a UTF-8 file with the encoding set to utf-8, the following error occurs:

Traceback (most recent call last):  File "D:\scoop\apps\python39\current\lib\runpy.py", line 197, in _run_module_as_main    return _run_code(code, main_globals, None,  File "D:\scoop\apps\python39\current\lib\runpy.py", line 87, in _run_code    exec(code, run_globals)  File "D:\scoop\apps\python39\current\Scripts\antlr4-parse.exe\__main__.py", line 7, in <module>  File "D:\scoop\apps\python39\current\lib\site-packages\antlr4_tool_runner.py", line 153, in interp    err = err.decode("UTF-8")UnicodeDecodeError: 'utf-8' codec can't decode byte  0xd5 in position 198: invalid continuation byte

It seems the output is actually encoded in GBK but is being decoded as UTF-8. I've tried chcp (I'm on Windows), but the error remains.
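
The mismatch described above can be reproduced in isolation. This is a minimal sketch, assuming the subprocess really did emit GBK-encoded text; the sample string is only illustrative and stands in for the actual stderr content, which the thread does not show.

```python
# Illustrative sample text (an assumption; the real stderr bytes are not shown).
text = "任意的 Unicode 字符"

# Simulate a child process that writes its output in GBK.
gbk_bytes = text.encode("gbk")

# Decoding those bytes as UTF-8 fails, as in the traceback.
try:
    gbk_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("UTF-8 decoding failed:", exc.reason)

# Decoding with the encoding that was actually used round-trips cleanly.
assert gbk_bytes.decode("gbk") == text
```

The same bytes are valid in one encoding and invalid in the other, so the failure says nothing about the input file itself.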

An example grammar file is

grammar test;

// Comment: '//'~[\n\r]* -> skip;
EmptyLine: [\n\r]+ -> skip;
Space: [ \t] -> skip;
Token: ~[ \t\n\r]+;

program: Token*;

An example input file is

任意的 Unicode 字符

Command is

antlr4-parse test.g4 program -tree -encoding utf-8 test.txt

TheVeryDarkness commented 1 year ago

Sorry, the error log above is hard to read. It should be:

Traceback (most recent call last):
  File "D:\scoop\apps\python39\current\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\scoop\apps\python39\current\lib\runpy.py", line 87, in _run_code 
    exec(code, run_globals)
  File "D:\scoop\apps\python39\current\Scripts\antlr4-parse.exe\__main__.py", line 7, in <module>
  File "D:\scoop\apps\python39\current\lib\site-packages\antlr4_tool_runner.py", line 153, in interp
    err = err.decode("UTF-8") 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 198: invalid continuation byte

jimidle commented 1 year ago

Have you tried just reading the file directly in Python, specifying utf-8 encoding? I suspect that your input file is not actually UTF-8.

TheVeryDarkness commented 1 year ago

Thanks for your reply. Give me a minute and I'll post the result of the decoding test you suggested. Note, though, that the error occurred while decoding the output of popen(), not my input. It seems the subprocess reads UTF-8 but writes GBK.

TheVeryDarkness commented 1 year ago

The file can be read successfully, as shown below:

>>> open("test.txt", encoding="UTF-8").read()
'任意的 Unicode 字符'

The error probably occurs at one of the last two lines below (part of the function interp in antlr4_tool_runner.py):

    p = subprocess.Popen([java, '-cp', jar, 'org.antlr.v4.gui.Interpreter']+args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    out = out.decode("UTF-8")
    err = err.decode("UTF-8")

Since I've already tried chcp 65001, I'm wondering why popen() keeps producing output in GBK.
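
A possible explanation, not confirmed in this thread, is that chcp only changes the console code page, while a process writing to a pipe may use its own default encoding instead. Under that assumption, one workaround is to decode more tolerantly on the Python side. The sketch below is not the code in antlr4_tool_runner.py; the helper name decode_output and its fallback parameter are hypothetical.

```python
import locale
from typing import Optional


def decode_output(raw: bytes, fallback: Optional[str] = None) -> str:
    """Decode subprocess output: try UTF-8 first, then a fallback encoding.

    `fallback` is a hypothetical parameter; when omitted, the platform's
    preferred encoding is used (often cp936/GBK on Chinese-locale Windows).
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if fallback is None:
            fallback = locale.getpreferredencoding(False)
        # Replace any bytes that are still undecodable instead of raising,
        # so that an error message is never lost to a decoding failure.
        return raw.decode(fallback, errors="replace")
```

With a helper like this, the two decode calls quoted above could become out = decode_output(out) and err = decode_output(err). Trying UTF-8 first keeps the current behavior for output that is already valid UTF-8.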