Open erwinfrohsinn opened 7 years ago
Hi, the place for support is the Google group. Eric
On 29 May 2017 at 14:41, erwinfrohsinn notifications@github.com wrote:
I suspect there is an error in processing umlauts (UTF-8) in the Python code produced by antlr4 -Dlanguage=Python3.
How to reproduce: download JSON.g4, json2xml.py, t.json from here: https://github.com/jszheng/py3antlr4book/tree/master/08-JSON

Verify that everything is OK without umlauts:

python3 json2xml.py t.json # is o.k.

Copy t.json to t_uml.json and add a name with an umlaut, so that line 5 now looks like this:

"admin": ["parrt", "tombu", "jürgen"],

The name jürgen looks like this in the hex editor: 6A C3 BC 72 67 65 6E, i.e. UTF-8 compliant.

python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 64, in <module>
    input_stream = FileStream(sys.argv[1]) # Original
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 169: ordinal not in range(128)

Now try to decode UTF-8 correctly. I replaced

input_stream = FileStream(sys.argv[1])

by

fp = codecs.open(sys.argv[1], 'rb', 'utf-8')
try:
    input_stream = fp.read()
finally:
    fp.close()

which I found on stackoverflow.

python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 75, in <module>
    tree = parser.json()
  File "/media/sf_Entwicklung/antlr/08-JSON-Umlaut/JSONParser.py", line 112, in json
    self.enterRule(localctx, 0, self.RULE_json)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Parser.py", line 358, in enterRule
    self._ctx.start = self._input.LT(1)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/CommonTokenStream.py", line 61, in LT
    self.lazyInit()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
    self.setup()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 189, in setup
    self.sync(0)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 111, in sync
    fetched = self.fetch(n)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
    t = self.tokenSource.nextToken()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Lexer.py", line 111, in nextToken
    tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'

The same error occurs w/o umlauts, e.g.:

python3 json2xml.py t.json

Now, to verify that the problem is not related to the grammar JSON.g4, I did:

antlr4 JSON.g4
javac *.java
run JSON json -gui t_uml.json

which displayed the three names as an array; the umlaut was represented correctly. Conclusion: the grammar is o.k., but there is a problem in the generated Python modules.
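The first traceback can be reproduced without ANTLR at all. Below is a minimal sketch that assumes only the byte sequence quoted above and the `codecs.decode` call visible in the traceback. The helper name `try_decode` is made up for illustration. Decoding as ASCII should fail on byte 0xC3, while decoding as UTF-8 should give back the name.

```python
import codecs

# Bytes of "jürgen" as shown in the hex editor: 6A C3 BC 72 67 65 6E
RAW = bytes.fromhex("6AC3BC7267656E")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return codecs.decode(data, encoding, "strict"), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Same bytes, two different codecs
ascii_text, ascii_error = try_decode(RAW, "ascii")
utf8_text, utf8_error = try_decode(RAW, "utf-8")

print("ascii:", ascii_text, ascii_error)
print("utf-8:", utf8_text, utf8_error)
```

So the file itself looks fine; the failure comes from the codec chosen when FileStream is called without an encoding argument.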
In the Java target, it seems there has been some confusion about the meaning of UTF-8. See #1899.
I don't know much about Python, but perhaps the same thing is true for that target: what's documented as UTF-8 support might really be UTF-16 based.
I think the problem is that you are passing a string into the constructor of your Lexer instead of a FileStream. The constructor stores the poorly named argument "input" (which shadows the name of a Python built-in function) in _input. FileStream objects have a method named "mark" but strings do not, which is why the AttributeError is raised.
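One way to act on this is to decode the file yourself and then wrap the text in a stream object before handing it to the lexer. The sketch below assumes that the antlr4 Python3 runtime exports an `InputStream` class that accepts an already decoded `str`. The helper names `read_source_text` and `make_input_stream` are hypothetical.

```python
def read_source_text(path, encoding="utf-8"):
    """Read the whole file and decode it with an explicit encoding."""
    with open(path, "r", encoding=encoding) as fp:
        return fp.read()


def make_input_stream(path, encoding="utf-8"):
    """Wrap the decoded text in a stream object instead of passing a bare str."""
    # Assumption: antlr4's InputStream takes a str. It is imported here so
    # that the decoding helper above can be used without antlr4 installed.
    from antlr4 import InputStream
    return InputStream(read_source_text(path, encoding))
```

In json2xml.py this would take the place of the `codecs.open` snippet, for example `input_stream = make_input_stream(sys.argv[1])`, with the result passed to the generated lexer as before.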
I used the C.g4 grammar and generated Python classes from it. Then I passed in a C source file containing a German umlaut 'ö'. The FileStream class of the antlr4 package cannot handle this and aborts. You can pass an additional argument for the encoding, but it seems that FileStream ignores it.
My python main.py:
from antlr4 import *
from grammar.CListener import CListener
from grammar.CLexer import CLexer
from grammar.CParser import CParser
import sys

def main():
    # input_stream = FileStream(sys.argv[1])
    input_stream = FileStream(sys.argv[1], 'utf-8')
    lexer = CLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = CParser(stream)
    tree = parser.expression()
    # handleExpression(tree)

if __name__ == '__main__':
    main()
The traceback:
Traceback (most recent call last):
  File "<some path>\main.py", line 18, in <module>
    main()
  File "<some path>\main.py", line 9, in main
    input_stream = FileStream(sys.argv[1])
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python39\lib\site-packages\antlr4\FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python39\lib\site-packages\antlr4\FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 14955: ordinal not in range(128)
I'm using Python 3.9.5 on Windows 10.
@ThirtySomething can you check whether encoding == 'utf-8' when reaching line 27 of FileStream.py?
@ericvergnaud I'm sorry, but I don't know how to debug into FileStream.py. I've just installed it using pip install .... Yes, I'm able to open the code, but the breakpoints there are still ignored. When I try to step in, VSCode immediately comes back with the exception. But I assume that the reason is that the file itself is an ISO-8859-1 encoded file, not a UTF-8 encoded one, so the UTF-8 detection may fail.
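That guess is consistent with the byte in the traceback: 0xF6 is 'ö' in ISO-8859-1, and 0xF6 never occurs in valid UTF-8. Below is a small sketch of trying UTF-8 first and falling back to ISO-8859-1. It uses only the standard library, and the helper name `decode_with_fallback` and the sample text are made up for illustration.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not reachable with the defaults, because ISO-8859-1 accepts every byte.
    raise ValueError("none of the candidate encodings matched")


# The same text with 'ö', encoded two different ways
latin1_bytes = "int gr\u00f6sse;".encode("iso-8859-1")  # contains byte 0xF6
utf8_bytes = "int gr\u00f6sse;".encode("utf-8")         # contains bytes 0xC3 0xB6

print(decode_with_fallback(latin1_bytes))
print(decode_with_fallback(utf8_bytes))
```

If the C source really is ISO-8859-1, passing that encoding name as the second argument to FileStream, in place of 'utf-8', would be the thing to try.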