python3 ,Umlauts, utf-8 issue

erwinfrohsinn commented 7 years ago

I believe there is an error in how umlauts (UTF-8) are processed by the Python code that antlr4 -Dlanguage=Python3 produces.

How to reproduce:

1. Download JSON.g4, json2xml.py, and t.json from here: https://github.com/jszheng/py3antlr4book/tree/master/08-JSON
2. Verify that everything is OK without umlauts: python3 json2xml.py t.json # is o.k.
3. Copy t.json to t_uml.json and add a name with an umlaut, so that line 5 now looks like this: "admin": ["parrt", "tombu", "jürgen"],

The name jürgen looks like this in the hex editor: 6A C3 BC 72 67 65 6E, i.e. UTF-8 compliant.

python3 json2xml.py t_uml.json
Traceback (most recent call last):
  File "json2xml.py", line 64, in <module>
    input_stream = FileStream(sys.argv[1]) # Original
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "/usr/local/lib/python3.5/dist-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 169: ordinal not in range(128)
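
The decoding failure itself can be reproduced without ANTLR. What follows is a minimal sketch using only the standard library and the byte sequence quoted above; the variable names are illustrative. The traceback suggests the file was decoded as ASCII because no encoding was passed to FileStream.

```python
import codecs

# Bytes of "jürgen" as shown in the hex editor: 6A C3 BC 72 67 65 6E
raw = bytes.fromhex("6A C3 BC 72 67 65 6E")

# Decoding as ASCII fails on the first byte >= 0x80 (0xC3 here),
# which is the same kind of error the traceback reports.
try:
    codecs.decode(raw, "ascii", "strict")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the same bytes as UTF-8 succeeds.
name = codecs.decode(raw, "utf-8", "strict")

print(ascii_error)
print(name)
```

If this reading is right, passing the encoding explicitly, as in FileStream(sys.argv[1], 'utf-8'), would be the first thing to try.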

Next, I tried to decode the file as UTF-8 explicitly. I replaced input_stream = FileStream(sys.argv[1]) with

        fp = codecs.open(sys.argv[1], 'rb', 'utf-8')    
        try:
            input_stream = fp.read()
        finally:
            fp.close()

which I found on Stack Overflow.

python3 json2xml.py t_uml.json 
Traceback (most recent call last):
  File "json2xml.py", line 75, in <module>
    tree = parser.json()
  File "/media/sf_Entwicklung/antlr/08-JSON-Umlaut/JSONParser.py", line 112, in json
    self.enterRule(localctx, 0, self.RULE_json)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Parser.py", line 358, in enterRule
    self._ctx.start = self._input.LT(1)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/CommonTokenStream.py", line 61, in LT
    self.lazyInit()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
    self.setup()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 189, in setup
    self.sync(0)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 111, in sync
    fetched = self.fetch(n)
  File "/usr/local/lib/python3.5/dist-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
    t = self.tokenSource.nextToken()
  File "/usr/local/lib/python3.5/dist-packages/antlr4/Lexer.py", line 111, in nextToken
    tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'

The same error occurs without umlauts, e.g. with: python3 json2xml.py t.json

Now, to verify that the problem is not related to the grammar JSON.g4, I did:

antlr4 JSON.g4 
javac *.java
run JSON json -gui t_uml.json

which displayed the three names as an array, with the umlaut rendered correctly. Conclusion: the grammar is OK, but there is a problem in the generated Python modules.

ericvergnaud commented 7 years ago

Hi, the place for support is the Google group. Eric

BurtHarris commented 7 years ago

In the Java target, it seems there has been some confusion about the meaning of UTF-8. See #1899.

I don't know much about Python, but perhaps the same thing is true for that target: what's documented as UTF-8 support might really be UTF-16 based.

nturley commented 6 years ago

I think the problem is that you are passing a string into the constructor of your Lexer instead of a FileStream. The constructor stores its argument, poorly named "input" (which shadows the name of a Python built-in function), in _input. FileStream objects have a method named "mark" but strings do not, which is why the AttributeError is raised.
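
If that diagnosis is right, the decoded text needs to be wrapped in a stream object before it reaches the lexer. Below is a minimal sketch, assuming the antlr4-python3-runtime package, whose InputStream class wraps an already decoded str. The helper names read_text and make_input_stream are hypothetical and not part of ANTLR.

```python
def read_text(path, encoding="utf-8"):
    """Read a file and return its contents as a decoded str."""
    with open(path, encoding=encoding) as fp:
        return fp.read()


def make_input_stream(path, encoding="utf-8"):
    """Wrap the decoded text in an ANTLR InputStream, which provides mark()."""
    # Imported here so read_text stays usable without the ANTLR runtime.
    from antlr4 import InputStream
    return InputStream(read_text(path, encoding))


# Possible usage in json2xml.py (JSONLexer is the generated lexer class):
#   input_stream = make_input_stream(sys.argv[1])
#   lexer = JSONLexer(input_stream)
```

Passing the encoding straight to FileStream, as in FileStream(sys.argv[1], 'utf-8'), should be an equivalent and shorter alternative.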

ThirtySomething commented 3 years ago

I used the C.g4 grammar and generated Python classes from it. Then I passed in a C source file containing the German umlaut 'ö'. The FileStream class of the antlr4 package cannot handle this and aborts. You can pass an additional argument for the encoding, but FileStream seems to ignore it.

My python main.py:

from antlr4 import *
from grammar.CListener import CListener
from grammar.CLexer import CLexer
from grammar.CParser import CParser
import sys

def main():
    # input_stream = FileStream(sys.argv[1])
    input_stream = FileStream(sys.argv[1], 'utf-8')
    lexer = CLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = CParser(stream)
    tree = parser.expression()
    # handleExpression(tree)

if __name__ == '__main__':
    main()

The traceback:

Traceback (most recent call last):
  File "<some path>\main.py", line 18, in <module>
    main()
  File "<some path>\main.py", line 9, in main
    input_stream = FileStream(sys.argv[1])
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python39\lib\site-packages\antlr4\FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python39\lib\site-packages\antlr4\FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 14955: ordinal not in range(128)

I'm using Python 3.9.5 on Windows 10.

ericvergnaud commented 3 years ago

@ThirtySomething can you check whether encoding == 'utf-8' when reaching line 27 of FileStream.py?

ThirtySomething commented 3 years ago

@ericvergnaud I'm sorry, but I don't know how to debug into FileStream.py; I just installed the package using pip install .... I can open the code, but breakpoints set there are ignored, and when I try to step in, VSCode immediately comes back with the exception. However, I assume the reason is that the file itself is ISO-8859-1 encoded, not UTF-8 encoded, so the UTF-8 detection may fail.
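
The byte in the traceback is consistent with that guess: 0xf6 is 'ö' in ISO-8859-1, whereas UTF-8 encodes 'ö' as the two bytes C3 B6, and the byte 0xf6 never occurs in valid UTF-8. Below is a minimal sketch of one way to cope with files of either encoding, using only the standard library. decode_source is a hypothetical helper, not part of ANTLR; it tries UTF-8 first and falls back to ISO-8859-1.

```python
def decode_source(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


latin1_bytes = "schön".encode("iso-8859-1")  # 'ö' is the single byte 0xF6 here
utf8_bytes = "schön".encode("utf-8")         # 'ö' is the two bytes 0xC3 0xB6 here

print(decode_source(latin1_bytes))
print(decode_source(utf8_bytes))
```

Since ISO-8859-1 maps every byte to a character, it has to come last in the list: it never fails, so it would otherwise mask a valid UTF-8 file. The resulting str could then be wrapped in antlr4's InputStream, or the detected encoding name passed to FileStream.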

antlr / antlr4

python3 ,Umlauts, utf-8 issue #1888