Carets/offsets are wrong for unicode characters

ammaraskar commented 3 years ago

Using the code:

# -*- coding: utf-8 -*-

1 + "Ĥellö Wörld"

leads to

Traceback (most recent call last):
  File "C:\Users\ammar\junk\test.py", line 3, in <module>
    1 + "Ĥellö Wörld"
    ^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'str'

pablogsal commented 3 years ago

Maybe we need to use something like https://github.com/python/cpython/blob/main/Parser/pegen.c#L143

but on the other hand, we handle this correctly for syntax errors:

>>> f(4, "Ĥellö Wörld" for x in range(1))
  File "<stdin>", line 1
    f(4, "Ĥellö Wörld" for x in range(1))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

and the numbers used are the same, in reality.

ammaraskar commented 3 years ago

Huh I wonder if there is some magic going on in the syntax error handling because the AST column offsets are definitely byte-based rather than character.

> python -m ast -a C:\Users\ammar\junk\test.py
Module(
   body=[
      Expr(
         value=BinOp(
            left=Constant(
               value=1,
               lineno=3,
               col_offset=0,
               end_lineno=3,
               end_col_offset=1),
            op=Add(),
            right=Constant(
               value='Ĥellö Wörld',
               lineno=3,
               col_offset=4,
               end_lineno=3,
               end_col_offset=20),
            lineno=3,
            col_offset=0,
            end_lineno=3,
            end_col_offset=20),
         lineno=3,
         col_offset=0,
         end_lineno=3,
         end_col_offset=20)],
   type_ignores=[])

pablogsal commented 3 years ago

Huh I wonder if there is some magic going on in the syntax error handling because the AST column offsets are definitely byte-based rather than character.

Surprise surprise:

https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L496-L501

ammaraskar commented 3 years ago

So there's a weird edge case here around non-utf8 encoded source files that we need to consider if we want to add special handling for:

Consider a file like:

# -*- coding: cp437 -*-

f(4, "Hëllo World τΦΘ" for x in range(1))

Now in the compiler I think this ends up magically getting handled because the tokenizer re-encodes the source file lines to utf8. When it goes into the pegen error handling code, this call to PyErr_ProgramTextObject ends up failing because this line fails to parse the line properly as utf-8. This causes the fallback logic for getting the line to be invoked which decodes the line as utf-8 which is valid since the tokenizer re-encoded it to utf8 in the buffer. Thus the assumption in byte_offset_to_character_offset that the line is utf-8 holds true and it all works.

(As an aside I think there's a bug here if we use a custom # -*- coding: that encodes characters differently to utf-8 but still parses as valid utf-8).

However, this case is a bit trickier to handle in the traceback machinery. In the traceback code we really only know the filename and the offsets, we can't rely on the tokenizer to have done the work of figuring out the proper source encoding for the file. I wonder if we should just assume it's UTF8 and maybe decode in the replace mode since this is such an edge case or actually handle it properly.

ammaraskar commented 3 years ago

(As an aside I think there's a bug here if we use a custom # -*- coding: that encodes characters differently to utf-8 but still parses as valid utf-8).

I was able to trigger this bug:

# -*- coding: cp437 -*-

f(4, '┬ó┬ó┬ó┬ó┬ó┬ó' for x in range(1)) # Test

  File "test-weird-encoding.py", line 3
    f(4, '¢¢¢¢¢¢' for x in range(1)) # Test
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Generator expression must be parenthesized

Let me make a bpo issue for it.

pablogsal commented 3 years ago

Excellent find @ammaraskar. This is going to be a bit of a pain to fix, I am afraid :(

pablogsal commented 3 years ago

This also affects older versions of the interpreter, is basically a bug in PyErr_ProgramTextObject:

  File "/Users/pgalindo3/github/cpython/lel.py", line 3
    f('┬ó┬ó┬ó┬ó┬ó┬ó', 4 for x in range(1)) # Test
                                                                ^
SyntaxError: Generator expression must be parenthesized

ammaraskar commented 3 years ago

bpo issue created: https://bugs.python.org/issue44349

ammaraskar commented 3 years ago

I think for our case in the traceback, since this is such a weird edge case we should probably just assume the input is utf8 and ignore/replace decoding errors? @isidentical would it be possible to quickly check with your AST tooling how many python packages use a custom # -*- coding: line that isn't utf8?

pablogsal commented 3 years ago

I think for our case in the traceback, since this is such a weird edge case we should probably just assume the input is utf8 and ignore/replace decoding errors?

I very much think so, the reason is that the better we can do is print the error as utf8 and that is still slighly weird, but is consistent with the rest of the interpreter. That is what I did here:

https://github.com/python/cpython/pull/26611

pablogsal commented 3 years ago

@isidentical would it be possible to quickly check with your AST tooling how many python packages use a custom # -*- coding: line that isn't utf8?

Also, notice that the problem happens when the encoding is not utf8 and the line with the error has something tha is decodeable as utf8. But this happens also with older versions when reporting any error with caret, as the initial position will also be wrong.

On the other hand, we could add some code to make sure we are decoding a utf-8 file by checking the BOM and don't show the caret if that's not true.

ammaraskar commented 3 years ago

However, this case is a bit trickier to handle in the traceback machinery.

Aha, I spoke too soon! The traceback machinery is already set up to detect the encoding so we are all set here 😅 https://github.com/python/cpython/blob/main/Python/traceback.c#L414

colnotab / cpython

Carets/offsets are wrong for unicode characters #10