bodograumann / python-iconv

Python 3 wrapper for iconv and usage as codecs
GNU General Public License v3.0
7 stars 2 forks source link

Encoding fails when output is longer than input #2

Closed nwoods closed 3 years ago

nwoods commented 3 years ago

The following script raises an unexpected error:

#!/usr/bin/env python3

import iconvcodec

failmode_even = False

tm = '™'
if failmode_even:
    tm += ' '

tm_bin = tm.encode('ASCII//TRANSLIT')

print(tm_bin)

(in case GitHub doesn't support UTF-8 here, the character in the tm string is U+2122).

I'm using Python 3.8.5 on Ubuntu 20.04, with python-iconv 1.1.0 installed by pip.

There are two possible exceptions raised depending on whether the input string length is even or odd (change failmode_even in the example script to toggle them) but the underlying cause appears to be the same. Looks to me like a faulty (or maybe outdated?) assumption about the width of characters in Python strings but I wasn't able to figure out a fix after a few minutes of cursory exploration.

Exceptions:

Traceback (most recent call last):
  File "/home/nwoods/.local/lib/python3.8/site-packages/iconvcodec.py", line 10, in encode
    return encoder.iconv(msg), len(msg)
iconv.error: ('Argument list too long', 7, 0, b'')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nwoods/.local/lib/python3.8/site-packages/iconvcodec.py", line 17, in encode
    out1, len1 = encode(msg[inlen:], errors)
TypeError: slice indices must be integers or None or have an __index__ method

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./tmfail.py", line 7, in <module>
    tm_bin = tm.encode('ASCII//TRANSLIT')
TypeError: encoding with 'ASCII//TRANSLIT' codec failed (TypeError: slice indices must be integers or None or have an __index__ method)

or

Traceback (most recent call last):
  File "/home/nwoods/.local/lib/python3.8/site-packages/iconvcodec.py", line 10, in encode
    return encoder.iconv(msg), len(msg)
iconv.error: ('Argument list too long', 7, 3, b'(TM)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nwoods/.local/lib/python3.8/site-packages/iconvcodec.py", line 13, in encode
    assert inlen % 2 == 0
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./tmfail.py", line 11, in <module>
    tm_bin = tm.encode('ASCII//TRANSLIT')
AssertionError: encoding with 'ASCII//TRANSLIT' codec failed (AssertionError: )

Thanks for your help, let me know if there are any more details I can provide.

bodograumann commented 3 years ago

Thank you for reporting this issue. I cannot fix this quickly though.

When running locally with python 3.9, I cannot even execute the existing unit tests successfully. Loading the ASCII//TRANSLIT codec is not possible at all there. Though iconv on the command line works fine with it. (So I’m not able to reproduce the problem right now ^^) I’ll probably have to do some tests with pyenv / asdf and different python versions. The version I originally used was probably 3.7. Maybe there was some change in the c-api, but I can’t be sure.

If you could contribute something that would of course be highly appreciated. Setting up a local dev environment is as simple as:

python -m venv env
source env/bin/activate
pip install -e .
python -m unittest
nwoods commented 3 years ago

I'll see if I can find time to mess around with it in the next few days but no promises. I've never needed to use the buffer protocol before and after a quick glance at the documentation, I'm not overly optimistic that I'll be able to master it in a short timeframe. In particular, I didn't see an obvious way to translate the number of bytes converted into the number of characters converted. Maybe it's obvious to someone who has used that API before; I'll tinker a little more.

nwoods commented 3 years ago

I submitted a pull request that seems to fix the issue, along with a few other problems I noticed while messing around with it. I think the main problem was that the indexing and retries didn't change when the encode() implementation was changed to use byte strings internally.

bodograumann commented 3 years ago

Haha, auto-closed on merge :-D Thank you so much for your contribution.

Do you have other specific example strings that caused errors previously? I would like to add some more unit tests.

bodograumann commented 3 years ago

I published v1.1.1 with the fix to pypi.

nwoods commented 3 years ago

The original bug was in cases where the output string was longer than the input string. The easiest example is converting any string containing "™" (U+2122) from unicode (where it is 3 bytes) to ASCII//TRANSLIT, where it is rendered as "(TM)" (4 bytes). My other tinkering fixed bugs when errors was 'replace' or 'ignore'. I think any conversion that triggers those should do.