Unicode support somewhat broken

byroot / pysrt

Python parser for SubRip (srt) files

GNU General Public License v3.0

446 stars 67 forks source link

Unicode support somewhat broken #50

Closed felagund closed 10 years ago

felagund commented 10 years ago

Running

 a = pysrt.open('tests/static/utf-8.srt')
print a[1]

Gives me:


<ipython-input-11-e9d376687425> in <module>()
----> 1 print a[1]

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 51: ordinal not in range(128)```

Running `print a[1].__str__()` is fine though, which puzzles me. When dealing with Unicode, I gather that `str(text)` and `str(poition)` in the init method of SubRipItem should not be used, but I am not so sure. I tried fixing this, but since pysrt wants to support both python 2 and 3, I am not sure how to go about it and failed miserably so far. Do you have any suggestings?

felagund commented 10 years ago

I read on the internet that str() should not be used when dealing with unicode and .encode() and .decode() should be used. This poses two problems: 1) Python 3 compatibility 2) SrtItem does not know about encoding of its .text attribute as far as I can see

byroot commented 10 years ago

SrtItem does not know about encoding of its .text attribute as far as I can see

Yes it's by design. SubRipFile deal with all the encoding mess, but SubRipItem is fully unicode and abstracted from encoding stuffs. So SubRipItem.str() in Python 2 do not make sense. Actually before the merge of py2 and 3 code bases that method did not exist.

I guess we have no other choices than to have two different implementations, and to conditionally define them based on python version, something like:

class SubRipItem:
    if is_py3:
        def __str__(self):
            #actual implementation
    else:
        def __unicode__(self):
            # same implementation

        def __str__(self):
            raise NotImplementedError

felagund commented 10 years ago

Would not it be better if __str__(self) pointed to unicode(self)? Only we would need to know which encoding to use, right, which we don't by design, if I understand it correctly.

Then, should not

self.position = str(position)
self.text = str(text)

in __init__ method of SubRipItem be

self.position = unicode(position)
self.text = unicode(text)

instead?

And in write_into method of SubRipFile, we will need to call either unicode(item) or str(item) based on whether we are using Python 2 or 3, right?

byroot commented 10 years ago

we will need to call either unicode(item) or str(item) based on whether we are using Python 2 or 3, right?

It's already the case: https://github.com/byroot/pysrt/blob/master/pysrt/compat.py#L18