byroot / pysrt

Python parser for SubRip (srt) files
GNU General Public License v3.0

Encode error #12

Closed limpbrains closed 11 years ago

limpbrains commented 12 years ago

I can't run srt on this file: http://dl.dropbox.com/u/1788271/Bones.S07E01.HDTVRip.srt (encoded in cp1251). I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/srt", line 9, in <module>
    load_entry_point('pysrt==0.4.1', 'console_scripts', 'srt')()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 190, in main
    SubRipShifter().run(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 118, in run
    self.arguments.action()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 164, in break_lines
    self.input_file.break_lines(self.arguments.length)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 177, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 131, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 159, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 190, in stream
    yield SubRipItem.from_lines(source)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 79, in from_lines
    return cls(index, start, end, body, position)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 21, in __init__
    self.index = int(index)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string

byroot commented 12 years ago

Strange, I'm able to shift it without encoding error.

srt shift 20 russian.srt

Can you paste the whole command you typed?

byroot commented 12 years ago

Well, a month without a reply, so I'm closing this issue.

Feel free to reopen it if you still have a problem.

limpbrains commented 12 years ago

Hi, sorry for the late response.

srt shift 40s 33.srt

Traceback (most recent call last):
  File "/usr/local/bin/srt", line 9, in <module>
    load_entry_point('pysrt==0.4.1', 'console_scripts', 'srt')()
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 192, in main
    SubRipShifter().run(sys.argv[1:])
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 118, in run
    self.arguments.action()
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 136, in shift
    self.input_file.shift(milliseconds=self.arguments.time_offset)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 179, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 127, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 155, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 186, in stream
    yield SubRipItem.from_lines(source)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtitem.py", line 58, in from_lines
    return cls(index, start, end, body, position)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtitem.py", line 21, in __init__
    self.index = int(index)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string

python -V
Python 2.7.2+

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric

byroot commented 12 years ago

Hmm, very strange... so does it always happen, whatever the subtitle file?

And how did you install it? Because /data/share/_films/Game of Thrones_S02E02/src/ is a very strange location...

limpbrains commented 12 years ago

I've only tried a few files, all Russian, all UTF-8. I installed it from git: pip install -e git+https://github.com/byroot/pysrt.git#egg=pysrt

byroot commented 12 years ago

Ok, I still can't reproduce but now I'm almost sure that it's a BOM issue...

I will ask a friend on Ubuntu to test that.

Did you try the version released on PyPI? pip install --upgrade pysrt

limpbrains commented 12 years ago

I confirm it is a BOM issue. I successfully edited a file without a BOM, created with Notepad++. I also tried the command srt -e utf_8_sig ... but it failed with the same error.

byroot commented 12 years ago

Pysrt is supposed to handle BOM correctly...

And the file you gave me is in cp1252, so why would it have a UTF-8 BOM? Can you send me another file?

Diaoul commented 11 years ago

I'm having the same issue. The file is here: https://docs.google.com/open?id=0B2q9iBGZdj6qN29uUzBBQXNJM2c

byroot commented 11 years ago

I finally found the issue: chardet returned "UTF-8" while the encodings module was only aware of "utf-8".

My bad ...
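
For anyone hitting the same kind of mismatch: a minimal sketch of normalizing whatever spelling a detector returns before using it as a lookup key. The BOMS table and the helper names here are hypothetical, for illustration only; this is not pysrt's actual code. The only library call relied on is codecs.lookup(), which resolves aliases such as "UTF-8", "utf_8" and "utf8" to the same codec.

```python
import codecs

# Hypothetical BOM table keyed by canonical codec names (not pysrt's real table).
BOMS = {
    "utf-8": codecs.BOM_UTF8,
}


def canonical_encoding(name):
    """Return the name Python's codec registry uses for this encoding."""
    return codecs.lookup(name).name


def bom_for(name):
    """Find the BOM for an encoding, whatever spelling the detector returned."""
    return BOMS.get(canonical_encoding(name))
```

A plain BOMS.get("UTF-8") would miss the entry, while bom_for("UTF-8") finds it because the key is normalized first.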

Diaoul commented 11 years ago

Is this fixed in 0.4.4? I still get this error.

byroot commented 11 years ago

I think so. Do you still have the issue with this same file and pysrt 0.4.4?

byroot commented 11 years ago

Oh shit ... confirmed, I'll fix that right now.

byroot commented 11 years ago

Oh, I just forgot to release ...

byroot commented 11 years ago

0.4.5 released with the fix.

Diaoul commented 11 years ago

Thanks, that was fast :)

Diaoul commented 11 years ago

I'm still getting an error :cry: I added a print statement to see what's in lines, and I got this:

[u'\ufeff1\r\n', u'00:00:01,677 --> 00:00:04,145\r\n', u'Alors, sur quel genre de croisi\xe8re\r\n', u'allez-vous embarquer ?\r\n']

Diaoul commented 11 years ago

Of course int(u'\ufeff1\r\n') fails. The file can be downloaded from Addic7ed.
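
A small sketch of why that fails and one way around it, assuming the index line is already decoded to unicode. int() ignores surrounding whitespace such as "\r\n", but U+FEFF is not whitespace, so it has to be removed explicitly. The parse_index helper is hypothetical, not part of pysrt.

```python
BOM = u"\ufeff"


def parse_index(line):
    """Parse a subtitle index line, tolerating a leading BOM."""
    # int() already skips the trailing "\r\n"; only the BOM needs stripping.
    return int(line.lstrip(BOM))


def fails_without_strip(line):
    """Return True if int() rejects the raw line."""
    try:
        int(line)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return True
    return False
```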

Diaoul commented 11 years ago

Sample code to reproduce the error:

from charade.universaldetector import UniversalDetector
import codecs
import pysrt

def is_valid_subtitle(path):
    u = UniversalDetector()
    for line in open(path, 'rb'):
        u.feed(line)
    u.close()
    encoding = u.result['encoding']
    source_file = codecs.open(path, 'rU', encoding=encoding, errors='replace')
    try:
        for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
            pass
    except pysrt.Error as e:
        if e.args[0] < 50:  # Error occurs within the 50 first lines
            return False
#    except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
#        pass
    return True

byroot commented 11 years ago

Oh! It makes sense now. If you open the file yourself, pysrt does not strip the BOM.

Anyway, chardet is integrated into pysrt now.

Try something like:

import pysrt

def is_valid_subtitle(path):
    # Let pysrt open the file so it detects the encoding and strips the BOM
    source_file = pysrt.SubRipFile._open_unicode_file(path)
    try:
        for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
            pass
    except pysrt.Error as e:
        if e.args[0] < 50:  # Error occurs within the first 50 lines
            return False
#    except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
#        pass
    return True
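
If you would rather not depend on a private helper and you already know the file is UTF-8, Python's own "utf-8-sig" codec drops a leading BOM while decoding. This is only a sketch under that assumption: wrap_without_bom is a hypothetical name, and whether the resulting object can be handed to SubRipFile.stream is not verified here.

```python
import io


def wrap_without_bom(binary_file, encoding="utf-8-sig"):
    """Decode a binary stream as text, discarding a leading UTF-8 BOM if present."""
    # newline="" keeps the original "\r\n" line endings untouched.
    return io.TextIOWrapper(binary_file, encoding=encoding, newline="")
```

Usage would look like wrap_without_bom(open(path, 'rb')); the first line then reads as u'1\r\n' instead of u'\ufeff1\r\n'. It does nothing useful for files in other encodings such as cp1251.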