byroot / pysrt

Python parser for SubRip (srt) files
GNU General Public License v3.0

Problem with SubRipFile.from_string #4

Closed keul closed 13 years ago

keul commented 13 years ago

I have a problem with the from_string API: a Unicode error that I'm not able to fix.

The file is here (for application reasons I can't use the SubRipFile.open method): http://releases.flowplayer.org/data/buffalo.srt

Some tested examples:

>>> from pysrt import SubRipFile
>>> p = '/Users/luca/Documents/buffalo.srt'
>>> SubRipFile.open(p)
Traceback (most recent call last):
  File "", line 1, in ?
  File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
    source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 47-48: invalid data
>>> SubRipFile.open(p, encoding='latin1')
[... THIS IS OK, IT WORKS ...]
>>> st = open(p).read()
>>> SubRipFile.from_string(st)
Traceback (most recent call last):
  File "", line 1, in ?
  File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 107, in from_string
    return cls.open(file_descriptor=StringIO(source))
  File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
    source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 49-50: invalid data
>>> SubRipFile.from_string(st.decode('iso-8859-1'))
Traceback (most recent call last):
  File "", line 1, in ?
  File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 107, in from_string
    return cls.open(file_descriptor=StringIO(source))
  File "/Users/luca/Library/Buildout/eggs/pysrt-0.2.4-py2.4.egg/pysrt/srtfile.py", line 81, in open
    source = unicode(string_buffer.read(), new_file.encoding)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 17348-17349: invalid data
>>> SubRipFile.from_string(st.decode('iso-8859-1').encode('utf-8'))
[]

Any tips? Right now I can work around this problem with a temp file, but it seems there is a problem in the method itself.

keul commented 13 years ago

Uhmmm... let's re-post the test:

http://pastie.org/1411797

byroot commented 13 years ago

Since you've closed the ticket, I suppose you've found the answer.

SubRipFile.open takes an optional named argument "encoding", which defaults to 'utf-8'.

But most subtitles are encoded in iso-8859-1 or, worse, in cp1252, so to open your file:

SubRipFile.open('buffalo.srt', encoding='cp1252')
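
For illustration, the decoding failure can be reproduced with the standard library alone. This is a minimal Python 3 sketch, not pysrt code: `decode_with_fallback` is a hypothetical helper and the list of candidate encodings is an assumption. Note that iso-8859-1 accepts every byte sequence, so it only makes sense as the last candidate.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
    raise ValueError("none of the candidate encodings could decode the data")


# An accented 'e' is the single byte 0xE9 in cp1252, which is not valid UTF-8 on its own.
raw = "caf\u00e9".encode("cp1252")
text, used = decode_with_fallback(raw)
print(used, ascii(text))
```

A successful decode does not prove the guess was right, so passing the known encoding explicitly remains the safer option when it is available.
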
keul commented 13 years ago

I've closed the ticket by mistake I think!

I didn't find any solution. I was aware of the "encoding" parameter of the .open method, but the problem is with .from_string.

byroot commented 13 years ago

Oh sorry, I read your ticket too quickly.

The from_string method was developed for testing purposes. If you want to use it, I should make it accept encoding and eol arguments.

Until I fix that, if you need to parse a string you can do this:

from StringIO import StringIO
SubRipFile.open(file_descriptor=StringIO(open('buffalo.srt').read().replace('\r\n', '\n')), encoding='cp1252')
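
The same idea, sketched in Python 3 with the standard library only: decode the bytes, normalise the line endings, and wrap the result in an in-memory stream that can be handed to anything expecting a file-like object. `text_stream` is a hypothetical helper name, and the sample bytes are made up for the example.

```python
import io


def text_stream(raw, encoding="cp1252"):
    """Decode raw bytes and wrap them in an in-memory text stream."""
    text = raw.decode(encoding)
    # Normalise Windows line endings so blank-line separators become plain '\n\n'.
    text = text.replace("\r\n", "\n")
    return io.StringIO(text)


# Made-up sample: one cue with Windows line endings.
raw = b"1\r\n00:00:01,000 --> 00:00:02,000\r\nBuffalo\r\n\r\n"
stream = text_stream(raw)
lines = stream.read().split("\n")
print(lines)
```
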

keul commented 13 years ago

Your example helped me! The problem seems to be the '\r\n' line endings (maybe you could handle this internally?).

Now I can avoid the temp file and simply use .open with the file_descriptor parameter!

For me this is enough, thanks again.

PS: maybe you could enhance the PyPI page a little with these additional examples. This way of creating a SubRipFile is really interesting.

byroot commented 13 years ago

I've just pushed a new 0.2.5 version. SubRipFile.from_string now supports encoding and eol arguments, so you can do:

SubRipFile.from_string(st, encoding='cp1252', eol='\r\n')
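
To show why both arguments matter, here is a rough standard-library sketch (Python 3) of a string parser that takes an encoding and an eol. It is not pysrt's implementation: `parse_srt_string` and its tuple output are hypothetical, and the sample data is made up.

```python
def parse_srt_string(raw, encoding="utf-8", eol="\n"):
    """Split raw SubRip bytes into (index, timing, text) tuples."""
    text = raw.decode(encoding)
    items = []
    # Cues are separated by a blank line, i.e. two consecutive line endings.
    for block in text.split(eol + eol):
        lines = [line for line in block.split(eol) if line.strip()]
        if len(lines) < 3:
            continue  # skip empty or incomplete blocks
        index, timing = int(lines[0]), lines[1]
        items.append((index, timing, "\n".join(lines[2:])))
    return items


# Made-up sample: two cues, cp1252 bytes, Windows line endings.
raw = (
    "1\r\n00:00:01,000 --> 00:00:02,000\r\nD\u00e9j\u00e0 vu\r\n\r\n"
    "2\r\n00:00:03,000 --> 00:00:04,000\r\nBuffalo\r\n\r\n"
).encode("cp1252")

cues = parse_srt_string(raw, encoding="cp1252", eol="\r\n")
wrong_eol = parse_srt_string(raw, encoding="cp1252", eol="\n")
print(len(cues), len(wrong_eol))
```

With the matching eol the sketch finds both cues. With eol='\n' it never sees a blank-line separator in this sample, so the two cues collapse into a single block.
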

And you're right, I should write some documentation, but you know ...