Diaoul / subliminal

Subtitles, faster than your thoughts
http://subliminal.readthedocs.org
MIT License

Proper encoding detection #528

Open Diaoul opened 8 years ago

Diaoul commented 8 years ago

For proper encoding detection, a rock-solid test suite is mandatory. This issue aims to gather real-world test cases for every language. Please provide links to subtitles and give their correct encoding.
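
As an illustration of what such a test suite might look like, here is a minimal sketch. The `EncodingCase` type, the `CASES` table, and the `run_cases` helper are hypothetical names, not part of subliminal, and the in-memory samples stand in for the real subtitle files requested here. A detector is assumed to be any callable that takes the raw bytes and a language code and returns a codec name.

```python
from typing import Callable, List, NamedTuple


class EncodingCase(NamedTuple):
    language: str   # language code of the subtitle
    encoding: str   # encoding the subtitle is known to use
    text: str       # text the subtitle should decode to


# Hypothetical samples; a real suite would load downloaded subtitle files.
CASES = [
    EncodingCase("tr", "iso-8859-9", "ğ ş ı"),
    EncodingCase("fr", "windows-1252", "cœur déjà"),
    EncodingCase("ru", "windows-1251", "Привет"),
]


def run_cases(detect: Callable[[bytes, str], str],
              cases: List[EncodingCase] = CASES) -> List[EncodingCase]:
    """Return the cases whose bytes do not decode back to the expected text."""
    failures = []
    for case in cases:
        raw = case.text.encode(case.encoding)
        guessed = detect(raw, case.language)
        try:
            decoded = raw.decode(guessed)
        except (UnicodeDecodeError, LookupError):
            decoded = None  # wrong or unknown codec counts as a failure
        if decoded != case.text:
            failures.append(case)
    return failures
```

Comparing the decoded text rather than the codec name means that a detector answering with an equivalent codec for a given file would still pass.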


ernyldrm commented 8 years ago

I confirm that as of today, with version 1.0.1, all Turkish subtitles are encoded incorrectly. The ğ, ş and ı characters are not displayed correctly. When I open the subtitle with Sublime Text 2 or TextEdit, it shows up wrong as well. Could this have something to do with my locale settings? My "locale -a" output is:

C
C.UTF-8
en_US.utf8
POSIX
tr_TR
tr_TR.iso88599
tr_TR.utf8
turkish

and my "locale" output is:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="tr_TR.UTF-8"
LC_NUMERIC="tr_TR.UTF-8"
LC_TIME="tr_TR.UTF-8"
LC_COLLATE="tr_TR.UTF-8"
LC_MONETARY="tr_TR.UTF-8"
LC_MESSAGES="tr_TR.UTF-8"
LC_PAPER="tr_TR.UTF-8"
LC_NAME="tr_TR.UTF-8"
LC_ADDRESS="tr_TR.UTF-8"
LC_TELEPHONE="tr_TR.UTF-8"
LC_MEASUREMENT="tr_TR.UTF-8"
LC_IDENTIFICATION="tr_TR.UTF-8"
LC_ALL=tr_TR.UTF-8

Thanks for bringing this up. I'm a coder, but not a Python one. How can I help?
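
One way these three characters can end up mangled is sketched below. It assumes, purely for illustration, that the subtitle bytes are ISO-8859-9 (Turkish) and that they are decoded as Windows-1252; the thread does not confirm that this is what happens here. The sketch decodes bytes with explicit codec names and does not consult the locale settings.

```python
# Assumption: bytes are ISO-8859-9 (Turkish) but get read as Windows-1252.
original = "ğşı"

# Encode the Turkish letters the way a Turkish-encoded subtitle would store them.
raw = original.encode("iso-8859-9")

# Decoding the same bytes with the wrong codec yields Western European letters.
garbled = raw.decode("windows-1252")

# Reversing the wrong decode and applying the right codec recovers the text.
repaired = garbled.encode("windows-1252").decode("iso-8859-9")

print(raw)       # b'\xf0\xfe\xfd'
print(garbled)   # ðþý
print(repaired)  # ğşı
```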

Diaoul commented 8 years ago

Please provide links to subtitles and give their correct encoding.

Chimerique commented 8 years ago

Hi. http://dl.opensubtitles.org/en/download/sub/6367306 This should be windows-1252.

pannal commented 8 years ago

I've added correct Windows-1250 and Windows-1251 detection to Sub-Zero, as well as support for formats other than SRT, normalizing to SRT, etc.

@Diaoul perhaps you want to take a look at this.
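
For readers unfamiliar with this kind of detection, a common approach is to try strict UTF-8 first and then fall back to legacy code pages chosen by subtitle language. The sketch below illustrates that idea only; it is not Sub-Zero's implementation, and the `FALLBACKS` table, the `DEFAULT` list, and the `guess_encoding` function are hypothetical.

```python
import codecs

# Hypothetical table: legacy code pages to try per language, in order.
FALLBACKS = {
    "tr": ["windows-1254", "iso-8859-9"],
    "ru": ["windows-1251"],
    "pl": ["windows-1250"],
    "cs": ["windows-1250"],
}
DEFAULT = ["windows-1252"]


def guess_encoding(raw: bytes, language: str) -> str:
    """Guess a codec name for raw subtitle bytes, given the subtitle language."""
    # A UTF-8 byte order mark is an unambiguous signal.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Legacy 8-bit text with non-ASCII bytes is rarely valid UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back to the code pages usually associated with the language.
    for encoding in FALLBACKS.get(language, DEFAULT):
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails.
    return "latin-1"
```

A language hint is needed because many byte sequences decode without error under several single-byte code pages, so a successful decode alone does not prove the guess is right.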

Diaoul commented 8 years ago

Thanks, I will grab that. I'm working on a feature that will abstract the subtitle file format and make it possible to convert subtitles to various formats. I will push as soon as I have something viable.