UTF Characters in Movie title.

brandonganem commented 7 years ago

$ python2.7 autorippr.py --all --debug
2016-10-26 21:20:09 - Rip - DEBUG - Ripping initialised
2016-10-26 21:20:09 - Rip - DEBUG - Checking for DVDs
2016-10-26 21:20:16 - Rip - DEBUG - 1 DVD(s) found
2016-10-26 21:20:16 - Makemkv - DEBUG - Detected movie Les Miserables Dom
2016-10-26 21:20:41 - Makemkv - DEBUG - MakeMKV found 1 titles
2016-10-26 21:20:41 - Makemkv - DEBUG - MakeMKV title info: Disc Title: ['Les Mis\xc3\xa9rables'], Title No.: 0, Title: ['Les_Mis\xc3\xa9rables_t00.mkv'],
2016-10-26 21:20:41 - Rip - DEBUG - Attempting to rip Les_Misérables_t00.mkv from Les Miserables Dom
2016-10-26 21:50:48 - Rip - INFO - It took 30 minute(s) to complete the ripping of Les_Misérables_t00.mkv from Les Miserables Dom
2016-10-26 21:50:48 - Eject - DEBUG - Ejecting drive: "/dev/sr0"
2016-10-26 21:50:48 - Eject - DEBUG - Attempting OS detection
2016-10-26 21:50:48 - Eject - DEBUG - OS detected as Unix
2016-10-26 21:50:52 - Eject - DEBUG - eject: device name is `/dev/sr0'
2016-10-26 21:50:52 - Eject - DEBUG - eject: /dev/sr0: not mounted
2016-10-26 21:50:52 - Eject - DEBUG - eject: /dev/sr0: is whole-disk device
2016-10-26 21:50:52 - Eject - DEBUG - eject: /dev/sr0: trying to eject using CD-ROM eject command
2016-10-26 21:50:52 - Eject - DEBUG - eject: CD-ROM eject command succeeded
2016-10-26 21:50:52 - Compress - DEBUG - Compressing initialised
2016-10-26 21:50:52 - Compress - DEBUG - Looking for videos to compress
Traceback (most recent call last):
  File "autorippr.py", line 419, in <module>
    compress(config)
  File "autorippr.py", line 272, in compress
    dbvideo.filename, dbvideo.vidname))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128)
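
For context, the failure in the traceback can be reproduced in isolation. This is only a sketch, written for Python 3 (the log comes from Python 2.7, where the same encoding step happens implicitly when a unicode value is formatted into a byte string). The file name is taken from the log; the variable names are illustrative and not part of Autorippr.

```python
# Illustrative only: reproduces the class of error shown in the traceback.
filename = "Les_Mis\u00e9rables_t00.mkv"  # same title as in the debug log

try:
    # The ASCII codec cannot represent "é", so this raises UnicodeEncodeError.
    filename.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Encoding to UTF-8 works and produces the \xc3\xa9 pair seen in the log.
utf8_bytes = filename.encode("utf-8")

print(error)
print(utf8_bytes)
```

The error position reported here matches the "position 7" in the traceback, since "é" is the eighth character of the file name.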

brandonganem commented 7 years ago

les_miserables.txt

JasonMillward commented 7 years ago

Thanks for the detailed logs. I'll see what I can do about it this weekend.

srounet commented 7 years ago

I think my pull request addresses this problem; I had the exact same issue with "Master_and_Commander_De_l'autre_côté_du_monde_t00.mkv".

The commit addresses the issue by mapping accented characters to their unaccented equivalents and removing some special characters, such as single and double quotes.
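
A minimal sketch of that kind of cleanup, assuming Python 3 and the standard unicodedata module. This is not the code from the pull request; clean_title is a hypothetical helper name used only for illustration.

```python
import unicodedata


def clean_title(title):
    """Hypothetical helper: strip accents and quotes from a title."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks, keeping only the base letters.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Remove single and double quotes.
    return without_accents.replace("'", "").replace('"', "")


print(clean_title("Master_and_Commander_De_l'autre_côté_du_monde_t00.mkv"))
```

Letters that have no Unicode decomposition, such as Ø or Æ, pass through this function unchanged, so an approach based only on accent stripping would not cover every title.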

knoer commented 7 years ago

I think I hit something similar yesterday, when I got around to installing and trying Autorippr.

In Danish (and Norwegian) we use some special characters: Æ/æ, Ø/ø and Å/å. Similarly, in Swedish these are Ä/ä, Ö/ö and Å/å. Normally, on a non-Nordic keyboard, they would be substituted with AE/ae, OE/oe and AA/aa.

As part of my test yesterday, I tried to rip and compress a DVD of "Dinner for One", which in Danish is called "90 års fødselsdagen" ("the 90 year birthday"); hence the DVD title is "90ÅRS".

The ripping part supposedly worked as expected, but immediately after MakeMKV finished, I got an error. Unfortunately, I currently do not have access to my log files, but I recall a line similar to brandonganem's:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128)

only the character in question was (again, as I recall) 0'\x3c'

Fortunately, this error is very reproducible, so I will reproduce it and get back with further details/logs. (I realise that in order for this file to be recognized by any scraper tool, I may need to rename it to the English title, and I can work around this by manually sending the file through Handbrake.)

The question is: is it worthwhile to implement a fix for this? And if so, could/should it be done by one of the following approaches?

implement substitution of these characters in a function in classes/utils.py
implement support for 2-byte UTF-8 characters in general
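
The first approach could look roughly like the sketch below, assuming Python 3. The mapping follows the keyboard substitutions described above; NORDIC_MAP and substitute_nordic are hypothetical names, not existing functions in classes/utils.py.

```python
import unicodedata

# Hypothetical mapping based on the substitutions mentioned in this comment.
NORDIC_MAP = str.maketrans({
    "Æ": "AE", "æ": "ae",
    "Ø": "OE", "ø": "oe",
    "Å": "AA", "å": "aa",
    "Ä": "AE", "ä": "ae",
    "Ö": "OE", "ö": "oe",
})


def substitute_nordic(title):
    """Hypothetical helper: replace Nordic letters with ASCII digraphs."""
    # Normalise to composed form first so "A" + combining ring becomes "Å".
    composed = unicodedata.normalize("NFC", title)
    return composed.translate(NORDIC_MAP)


print(substitute_nordic("90ÅRS"))
```

A fixed table like this is easy to reason about but only covers the letters listed in it; any other non-ASCII character would still need separate handling.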

JasonMillward commented 7 years ago

@knoer if you check out this pull request https://github.com/JasonMillward/Autorippr/pull/125 you can see that @srounet has addressed the issue.

If you check out a copy of the master branch you should get these changes and they might solve your problem too.

knoer commented 7 years ago

I checked out the repo just last night, and looking through autorippr.py, I am sure I recall the comment about the Master and Commander string conversion, so I should be running the latest code..?

If I recall correctly, the Nordic special characters are part of ISO-8859-1, and my system may very well be running this charset as default for the same reason. I will investigate whether this is true during the weekend.
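
A quick way to check this, sketched for Python 3: each of the Danish/Norwegian letters encodes to a single byte in ISO-8859-1 and to two bytes in UTF-8. The variable names are illustrative only.

```python
letters = "ÆØÅæøå"

# Map each letter to (bytes in ISO-8859-1, bytes in UTF-8).
encodings = {
    ch: (ch.encode("iso-8859-1"), ch.encode("utf-8")) for ch in letters
}

for ch, (latin1, utf8) in encodings.items():
    print(ch, latin1, utf8)

# A lone ISO-8859-1 byte for "Å" is not valid UTF-8 on its own.
try:
    b"\xc5".decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

This is why a title stored as ISO-8859-1 bytes cannot simply be treated as UTF-8: the decoder has to know which charset the bytes came from.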

I find it hard to estimate the value of spending time on handling a single case of a character conversion gone wrong if no one else is having the same issue, so I'll let you be the judge of that.

In any case, I guess I will try experimenting a bit with charset encoding before attempting a string cleanup using the functions in utils.py. A couple of years ago I tried getting into Python; this might be a good time to pick it up again. :-)

As a general solution, would it be possible to detect the system charset and decode strings from that charset to UTF-8 as part of the string cleanup process, maybe selectable by a parameter in settings.cfg?
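
One possible shape for this idea, as a sketch under stated assumptions: it is written for Python 3, it tries UTF-8 first and only then falls back to the system charset reported by locale.getpreferredencoding. The function name to_text and the fallback_encoding parameter are hypothetical; in practice the fallback could be read from a setting in settings.cfg, but no such option is known to exist.

```python
import locale


def to_text(raw, fallback_encoding=None):
    """Hypothetical helper: decode raw bytes from a disc title into text."""
    if isinstance(raw, str):
        return raw  # already decoded
    try:
        # Most modern systems hand over UTF-8, so try that first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise use the caller's choice or the system default charset.
        encoding = fallback_encoding or locale.getpreferredencoding(False)
        # errors="replace" avoids a second exception on undecodable bytes.
        return raw.decode(encoding, errors="replace")


print(to_text(b"Les Mis\xc3\xa9rables"))
print(to_text(b"90\xc5RS", fallback_encoding="iso-8859-1"))
```

Trying UTF-8 first is a deliberate choice here: valid UTF-8 is rarely produced by accident, whereas ISO-8859-1 accepts any byte sequence and would silently mis-decode UTF-8 input.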

This minor issue aside, I still find Autorippr an awesome tool for backing up media (and avoiding the kids (man)handling disks) at home!

UTF Characters in Movie title. #124