emericg / OpenSubtitlesDownload

Automatically find and download the right subtitles for your favorite videos!
https://emeric.io/OpenSubtitlesDownload
GNU General Public License v3.0

Unexpected error (line 862): <class 'UnicodeEncodeError'> #78

Closed: DcR-NL closed this issue 3 years ago

DcR-NL commented 3 years ago

Tried this tool for the first time today and I can't figure out why I'm getting this error:

>> Downloading 'Dutch' subtitles for '"The Man in the High Castle" Hexagram 64'
Unexpected error (line 862): <class 'UnicodeEncodeError'>

Not sure if it's helpful, but subEncoding on line 861 results in UTF8, which seems to be correct for this sub.

Only settings I've touched:

Environment:

DcR-NL commented 3 years ago

Found the detailed error:

File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

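On Windows, Python's open() falls back to the locale encoding (cp1252 here) when no encoding is given, and cp1252 can't encode the '\ufeff' byte order mark at the start of the subtitle.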
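A minimal sketch of the failure, outside the script. The sample string is made up for illustration; only the leading '\ufeff' character and the cp1252 codec come from the traceback above.

```python
# Hypothetical subtitle text that starts with a byte order mark (U+FEFF),
# the character named in the traceback.
sample = "\ufeffHallo wereld"

try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    # cp1252 has no mapping for U+FEFF, so encoding fails at position 0.
    cp1252_ok = False

# UTF-8 can encode every Unicode character, including the BOM.
utf8_bytes = sample.encode("utf-8")

print("cp1252 could encode the text:", cp1252_ok)
print("first UTF-8 bytes:", utf8_bytes[:3])
```

Writing through open(..., 'w') without an encoding argument hits the same codec whenever the locale encoding is cp1252.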

Changed line 862 to: byteswritten = open(subPath, 'w', encoding=subEncoding).write(decodedStr)

And now it works as expected. If I knew it was a safe change, I would have opened a pull request, but I'm not familiar with this code, with Python, or with file encodings in general.

Is using the subEncoding a smart move or should a fixed utf-8 encoding be used here? What do you think?

//Edit: Well, I kind of answered my own question by trying different scenarios. I've run into a situation where it detects ASCII as subEncoding for a .ssa file and blows up again with the original error. Using a fixed utf-8 resolved that situation as well: byteswritten = open(subPath, 'w', encoding='utf-8').write(decodedStr)
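A rough sketch of the two scenarios described above, assuming a hypothetical helper named write_subtitle and a temporary file instead of the script's real subPath. It is not the script's actual code. It only shows why a detected 'ascii' encoding can fail on the same text that a fixed 'utf-8' writes without trouble.

```python
import os
import tempfile


def write_subtitle(path, text, encoding="utf-8"):
    """Hypothetical helper: write text with an explicit encoding.

    Returns the number of characters written, like file.write() does.
    """
    with open(path, "w", encoding=encoding) as handle:
        return handle.write(text)


# Made-up subtitle content with a BOM and one non-ASCII character.
text = "\ufeffDialogue: caf\u00e9"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.ssa")

    # Scenario 1: the detected encoding is 'ascii', but the text is not pure ASCII.
    try:
        write_subtitle(path, text, encoding="ascii")
        ascii_failed = False
    except UnicodeEncodeError:
        ascii_failed = True

    # Scenario 2: a fixed 'utf-8' can represent every character in the text.
    written = write_subtitle(path, text)
    with open(path, "rb") as handle:
        raw = handle.read()

print("ascii write failed:", ascii_failed)
print("characters written with utf-8:", written)
```

The with block also closes the file deterministically, which the one-line open(...).write(...) form leaves to the garbage collector.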

polak0v commented 3 years ago

Same issue here, resolved after using what @DcR-NL proposed: byteswritten = open(subPath, 'w', encoding='utf-8').write(decodedStr)

emericg commented 3 years ago

@DcR-NL Hi, and thanks for spending time to understand this issue.

I have to say I too am not familiar with either Python or file encodings ^^ Also, probably no one is... As you found out, using the encoding detected for the file is error-prone (the ASCII problem), but hardcoding an encoding is usually just as error-prone.

The last time we had a problem like this (which actually impacted the line right above the one we are discussing today), I settled on using the file encoding provided, but also adding error='replace' so that in case of failure only the more esoteric characters fail, not the entire file. Can you guys give this a go and see if it solves more cases?

byteswritten = open(subPath, 'w', encoding=subEncoding, error='replace').write(decodedStr)

DcR-NL commented 3 years ago

@emericg I see, good call. Thanks for the reply. I've tested your suggestion on all the problematic files I've found earlier, and it seems to work fine with this change! 😄

Watch out for the small typo in the parameter, as it should be errors instead of error:

byteswritten = open(subPath, 'w', encoding=subEncoding, errors='replace').write(decodedStr)

emericg commented 3 years ago

Thanks for catching the typo; I was just writing on GitHub and did not test it. All right then, let's try this new line for a while and see if another problem arises...
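For reference, a small sketch of how the errors='replace' variant behaves when the detected encoding cannot represent some characters. The sample text and temporary path are made up, and 'ascii' stands in for whatever subEncoding happens to be detected.

```python
import os
import tempfile

# Made-up text: two characters that ASCII cannot represent.
text = "caf\u00e9 \u2665"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.srt")

    # The misspelled keyword is rejected by open() before any file is created.
    try:
        open(path, "w", encoding="ascii", error="replace")
        typo_rejected = False
    except TypeError:
        typo_rejected = True

    # With errors='replace', unencodable characters are written as '?'
    # instead of raising UnicodeEncodeError for the whole write.
    with open(path, "w", encoding="ascii", errors="replace") as handle:
        handle.write(text)

    with open(path, "rb") as handle:
        raw = handle.read()

print("typo rejected:", typo_rejected)
print("bytes on disk:", raw)
```

The trade-off is that the write no longer fails, but the replaced characters are lost in the saved subtitle.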