Closed DcR-NL closed 3 years ago
Found the detailed error:
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
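On Windows, open() without an explicit encoding falls back to the locale's default, here cp1252, which can't encode the BOM character '\ufeff' at the start of the input.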
Changed line 862 to:
byteswritten = open(subPath, 'w', encoding=subEncoding).write(decodedStr)
And now it works as expected. If I knew it to be a safe change, I would've created a pull request, but I'm not familiar with this code, Python, or file encodings in general.
Is using the subEncoding a smart move or should a fixed utf-8 encoding be used here? What do you think?
//Edit: Well, I kind of answered my own question by trying different scenarios. I've run into a situation where it detects ASCII as subEncoding for a .ssa file and blows up again with the original error. Using a fixed utf-8 resolved that situation as well:
byteswritten = open(subPath, 'w', encoding='utf-8').write(decodedStr)
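To illustrate why the detected encoding failed while utf-8 did not, here is a minimal sketch. The sample string and the can_encode helper are hypothetical and not part of the tool's code; the only thing taken from the report above is the leading '\ufeff' character.

```python
# Hypothetical stand-in for the decoded subtitle text; the leading U+FEFF
# mirrors the character reported in the traceback.
decoded_str = "\ufeffDialogue: Hello"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` in strict mode."""
    try:
        text.encode(encoding)  # strict error handling, same default as open()
        return True
    except UnicodeEncodeError:
        return False


# cp1252 and ascii have no mapping for U+FEFF; utf-8 can encode it.
results = {name: can_encode(decoded_str, name) for name in ("cp1252", "ascii", "utf-8")}
print(results)
```

Under these assumptions, only utf-8 accepts the string, which matches the behaviour described above for both the default Windows encoding and the ASCII detection case.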
Same issue here, resolved after using what @DcR-NL proposed:
byteswritten = open(subPath, 'w', encoding='utf-8').write(decodedStr)
@DcR-NL Hi and thanks for spending time to understand this issue.
I have to say I too am not familiar with either Python or file encodings ^^ Also, probably no one is... Like you found out, using the encoding provided by the file is error-prone (the ASCII problem), but hardcoding an encoding is usually just as error-prone.
The last time we had a problem like this (which actually impacted the line right above the one we are discussing today), I settled on using the file encoding provided, but also added error='replace' so that in case of failure only the more esoteric characters fail, not the entire file. Can you guys give this a go and see if it solves more cases?
byteswritten = open(subPath, 'w', encoding=subEncoding, error='replace').write(decodedStr)
@emericg I see, good call. Thanks for the reply. I've tested your suggestion on all the problematic files I've found earlier, and it seems to work fine with this change! 😄
Watch out for the small typo in the parameter, as it should be errors instead of error:
byteswritten = open(subPath, 'w', encoding=subEncoding, errors='replace').write(decodedStr)
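As a rough sketch of what errors='replace' does here, and of why the misspelled error= keyword cannot work, the snippet below writes a sample string with a deliberately narrow encoding. The write_subtitle helper, the sample text, the temporary path, and the choice of 'ascii' are all assumptions for illustration, not the tool's actual code.

```python
import os
import tempfile

# Hypothetical sample: a BOM followed by a word with one non-ASCII character.
decoded_str = "\ufeffcaf\u00e9"
# Hypothetical detected encoding, chosen to force the failure case.
sub_encoding = "ascii"


def write_subtitle(path, text, encoding):
    """Write text, substituting '?' for characters the encoding cannot express."""
    with open(path, "w", encoding=encoding, errors="replace") as handle:
        # In text mode, write() returns the number of characters, not bytes.
        return handle.write(text)


tmp_dir = tempfile.mkdtemp()
sub_path = os.path.join(tmp_dir, "sample.ssa")
chars_written = write_subtitle(sub_path, decoded_str, sub_encoding)

with open(sub_path, "rb") as handle:
    raw = handle.read()

# The misspelled keyword is rejected by open() before any file is touched.
try:
    open(sub_path, "w", encoding=sub_encoding, error="replace")
    typo_rejected = False
except TypeError:
    typo_rejected = True

print(chars_written, raw, typo_rejected)
```

With this sample, the two characters that ASCII cannot express are written as '?' and the rest of the text survives. The with block is a design choice of the sketch so the file is closed deterministically; the one-liner above relies on the interpreter to close it.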
Thanks for catching the typo; I was just writing in GitHub and did not test it. All right then, let's try this new line for a while and see if another problem arises...
Tried this tool for the first time today and I can't figure out why I'm getting this error:
Not sure if it's helpful, but subEncoding on line 861 results in UTF8, which seems to be correct for this sub.
Only settings I've touched:
Environment: