cgmb / guardonce

Utilities for converting from C/C++ include guards to #pragma once and back again.
MIT License

Failures handling unicode in headers when using Python 3 on Windows #21

Closed cgmb closed 6 years ago

cgmb commented 6 years ago

My second encoding bug. :\

Apparently, the default encoding on Windows is CP-1252 (at least on a typical Western-locale install), and that is what Python 3 uses when a file is opened in text mode without an explicit encoding. A not-uncommon scenario is processing a file that turns out to be UTF-8. In that case, decoding fails during the file read if the file contains a byte sequence that's invalid in CP-1252. Fancy quotes, for example.
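To make the failure concrete, here is a minimal sketch that reproduces it without touching the filesystem. The header text is made up for illustration; the only real facts it relies on are that a right double quotation mark (U+201D) is encoded in UTF-8 as the bytes E2 80 9D, and that byte 0x9D has no mapping in Python's cp1252 codec.

```python
# Hypothetical header contents, encoded the way a UTF-8 editor would save them.
data = "// \u201cfancy quotes\u201d\n#pragma once\n".encode("utf-8")

# Decoding explicitly as cp1252 mimics what a text-mode read does on a
# Windows machine whose default encoding is CP-1252.
try:
    data.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("decode failed:", err)

# Decoding with the file's real encoding works fine.
text = data.decode("utf-8")
```

Not every non-ASCII character triggers this. The left quote (E2 80 9C) happens to decode under cp1252, just to the wrong characters, so some files fail loudly and others are silently misread.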

This only happens on Python 3, because Python 2 never decodes the string. There's really no need to decode it: any bytes outside the ASCII range are irrelevant to guardonce and can be passed through without modification. To my knowledge, the only popular-ish encodings that mangle the ASCII range are UTF-16 and UTF-32, so aside from files in those encodings, the Python 2 approach of being Unicode-oblivious works great.

In general, there's no way to know the encoding of a given file. Given that my parsing will work on nearly all encodings aside from UTF-16 and UTF-32, I'm tempted to switch to byte strings in Python 3 so that I get the same behaviour as Python 2 and can bypass the whole character-encoding guessing game.
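A rough sketch of what the byte-string approach could look like is below. This is not guardonce's actual code: `PRAGMA_ONCE`, `has_pragma_once`, and `replace_pragma_once` are hypothetical names, and the regular expression is a simplification. The point is only that a bytes pattern matches the ASCII parts of a header and leaves every other byte alone, whatever the encoding.

```python
import re

# Hypothetical pattern: "#pragma once" on its own line, matched as bytes.
PRAGMA_ONCE = re.compile(rb"^[ \t]*#[ \t]*pragma[ \t]+once[ \t]*\r?$", re.MULTILINE)

def has_pragma_once(data):
    """Return True if the raw bytes contain a '#pragma once' line."""
    return PRAGMA_ONCE.search(data) is not None

def replace_pragma_once(data, replacement):
    """Swap the first '#pragma once' line for the given bytes."""
    # A callable replacement avoids backslash-escape processing.
    return PRAGMA_ONCE.sub(lambda m: replacement, data, count=1)

# Files would be read with open(path, "rb"); in-memory samples are used here.
utf8_header = "// \u201cfancy quotes\u201d\n#pragma once\nint x;\n".encode("utf-8")
latin1_header = "// caf\u00e9\n#pragma once\n".encode("latin-1")
utf16_header = "#pragma once\n".encode("utf-16")

converted = replace_pragma_once(utf8_header, b"#ifndef FOO_H\n#define FOO_H")

print(has_pragma_once(utf8_header))
print(has_pragma_once(latin1_header))
print(has_pragma_once(utf16_header))
```

The UTF-8 and Latin-1 samples both match even though no encoding was specified, and the non-ASCII bytes in the comment come out of `replace_pragma_once` untouched. The UTF-16 sample does not match, because each ASCII character is followed by a zero byte, which is the limitation described above.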