Failures handling UTF-16 and UTF-32 encoded files

It seems that Visual Studio may generate UTF-16 header files for Resource Files. An example of such a file is renderdoccmd/resource.h. I expect that UTF-32 files have the same problem, though I have never encountered one.

Under Python 2, guardonce actually happens to handle this case correctly, as resource files don't have guards. checkguard notes that no guard was found, and both guard2once and once2guard ignore it. This is not because guardonce is behaving intelligently. Even if there were a guard, it would not be recognized, and guardonce would exhibit the same behaviour. That's not ideal, but as long as checkguard is telling you that the files are a problem, and as long as guard2once and once2guard do no harm to the files, it's acceptable.

Under Python 3, guardonce fails to decode the file to a string, and prints a cryptic error message:

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is from Linux, where UTF-8 is the default codec. There's probably a different message under Windows. The behaviour is mostly the same as under Python 2, but all of the utilities print that error message, and checkguard does not print the file name. It's hard to track down which file has the problem, because I'm not including enough information in that error message. That's not acceptable.
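
One way to make the message traceable would be to catch the decode failure and report the path alongside it. This is only a minimal sketch: read_text_or_none is a hypothetical helper, not guardonce's actual code, and it assumes files are decoded strictly as UTF-8 by default.

```python
from __future__ import print_function

import sys


def read_text_or_none(path, encoding="utf-8"):
    """Return the decoded contents of path, or None if decoding fails.

    Hypothetical helper for illustration; not part of guardonce.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        # Name the offending file so the user can find it.
        print("{}: skipped, {}".format(path, e), file=sys.stderr)
        return None
```

A caller would treat a None result as "leave this file alone", which keeps guard2once and once2guard from touching files they cannot read.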

It's hard to say what the right thing to do is. Programs like file and vim will guess these encodings, though sed and gcc won't. UTF-16 and UTF-32 are pretty distinctive. They will usually have a BOM, and when the content is mostly ASCII, a large percentage of the bytes in the file are going to be null. It's very unlikely that a real C header in UTF-8 or a single-byte encoding would start with a UTF-16 or UTF-32 BOM, or be full of null bytes.
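
The two signals mentioned here (a leading BOM and a high share of null bytes) could be checked roughly as follows. This is a sketch under stated assumptions: guess_wide_encoding and null_fraction are hypothetical names, and any threshold applied to the null fraction would be a guess that needs tuning.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_wide_encoding(raw):
    """Return 'utf-16' or 'utf-32' if raw starts with a matching BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def null_fraction(raw):
    """Fraction of bytes that are null; high for mostly-ASCII UTF-16/32 text."""
    if not raw:
        return 0.0
    return raw.count(b"\x00") / float(len(raw))
```

The "utf-16" and "utf-32" codecs in Python consume the BOM and pick the byte order from it, so the returned name can be passed straight to bytes.decode. Files without a BOM would fall through to None here, and the null fraction is the only remaining hint.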

Another possibility is to allow the user to specify the encodings of their files, but that may be complicated, as even in the renderdoc example above, most files in the repository are UTF-8 and there's only a single UTF-16 file. Many developers probably don't know how all their files are encoded, and there's probably a mixture of encodings within the repository.
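
If users were allowed to specify encodings, a mixed repository could be handled by trying a short list of candidates per file rather than one encoding for everything. A minimal sketch, assuming a hypothetical decode_with_candidates helper and a user-supplied list; no such option exists in guardonce today.

```python
def decode_with_candidates(raw, encodings=("utf-8",)):
    """Try each encoding in order; return (text, encoding) or (None, None).

    Hypothetical helper for illustration; not part of guardonce.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This candidate does not fit; try the next one.
            continue
    return None, None
```

Order matters in a scheme like this. UTF-8 is strict and rejects most UTF-16 data, so it is a reasonable first candidate, whereas UTF-16 accepts many byte sequences that are really something else and is better tried later.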

At least for now, the plan is to make Python 3's behaviour match Python 2. Everything beyond complaining about and ignoring these files is a bonus.
