cgmb / guardonce

Utilities for converting from C/C++ include guards to #pragma once and back again.
MIT License

Handling non-ascii encoding with --stdout #26

Closed by kwesolowski 5 years ago

kwesolowski commented 6 years ago

I had the bad luck of running it on some UTF-8 code, which produced this error:

Error processing /home/kwesolow/.... (UnicodeEncodeError) 'ascii' codec can't encode character '\xe9' in position 123938: ordinal not in range(128)

The error doesn't seem to occur when running in place?

With clang-format it works OK, and I can, for example, capture correctly encoded output via fixed_content = subprocess.check_output([GUARD2ONCE_CMD, path, '--stdout']).decode()

cgmb commented 6 years ago

I think this is because I can set the encoding when opening a file to write to, but I can't set the encoding for stdout.

What OS and what version of Python are you using? If you have both Python 2 and 3 available, you might want to try the other. This works quite differently on each.

kwesolowski commented 6 years ago

I installed this via pip3, so Python 3 is what I'm using. There may be a useful idea in https://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3/33470043, and I think UTF-8 for stdout would be safe (it will work for both ASCII and UTF-8 payloads).
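
For illustration, one way to force UTF-8 on an output stream in Python 3 is to wrap the underlying byte stream in a new text wrapper. This is only a sketch of the general technique, not guardonce's code; the helper name utf8_text_stream is made up for this example.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text stream that always encodes as UTF-8.

    Hypothetical helper: it ignores whatever encoding Python guessed
    from the locale or terminal.
    """
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# Intended use (assumption: stdout has not been replaced by another object):
#   import sys
#   out = utf8_text_stream(sys.stdout.buffer)
#   out.write(converted_source)
#   out.flush()
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") achieves much the same thing without creating a second wrapper.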

cgmb commented 6 years ago

That might be an easy fix then.

cgmb commented 6 years ago

After a bit of testing, it seems that this happens only with Python 3 while using --stdout. The most likely place you'll encounter it is on Windows when piping the output into some other program, though the behaviour can be reproduced on any platform if you go out of your way to break it.

The problem is that stdout is claiming to be an ascii stream and, if that's really the case, it's not possible to correctly output a utf-8 file containing non-ascii characters into that stream. Arguably, the real bug is that guardonce under Python 2 doesn't fail with an error! Of course, it's probably more likely that Python 3 is wrong and the output stream is not really limited to ascii. In that case, changing the output encoding to utf-8 would indeed fix the problem.
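
The failure mode can be sketched without involving a real terminal: a text stream that claims to be ascii rejects the character from the error report (U+00E9), while the same text goes through a utf-8 stream untouched. The function name try_write is invented for this example, and an in-memory buffer stands in for stdout.

```python
import io


def try_write(text, encoding):
    """Write text through a text stream with the given encoding.

    Returns the bytes that were written, or the UnicodeEncodeError
    that the stream raised.
    """
    raw = io.BytesIO()  # stand-in for the byte stream beneath stdout
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()


ascii_result = try_write("caf\u00e9", "ascii")  # a UnicodeEncodeError instance
utf8_result = try_write("caf\u00e9", "utf-8")   # the UTF-8 encoded bytes
```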

As a workaround, try running set PYTHONIOENCODING=utf-8 before running guard2once. That environment variable was mentioned in the stackoverflow link you posted and it seems to work in my testing with Python 3.6.

kwesolowski commented 6 years ago

I experienced it on Ubuntu 16.04, running Docker with an Ubuntu 16.04 image, so it's not only Windows. And I can confirm that PYTHONIOENCODING=utf-8 call-my-tool.py works as a good workaround.

My tool captures output with subprocess.check_output(guard2once, ....).decode(), so pipes/streams are probably being used under the hood.

cgmb commented 6 years ago

Odd. I only managed to reproduce the error on Ubuntu by setting PYTHONIOENCODING=ascii. Otherwise it worked fine for me, even using subprocess.check_output(guard2once, ....).decode() with Ubuntu 16.04's python3 (3.5.2).
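
Both the reproduction and the workaround can be sketched with a child interpreter standing in for guard2once. Everything here is an assumption made for illustration: CHILD_CMD is a hypothetical substitute for the real command, and run_with_stdout_encoding is an invented helper.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for "guard2once <path> --stdout": a child Python
# process that prints a single non-ASCII character (U+00E9).
CHILD_CMD = [sys.executable, "-c", "print('caf\\xe9')"]


def run_with_stdout_encoding(encoding):
    """Run the child with PYTHONIOENCODING set to the given encoding.

    Returns a (returncode, stdout_bytes) tuple.
    """
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    proc = subprocess.run(
        CHILD_CMD,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stdout
```

With "ascii" the child exits with a non-zero status because of the UnicodeEncodeError; with "utf-8" it succeeds and the captured bytes decode cleanly with .decode("utf-8").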

There seem to be workarounds for all major bugs I'm aware of, so I'm going to take my time on getting the next release out. There's a lot I want to change and I'm extremely busy with my thesis, so it will take a while.

cgmb commented 5 years ago

So, here's the state of things:

This behaviour is different between Python 2 and Python 3. When run with Python 2, guardonce will always output in the original file encoding. When run with Python 3, guardonce will always output in the original file encoding when writing files, but will attempt to output in your terminal's encoding when using --stdout. You can override the encoding that Python has guessed for stdout by setting PYTHONIOENCODING and PYTHONLEGACYWINDOWSSTDIO.
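
To see which encoding Python has picked for stdout, and to check that the override takes effect, something like the following sketch may help. The helper name child_stdout_encoding is made up for this example, and it only covers PYTHONIOENCODING, not PYTHONLEGACYWINDOWSSTDIO.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(override=None):
    """Return the stdout encoding reported by a child Python process.

    If override is given, it is passed as PYTHONIOENCODING; otherwise the
    variable is removed so the child falls back to its own guess.
    """
    env = dict(os.environ)
    if override is None:
        env.pop("PYTHONIOENCODING", None)
    else:
        env["PYTHONIOENCODING"] = override
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalise spelling differences such as "UTF-8" versus "utf-8".
    return codecs.lookup(out.decode("ascii").strip()).name
```

Calling child_stdout_encoding() with no argument shows the guess Python makes when its output is a pipe, which is the situation under subprocess.check_output.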

I'm not going to say that I'm in love with the Python 3 behaviour, but it seems that's how things are supposed to work. Trying to do something different would be a lot of effort, and I'm not sure it would be worth it.