Closed QtRoS closed 2 years ago
That probably makes sense. I think UTF-8
is a safe default and unless anyone needs a different encoding we don't even need a setting for the file encoding.
I just checked https://www.psycopg.org/docs/usage.html#strings-adaptation and it seams that just reading it as binary would assume the file encoding to be identical to the connection encoding so that wouldn't help either. Right now the code never specifies the client_encoding
thus in theory it could even fail when reading files as UTF-8
if there are characters in the file which aren't supported by the client_encoding
or vice verse.
It probably makes sense to also call Connection.set_client_encoding
as well?
Hello, @bikeshedder, thank you for your answer!
Seems that you are digging deeper than me. In my case the only problem is that my schema-files are in utf8, so they can't be simply read with open(...)
. When I tried to compare schema with DB, tusker failed with the very basic Python problem: can't decode bytes, bla-bla. So I tried to fix it as simply as that. Have no ideas about connection's encoding, this is out of my experience:)
By the way, now I've realized that schema-files can be stored, for example, in UTF with BOM as well and in any other (much more rare) encoding. So maybe I should add option in TOML, for example "schema_file_encoding"? Defaults to utf8. It's much more versatile.
@bikeshedder please check my latest commit too. I think it's now handy enough - you can simply set encoding in config file. Both for schema and migrations.
I'm slightly hesitant about merging this feature with the added configuration option. It has one implication that we haven't thought about, yet.
What if you're using a shell that's not UTF-8? The current code just does a print(sql, end='')
. In my own tests Python 3 nowadays seams to always assume UTF-8 shells so anything but UTF-8 for schema and migration files doesn't seam very useful.
I guess until someone else complains and wants to use schema
or migration
files with an encoding other than UTF-8 we should just hard code it to UTF-8.
I jsut wanted to implement it in a bit more flexible way. Have no complaints about hardcoding it to UTF8. If you ok with that, I will revert commit (git still has it history for possible future usage).
I just checked the documentation of https://docs.python.org/3/library/functions.html#open and after thinking about it the configuration option makes sense. Though it should also affect the terminal encoding. Otherwise the tusker diff
command is bound to fail.
On my system locale.getpreferredencoding(False)
returns UTF-8
so I never ran into this issue.
I wonder what you're getting. What happens when you run tusker diff
(with your patch applied) and pipe that result into a file. If I'm not mistaken you should end up with a file encoded with your system locale and not UTF-8.
I did experiment with this a bit. I've change my terminal to ISO-8859-1 (Latin1) encoding and ended up with the following situation:
The file is read as UTF-8, the migrations are read as UTF-8 but the output is written as Latin1.
The problem is the sys.stdout
which does report it being iso8859-1
encoding:
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='iso8859-1'>
I think a single encoding
parameter that changes the encoding of all files involved and stdout would be way better than having three separate values such as schema
encoding, migrations
encoding and stdout
encoding, no?
I wonder what you're getting
cp1251 actually
What happens when you run tusker diff (with your patch applied) and pipe that result into a file
You are right, I've got cp1251 output as well.
I think a single encoding parameter that changes the encoding of all files involved and stdout would be way better
Not sure about that. You know, there is such thing as legacy, one can have source files in cp1251 and system locale UTF8 for some reason. And I am not sure if it is good practice to change terminal's encoding - not seen this in projects so far. So I would rather stop at source files encoding, maybe with global option (not for both schema and migrations).
tusker diff
is meant to be viewed and then piped into a file. I guess that's where the trouble started.
After some hence and forth I came to the conclusion that I like the following behavior best:
tusker diff
always uses the terminal encoding for its output. Regardless if it's a TTY or not. That way it doesn't break chains of pipes like tusker diff | less
or tusker diff | tee ...
. Users don't like surprises when working with command line tools and that one would be especially frustrating when run into. This variant of command will be changed to print an error message to stderr
if the terminal encoding doesn't match the schema/migration encoding and stdout.isatty()
returns False
. That way when piping it into a file the only line visible to the user will be a warning about a possible encoding mismatch.tusker diff filename.sql
gets added as an alternative to piping the contents which does respect the configured encoding. That way users with non-UTF-8 terminals can view their diffs on the terminal and still get properly encoded migration files. The README
needs to be changed and remove the >
from all examples. A section about that behavior should also be added to the FAQ.tusker diff --encoding=...
gets added which makes it possible to override the output encoding of the tusker diff
command. This is a convenience feature for users that write pipe chains (possibly spanning multiple hosts, docker environments, etc.) and need a specific encoding regardless of what's set in the terminal. I think a --utf8/--utf-8
switch would be nice to have as a shortcut for the unwiedly --encoding=utf-8
.While we're at it I think we should also add a SET client_encoding = 'UTF8';
to the output when generating diffs via tusker diff filename.sql
or tusker diff --encoding=...
so that the encoding information is present in the generated migration file. A normal call to tusker diff
should not include this line as it could break things like tusker diff | psql
where psql
defaults to the terminal encoding and/or yet another configuration file. So it's probably better not to touch the encoding at all. Not that this is a good way of using Tusker, but I can see people doing similar things and I don't want to break peoples pipelines... :see_no_evil:
Sounds fine! But what about:
--encoding=...
too?tusker diff filename.sql gets added as an alternative to piping the contents which does respect the configured encoding
Tried it - works fine, no pain at all.
Are you going to create a new PR with those changes?
UTF8 is "go to" encoding these days. So I suggest to specify it directly. Maybe it's better to add command line argument, I dunno if it worth it.