UTF8 as encoding for input files

QtRoS commented 2 years ago

UTF8 is "go to" encoding these days. So I suggest to specify it directly. Maybe it's better to add command line argument, I dunno if it worth it.

bikeshedder commented 2 years ago

That probably makes sense. I think UTF-8 is a safe default and unless anyone needs a different encoding we don't even need a setting for the file encoding.

I just checked https://www.psycopg.org/docs/usage.html#strings-adaptation and it seams that just reading it as binary would assume the file encoding to be identical to the connection encoding so that wouldn't help either. Right now the code never specifies the client_encoding thus in theory it could even fail when reading files as UTF-8 if there are characters in the file which aren't supported by the client_encoding or vice verse.

It probably makes sense to also call Connection.set_client_encoding as well?

QtRoS commented 2 years ago

Hello, @bikeshedder, thank you for your answer!

Seems that you are digging deeper than me. In my case the only problem is that my schema-files are in utf8, so they can't be simply read with open(...). When I tried to compare schema with DB, tusker failed with the very basic Python problem: can't decode bytes, bla-bla. So I tried to fix it as simply as that. Have no ideas about connection's encoding, this is out of my experience:)

By the way, now I've realized that schema-files can be stored, for example, in UTF with BOM as well and in any other (much more rare) encoding. So maybe I should add option in TOML, for example "schema_file_encoding"? Defaults to utf8. It's much more versatile.

QtRoS commented 2 years ago

@bikeshedder please check my latest commit too. I think it's now handy enough - you can simply set encoding in config file. Both for schema and migrations.

bikeshedder commented 2 years ago

I'm slightly hesitant about merging this feature with the added configuration option. It has one implication that we haven't thought about, yet.

What if you're using a shell that's not UTF-8? The current code just does a print(sql, end=''). In my own tests Python 3 nowadays seams to always assume UTF-8 shells so anything but UTF-8 for schema and migration files doesn't seam very useful.

I guess until someone else complains and wants to use schema or migration files with an encoding other than UTF-8 we should just hard code it to UTF-8.

QtRoS commented 2 years ago

I jsut wanted to implement it in a bit more flexible way. Have no complaints about hardcoding it to UTF8. If you ok with that, I will revert commit (git still has it history for possible future usage).

bikeshedder commented 2 years ago

I just checked the documentation of https://docs.python.org/3/library/functions.html#open and after thinking about it the configuration option makes sense. Though it should also affect the terminal encoding. Otherwise the tusker diff command is bound to fail.

bikeshedder commented 2 years ago

On my system locale.getpreferredencoding(False) returns UTF-8 so I never ran into this issue.

I wonder what you're getting. What happens when you run tusker diff (with your patch applied) and pipe that result into a file. If I'm not mistaken you should end up with a file encoded with your system locale and not UTF-8.

bikeshedder commented 2 years ago

I did experiment with this a bit. I've change my terminal to ISO-8859-1 (Latin1) encoding and ended up with the following situation:

The file is read as UTF-8, the migrations are read as UTF-8 but the output is written as Latin1.

The problem is the sys.stdout which does report it being iso8859-1 encoding:

>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='iso8859-1'>

I think a single encoding parameter that changes the encoding of all files involved and stdout would be way better than having three separate values such as schema encoding, migrations encoding and stdout encoding, no?

QtRoS commented 2 years ago

I wonder what you're getting

cp1251 actually

What happens when you run tusker diff (with your patch applied) and pipe that result into a file

You are right, I've got cp1251 output as well.

I think a single encoding parameter that changes the encoding of all files involved and stdout would be way better

Not sure about that. You know, there is such thing as legacy, one can have source files in cp1251 and system locale UTF8 for some reason. And I am not sure if it is good practice to change terminal's encoding - not seen this in projects so far. So I would rather stop at source files encoding, maybe with global option (not for both schema and migrations).

bikeshedder commented 2 years ago

tusker diff is meant to be viewed and then piped into a file. I guess that's where the trouble started.

After some hence and forth I came to the conclusion that I like the following behavior best:

tusker diff always uses the terminal encoding for its output. Regardless if it's a TTY or not. That way it doesn't break chains of pipes like tusker diff | less or tusker diff | tee .... Users don't like surprises when working with command line tools and that one would be especially frustrating when run into. This variant of command will be changed to print an error message to stderr if the terminal encoding doesn't match the schema/migration encoding and stdout.isatty() returns False. That way when piping it into a file the only line visible to the user will be a warning about a possible encoding mismatch.
tusker diff filename.sql gets added as an alternative to piping the contents which does respect the configured encoding. That way users with non-UTF-8 terminals can view their diffs on the terminal and still get properly encoded migration files. The README needs to be changed and remove the > from all examples. A section about that behavior should also be added to the FAQ.
tusker diff --encoding=... gets added which makes it possible to override the output encoding of the tusker diff command. This is a convenience feature for users that write pipe chains (possibly spanning multiple hosts, docker environments, etc.) and need a specific encoding regardless of what's set in the terminal. I think a --utf8/--utf-8 switch would be nice to have as a shortcut for the unwiedly --encoding=utf-8.

While we're at it I think we should also add a SET client_encoding = 'UTF8'; to the output when generating diffs via tusker diff filename.sql or tusker diff --encoding=... so that the encoding information is present in the generated migration file. A normal call to tusker diff should not include this line as it could break things like tusker diff | psql where psql defaults to the terminal encoding and/or yet another configuration file. So it's probably better not to touch the encoding at all. Not that this is a good way of using Tusker, but I can see people doing similar things and I don't want to break peoples pipelines... :see_no_evil:

QtRoS commented 2 years ago

Sounds fine! But what about:

Encoding of input schema/migration files? Should we hardcode it to UTF8 or bind it to --encoding=... too?
These changes kinda complex and should be implemented by your design. Who do you think sould implement it? I am ok if you do it yourself. I can help if you want.

QtRoS commented 2 years ago

tusker diff filename.sql gets added as an alternative to piping the contents which does respect the configured encoding

Tried it - works fine, no pain at all.

bikeshedder commented 2 years ago

Are you going to create a new PR with those changes?

bikeshedder / tusker

UTF8 as encoding for input files #19