`File.write` does not always respect the encoding arguments

crystal-lang / crystal

The Crystal Programming Language

https://crystal-lang.org

Apache License 2.0

19.34k stars 1.61k forks source link

`File.write` does not always respect the encoding arguments #11831

Open HertzDevil opened 2 years ago

HertzDevil commented 2 years ago

Originally from https://github.com/crystal-lang/crystal/issues/11586#issuecomment-1036232834, File.write ignores the encoding arguments if content is a Bytes:

File.write("foo", Bytes[0x61, 0x62, 0x63], encoding: "UTF-16LE")
`xxd foo` # => 00000000: 6162 63                                  abc

This happens because the write is done using IO#write, not #write_string.

But it actually happens with IO as well:

io = IO::Memory.new("abc")
File.write("foo", io, encoding: "UTF-16LE")
`xxd foo` # => 00000000: 6162 63                                  abc

File.write("bar", "abc") # okay, writes UTF-8
File.open("bar", "r") do |f|
  File.write("foo", f, encoding: "UTF-16LE")
end
`xxd foo` # => 00000000: 6162 63                                  abc

The write is done using IO.copy, which is a binary operation and ignores encodings. A solution for this case may benefit from #11018.

What normally happens is:

File.write("foo", "abc", encoding: "UTF-16LE")
`xxd foo` # => 00000000: 6100 6200 6300                           a.b.c.

asterite commented 2 years ago

Should it happen with every object? For example if you send an integer, should it be first converted to a string, then encoded using the given encoding? Or this only works for bytes?

HertzDevil commented 2 years ago

It already works for every object except Bytes and IO, provided that object's #to_s uses the correct methods (#11011). In particular it works if content is a String.

asterite commented 2 years ago

Oh, I see. I just looked at the code:

https://github.com/crystal-lang/crystal/blob/e3ffc77b0b1124a8a51570ca7143c4b52275f567/src/file.cr#L715-L716

That should be write_string as you suggest.

straight-shoota commented 2 years ago

Yeah and for IO.copy we would probably have to check that content.encoding == file.encoding. I believe this could be implemented in IO.copy itself. It should be useful for other scenarios and appears to be the right thing to do. When you copy from one IO to another and both have different encodings, there must be a transformation.

HertzDevil commented 2 years ago

If I convert an odd number of bytes from UTF-16LE to UTF-16LE, that byte sequence is guaranteed to be invalid. Is it really a no-op? (IO#write_string doesn't mention this, but it is a no-op for invalid UTF-8 sequences even if the receiver IO's encoding is UTF-8.)

straight-shoota commented 2 years ago

I would say so, yes. The source is already invalid as UTF-16LE, so there's no change in validity.