crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.46k stars 1.62k forks source link

`IO#gets(limit)` could produce incomplete UTF-8 strings #11989

Open HertzDevil opened 2 years ago

HertzDevil commented 2 years ago

If the limit argument is passed to IO#gets, it behaves like a hard limit and the IO always consumes that number of bytes when the delimiter is not found. This means it may stop in the middle of a UTF-8 byte sequence:

str = "ABCD" # => "\xEF\xBC\xA1\xEF\xBC\xA2\xEF\xBC\xA3\xEF\xBC\xA4"
(0..16).each do |i|
  p [i, IO::Memory.new(str).gets(i)]
end
[0, ""]
[1, "\xEF"]
[2, "\xEF\xBC"]
[3, "A"]
[4, "A\xEF"]
[5, "A\xEF\xBC"]
[6, "AB"]
[7, "AB\xEF"]
[8, "AB\xEF\xBC"]
[9, "ABC"]
[10, "ABC\xEF"]
[11, "ABC\xEF\xBC"]
[12, "ABCD"]
[13, "ABCD"]
[14, "ABCD"]
[15, "ABCD"]
[16, "ABCD"]

It also seems to ignore custom encodings, so limit represents the maximum number of bytes returned, not the maximum number of bytes read:

str = "ABCD".encode("UTF-32LE")
(0..16).each do |i|
  io = IO::Memory.new(str)
  io.set_encoding "UTF-32LE"
  p [i, io.gets(i)]
  # same results as above, even though each character now occupies 4 bytes instead of 3
end

Ruby's behavior is to consume additional bytes until a character in the IO's encoding is fully consumed, whether the IO uses UTF-8 or not:

require 'stringio'

str = "ABCD"
(0..16).each do |i|
  p [i, StringIO.new(str).gets(i)]
end

str = "ABCD".encode(Encoding::UTF_32LE)
(0..16).each do |i|
  # here the `StringIO`'s external encoding is automatically set to UTF-32LE
  p [i, StringIO.new(str).gets(i).encode(Encoding::UTF_8)]
end
[0, ""]
[1, "A"]
[2, "A"]
[3, "A"]
[4, "AB"]
[5, "AB"]
[6, "AB"]
[7, "ABC"]
[8, "ABC"]
[9, "ABC"]
[10, "ABCD"]
[11, "ABCD"]
[12, "ABCD"]
[13, "ABCD"]
[14, "ABCD"]
[15, "ABCD"]
[16, "ABCD"]
[0, ""]
[1, "A"]
[2, "A"]
[3, "A"]
[4, "A"]
[5, "AB"]
[6, "AB"]
[7, "AB"]
[8, "AB"]
[9, "ABC"]
[10, "ABC"]
[11, "ABC"]
[12, "ABC"]
[13, "ABCD"]
[14, "ABCD"]
[15, "ABCD"]
[16, "ABCD"]

It does so for successive #gets calls too:

io = StringIO.new "ABCD"
io.gets(7) # => "ABC"
io.gets(1) # => "D"

io = StringIO.new "ABCD".encode(Encoding::UTF_32LE)
io.gets(7).encode(Encoding::UTF_8) # => "AB"
io.gets(1).encode(Encoding::UTF_8) # => "C"

Contrast this with Crystal:

io = IO::Memory.new "ABCD"
io.gets(7) # => "AB\xEF"
io.gets(1) # => "\xBC"

io = IO::Memory.new "ABCD".encode("UTF-32LE")
io.set_encoding "UTF-32LE"
io.gets(7) # => "AB\xEF"
io.gets(1) # => "\xBC"

I think we should follow Ruby's behavior here. Note that we cannot consume less than limit bytes here; not all IOs are seekable, and we don't have something like #unget, so we cannot in general discard bytes forming an incomplete sequence after they are read.

HertzDevil commented 2 years ago

Is it even a good idea to leave it a byte limit? Could we have overloads with character limits here?

asterite commented 2 years ago

Oh, I think this is I bug I probably introduced when we added IO#peek and the related optimization in IO#gets (gets_peek). If you remove that, and also remove the overwritten IO::Memory#gets method, it works fine.

My thoughts: we should exactly like Ruby here. In Ruby limit is bytes, but if you are in the middle of a codepoint you just continue until the end of it. You end up returning more bytes than requested, but a limit is always a soft-limit in my mind.