
Writing UTF-16 to an `IO::Memory` directly #13639

HertzDevil opened this issue 1 year ago (status: Open)

HertzDevil commented 1 year ago

The only way to write a Char's or String's UTF-16 code units to an unencoded IO::Memory is to call String#to_utf16 first, which produces an intermediate Slice(UInt16), and then either write the whole Slice via #to_unsafe_bytes or write each code unit individually. There is this example in the standard library:

https://github.com/crystal-lang/crystal/blob/ea92174624fb28bd1870674b28f326bdb4b60d98/src/crystal/system/win32/env.cr#L85-L95
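Boiled down, that pattern looks like this (a simplified illustration of the status quo, not the linked code):

io = IO::Memory.new
utf16 = "héllo".to_utf16        # intermediate Slice(UInt16) allocation
io.write(utf16.to_unsafe_bytes) # reinterpret the slice as bytes and write them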

The intermediate allocation is really unnecessary because we can easily convert each individual character to UTF-16 before moving on to the next one. So I wonder if we could have String#to_utf16(IO) on top of the existing overload:

struct Char
  def to_utf16(io : IO, format : IO::ByteFormat = IO::ByteFormat::SystemEndian) : Nil
    # ...
  end
end

class String
  def to_utf16(io : IO, format : IO::ByteFormat = IO::ByteFormat::SystemEndian) : Nil
    each_char &.to_utf16(io, format)
  end
end

module Crystal::System::Env
  def self.make_env_block(env : Enumerable({String, String}))
    io = IO::Memory.new
    env.each do |(key, value)|
      check_valid_key(key)
      key.check_no_null_byte("key").to_utf16(io)
      '='.to_utf16(io)
      value.check_no_null_byte("value").to_utf16(io)
      '\0'.to_utf16(io)
    end
    '\0'.to_utf16(io)
    io.to_slice.to_unsafe
  end
end

The unencoded part is important, because this function must work on Windows even when libiconv is unavailable; this API would continue the tradition of providing UTF-16 functionality in the standard library without depending on libiconv. Consequently, Char#to_utf16 here should call UInt16#to_io with the given endianness, which bypasses the target IO's encoding altogether.
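For illustration, that Char#to_utf16 overload could look roughly like this; the surrogate-pair arithmetic is my sketch, not part of the proposal (lone surrogates, which String#to_utf16 replaces with U+FFFD, are ignored here):

struct Char
  # Sketch only: write this character's UTF-16 code unit(s) to *io* via
  # UInt16#to_io, bypassing the IO's encoding as described above.
  def to_utf16(io : IO, format : IO::ByteFormat = IO::ByteFormat::SystemEndian) : Nil
    if ord <= 0xFFFF
      # BMP codepoint: a single code unit
      ord.to_u16.to_io(io, format)
    else
      # codepoints above U+FFFF become a surrogate pair
      code = ord - 0x10000
      (0xD800 + (code >> 10)).to_u16.to_io(io, format)
      (0xDC00 + (code & 0x3FF)).to_u16.to_io(io, format)
    end
  end
end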

straight-shoota commented 1 year ago

I suppose another option could be to implement a UTF-16 encoder (and decoder) in Crystal? This would be more idiomatic, as it would go through the existing API for this purpose.
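For reference, the existing API referred to here is presumably IO#set_encoding, which is currently backed by libiconv; a native UTF-16 encoder would sit behind the same interface (usage shown purely for illustration):

io = IO::Memory.new
io.set_encoding("UTF-16LE") # today this goes through libiconv
io << "héllo"               # written as UTF-16LE code units
bytes = io.to_slice         # raw UTF-16LE bytes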

HertzDevil commented 1 year ago

I was only thinking of encoding here. Supporting UTF-16 as a non-iconv encoding in IO directly would mean also implementing the decoder.
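To spell out what the decoding half involves: surrogate pairs have to be recombined into codepoints, roughly like this (a hypothetical helper, purely illustrative):

# Hypothetical helper, not existing API: combine a UTF-16 surrogate pair
# back into the Char it encodes.
def decode_utf16_pair(high : UInt16, low : UInt16) : Char
  (0x10000 + ((high.to_i32 - 0xD800) << 10) + (low.to_i32 - 0xDC00)).chr
end

decode_utf16_pair(0xD801_u16, 0xDC37_u16) # => '𐐷' (U+10437)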

In fact we could aim for an even more fundamental API than IO:

struct Char
  def each_utf16_code_unit(& : UInt16 ->)
  end
end

class String
  def each_utf16_code_unit(& : UInt16 ->)
    each_char &.each_utf16_code_unit { |unit| yield unit }
  end

  def to_utf16 : Slice(UInt16)
    # ...
    slice = Slice(UInt16).new(u16_size + 1)
    appender = slice.to_unsafe.appender
    each_utf16_code_unit { |unit| appender << unit }
    appender << 0
    slice[0, u16_size]
  end
end

module Crystal::System::Env
  def self.make_env_block(env : Enumerable({String, String}))
    io = IO::Memory.new
    env.each do |(key, value)|
      check_valid_key(key)
      parts = {
        key.check_no_null_byte("key"),
        "=",
        value.check_no_null_byte("value"),
        "\0",
      }
      parts.each &.each_utf16_code_unit &.to_io(io, IO::ByteFormat::SystemEndian)
    end
    '\0'.each_utf16_code_unit &.to_io(io, IO::ByteFormat::SystemEndian)
    io.to_slice.to_unsafe
  end
end

This would work the same way Char#each_byte and String#each_byte work as UTF-8, or Char#to_i32 and String#each_char work as UTF-32.
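To make the parallel concrete (the each_utf16_code_unit calls are the proposed API from above, not something that exists today):

s = "a€𐐷"
s.each_byte { |b| p b }            # UTF-8 code units (existing)
s.each_char { |c| p c.ord }        # codepoints, i.e. UTF-32 (existing)
s.each_utf16_code_unit { |u| p u } # UTF-16 code units (proposed)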