HertzDevil opened 1 year ago
I suppose another option could be to implement a UTF-16 encoder (and decoder) in Crystal? That would be more idiomatic, using the existing API for this purpose.
I was only thinking of encoding here. Supporting UTF-16 as a non-iconv encoding in `IO` directly would mean also implementing the decoder.

In fact we could aim for an even more fundamental API than `IO`:
```crystal
struct Char
  def each_utf16_code_unit(& : UInt16 ->)
  end
end

class String
  def each_utf16_code_unit(& : UInt16 ->)
    each_char &.each_utf16_code_unit { |unit| yield unit }
  end

  def to_utf16 : Slice(UInt16)
    # ...
    slice = Slice(UInt16).new(u16_size + 1)
    appender = slice.to_unsafe.appender
    each_utf16_code_unit { |unit| appender << unit }
    appender << 0
    slice[0, u16_size]
  end
end
```
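For illustration, one possible body for `Char#each_utf16_code_unit` — a sketch assuming the standard UTF-16 encoding rules (BMP code points map to a single unit, everything above U+FFFF to a surrogate pair); handling of invalid scalars is left out:

```crystal
struct Char
  # Yields this character's UTF-16 code units: one unit for code points
  # in the Basic Multilingual Plane, a surrogate pair otherwise.
  def each_utf16_code_unit(& : UInt16 ->) : Nil
    cp = ord
    if cp < 0x10000
      yield cp.to_u16!
    else
      cp -= 0x10000
      yield (0xD800 + (cp >> 10)).to_u16!  # high surrogate
      yield (0xDC00 + (cp & 0x3FF)).to_u16! # low surrogate
    end
  end
end

units = [] of UInt16
'😀'.each_utf16_code_unit { |unit| units << unit }
p units # => [0xD83D, 0xDE00] as UInt16 values
```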
```crystal
module Crystal::System::Env
  def self.make_env_block(env : Enumerable({String, String}))
    io = IO::Memory.new
    env.each do |(key, value)|
      check_valid_key(key)
      parts = {
        key.check_no_null_byte("key"),
        "=",
        value.check_no_null_byte("value"),
        "\0",
      }
      # system endianness, since Win32 env blocks use native byte order
      parts.each &.each_utf16_code_unit &.to_io(io, IO::ByteFormat::SystemEndian)
    end
    '\0'.each_utf16_code_unit &.to_io(io, IO::ByteFormat::SystemEndian)
    io.to_slice.to_unsafe
  end
end
```
the same way `Char#each_byte` and `String#each_byte` work as UTF-8, or `Char#to_i32` and `String#each_char` work as UTF-32.
The only way to write a `Char` or `String`'s UTF-16 code units to an unencoded `IO::Memory` is to call `String#to_utf16` first, which produces a `Slice(UInt16)`, then write the whole `Slice` via `#to_unsafe_bytes` or write each code unit individually. There is this example in the standard library: https://github.com/crystal-lang/crystal/blob/ea92174624fb28bd1870674b28f326bdb4b60d98/src/crystal/system/win32/env.cr#L85-L95
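For reference, the current two-step approach looks like this (these are existing stdlib calls; the sample string is arbitrary):

```crystal
io = IO::Memory.new
units = "héllo".to_utf16        # allocates an intermediate Slice(UInt16)
io.write(units.to_unsafe_bytes) # reinterprets the slice as raw bytes
puts io.size # => 10 (5 characters × 2 bytes per code unit)
```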
The intermediate allocation is really unnecessary, because we can easily convert each individual character to UTF-16 before moving on to the next one. So I wonder if we could have `String#to_utf16(IO)` on top of the existing overload.

The unencoded part is important, because this function must work on Windows even when libiconv is unavailable; this API continues the tradition of providing UTF-16 functionality in the standard library that doesn't depend on libiconv. Consequently, `Char#to_utf16` here should call `UInt16#to_io` with the given endianness, which bypasses the target `IO`'s encoding altogether.
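A sketch of what such a streaming overload could look like. The surrogate math is inlined here so the block is self-contained rather than relying on the proposed `Char#each_utf16_code_unit`; the `format` parameter name and its `SystemEndian` default are my assumptions, not part of the proposal:

```crystal
class String
  # Streams this string's UTF-16 code units straight into *io*, without
  # an intermediate Slice(UInt16). Each unit is written as raw bytes via
  # UInt16#to_io in the given byte format, bypassing the IO's encoding.
  def to_utf16(io : IO, format : IO::ByteFormat = IO::ByteFormat::SystemEndian) : Nil
    each_char do |char|
      cp = char.ord
      if cp < 0x10000
        cp.to_u16!.to_io(io, format)
      else
        cp -= 0x10000
        (0xD800 + (cp >> 10)).to_u16!.to_io(io, format)  # high surrogate
        (0xDC00 + (cp & 0x3FF)).to_u16!.to_io(io, format) # low surrogate
      end
    end
  end
end

io = IO::Memory.new
"a😀".to_utf16(io) # one BMP unit plus one surrogate pair: 6 bytes total
```

With the system-endian default, the bytes written match what `to_utf16.to_unsafe_bytes` produces today, so callers could switch without changing output.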