crystal-lang / crystal


Native text encoding conversions #15001

Open HertzDevil opened 1 month ago

HertzDevil commented 1 month ago

Crystal currently relies on iconv or GNU libiconv for conversions between text encodings. This has a few problems:

The essence of, for example, UTF-16 to UTF-8 conversion can be implemented on top of iconv's function signature as:

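def iconv_utf16_to_utf8(in_buffer : UInt8**, in_buffer_left : Int32*, out_buffer : UInt8**, out_buffer_left : Int32*)
  # Reinterpret the remaining input bytes as UTF-16 code units in native byte order
  utf16_slice = in_buffer.value.to_slice(in_buffer_left.value).unsafe_slice_of(UInt16)
  String.each_utf16_char(utf16_slice) do |ch|
    # Code points above the BMP occupy a surrogate pair (4 bytes) in the input
    in_bytesize = ch.ord >= 0x10000 ? 4 : 2
    ch_bytesize = ch.bytesize
    # Stop once the output buffer cannot hold the next character
    break unless out_buffer_left.value >= ch_bytesize

    # Write the character's UTF-8 bytes and advance the output pointer
    ch.each_byte do |b|
      out_buffer.value.value = b
      out_buffer.value += 1
    end

    # Advance the input pointer and update both remaining-byte counters
    in_buffer.value += in_bytesize
    in_buffer_left.value -= in_bytesize
    out_buffer_left.value -= ch_bytesize
  end
end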

str = Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
bytes = uninitialized UInt8[32]

in_buffer = str.to_unsafe
in_buffer_left = str.bytesize
out_buffer = bytes.to_unsafe
out_buffer_left = bytes.size
iconv_utf16_to_utf8(pointerof(in_buffer), pointerof(in_buffer_left), pointerof(out_buffer), pointerof(out_buffer_left))

String.new(bytes.to_slice[0, bytes.size - out_buffer_left]) # => "ab😂c"

Going in the opposite direction would need something like #13639 to be equally concise, but the point is that we could indeed achieve this without using iconv at all. If both the source and destination encodings are among UTF-8, UTF-16, UTF-32, or maybe ASCII, we could use our own native transcoders instead of iconv. If we are ambitious enough, we could even port the entire set of ICU character set mapping tables in an automated manner and remove our dependency on iconv altogether.
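
For illustration only, the opposite direction can be sketched by hand today, just less concisely. The names `put_utf16le` and `iconv_utf8_to_utf16` below are hypothetical and not part of the standard library. The sketch assumes well-formed, complete UTF-8 input, always emits little-endian UTF-16, and copies the input into a `String` for simplicity, which a real transcoder would likely avoid:

```crystal
# Hypothetical helper: writes one UTF-16 code unit in little-endian order
# and advances the output pointer by two bytes.
def put_utf16le(out_buffer : UInt8**, unit : Int32)
  out_buffer.value[0] = (unit & 0xFF).to_u8
  out_buffer.value[1] = (unit >> 8).to_u8
  out_buffer.value += 2
end

# Hypothetical counterpart to iconv_utf16_to_utf8, on the same signature.
# Assumes the input is well-formed UTF-8 that does not end mid-sequence.
def iconv_utf8_to_utf16(in_buffer : UInt8**, in_buffer_left : Int32*, out_buffer : UInt8**, out_buffer_left : Int32*)
  # Copies the input bytes; acceptable for a sketch only
  utf8 = String.new(in_buffer.value.to_slice(in_buffer_left.value))
  utf8.each_char do |ch|
    codepoint = ch.ord
    in_bytesize = ch.bytesize
    # Code points above the BMP need a surrogate pair (two code units)
    out_bytesize = codepoint >= 0x10000 ? 4 : 2
    break unless out_buffer_left.value >= out_bytesize

    if codepoint >= 0x10000
      v = codepoint - 0x10000
      put_utf16le(out_buffer, 0xD800 + (v >> 10))   # high surrogate
      put_utf16le(out_buffer, 0xDC00 + (v & 0x3FF)) # low surrogate
    else
      put_utf16le(out_buffer, codepoint)
    end

    in_buffer.value += in_bytesize
    in_buffer_left.value -= in_bytesize
    out_buffer_left.value -= out_bytesize
  end
end

str = "ab😂c"
bytes = uninitialized UInt8[32]

in_buffer = str.to_unsafe
in_buffer_left = str.bytesize
out_buffer = bytes.to_unsafe
out_buffer_left = bytes.size
iconv_utf8_to_utf16(pointerof(in_buffer), pointerof(in_buffer_left), pointerof(out_buffer), pointerof(out_buffer_left))

result = bytes.to_slice[0, bytes.size - out_buffer_left]
# => Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
```

The output matches the UTF-16 input bytes used in the first example, so the two sketches round-trip for this string.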

ysbaddaden commented 1 month ago

A pure Crystal implementation would be lovely. For the sake of argument, are there alternatives to libiconv?

HertzDevil commented 1 month ago
ysbaddaden commented 1 month ago

Thank you 🙇

ysbaddaden commented 1 month ago

The W3C Encoding Standard already sets the bar quite high, but seems to support a good list of general encodings 👍

There's a part 2 to the comparison article that focuses on C and presents ztd.cuneicode. I'm not saying we should use it, but it sounds like a solid reference, and both articles are a treasure trove of information.