`UCharPointer` is not enough for a single UTF-8 character

erickguan / ffi-icu

FFI wrappers for ICU. MRI extension with the dynamic C library.

https://github.com/erickguan/ffi-icu

MIT License


`UCharPointer` is not enough for a single UTF-8 character #28

Open erickguan opened 9 years ago

erickguan commented 9 years ago

`UCharPointer` points to an array of `uint16_t`, and a single 16-bit element is generally not enough for a single UTF-8 character, since the code point range is [0..0x10FFFF].

After `unpack('U*')` and writing the values, the higher bits simply vanish.

`unpack('S*')` doesn't help either: the array can't be packed again.
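
To make the problem concrete, here is a minimal pure-Ruby sketch (no ICU involved). It shows a code point above 0xFFFF losing its high bits when packed as a single 16-bit value, and one possible way to round-trip it by transcoding to UTF-16 first. The helper names `native_utf16`, `utf16_units` and `from_utf16_units` are made up for illustration and are not part of ffi-icu.

```ruby
# Hypothetical helper: the UTF-16 flavour matching this machine's byte order,
# since a uint16_t buffer holds native-endian 16-bit units.
def native_utf16
  [1].pack('S') == [1].pack('S<') ? Encoding::UTF_16LE : Encoding::UTF_16BE
end

# Hypothetical helper: UTF-8 string -> array of 16-bit UTF-16 code units.
def utf16_units(str)
  str.encode(native_utf16).unpack('S*')
end

# Hypothetical helper: array of 16-bit code units -> UTF-8 string.
def from_utf16_units(units)
  units.pack('S*').force_encoding(native_utf16).encode(Encoding::UTF_8)
end

emoji = "\u{1F600}"                  # U+1F600, outside the 16-bit range

code_points = emoji.unpack('U*')     # [0x1F600]: one value, too big for uint16_t
truncated   = code_points.pack('S*').unpack('S*')  # [0xF600]: high bits are gone

units      = utf16_units(emoji)      # [0xD83D, 0xDE00]: a surrogate pair
round_trip = from_utf16_units(units) # back to the original string
```

The point of the sketch is only that the 16-bit array has to hold UTF-16 code units (two of them for this character), not raw code points.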

tyler-nguyen commented 8 years ago

`UCharPointer` should use `u_strFromUTF8` or `u_strFromUTF8WithSub` to convert to a `UChar` string.

http://userguide.icu-project.org/strings/utf-8

erickguan commented 8 years ago

@tyler-nguyen Unfortunately, it's not easy to do so with ffi code. I feel like C code is required in this case.

tyler-nguyen commented 8 years ago

@fantasticfears You can add it with FFI like this:

    # U_CAPI UChar* U_EXPORT2 u_strFromUTF8(UChar      *dest,
    #                                       int32_t     destCapacity,
    #                                       int32_t    *pDestLength,
    #                                       const char *src,
    #                                       int32_t     srcLength,
    #                                       UErrorCode *pErrorCode)
    attach_function :u_strFromUTF8, "u_strFromUTF8#{suffix}",
                    [:pointer, :int32_t, :pointer, :string, :int32_t, :pointer], :pointer

For `srcLength`, use `bytesize` instead of Ruby's `String#length`.
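
A small sketch of why that matters, in plain Ruby and without calling ICU: for non-ASCII text the character count and the byte count differ, and `srcLength` is a byte count. The variable names are only for illustration.

```ruby
text = "h\u00E9llo \u{1F600}"   # "héllo" plus one emoji

char_count = text.length        # 7 characters
byte_count = text.bytesize      # 11 bytes: the value srcLength expects

# UTF-16 never needs more 16-bit units than the UTF-8 form has bytes,
# so bytesize is also a safe upper bound when sizing the destination.
unit_count = text.encode(Encoding::UTF_16LE).bytesize / 2   # 8 units
```

Passing `char_count` here would cut the source off after 7 bytes, in the middle of the text.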

erickguan commented 8 years ago

Thanks, I wrote some snippets this way earlier. But `u_strFromUTF8` may yield its own errors, which requires extra work on the Ruby end. Also, some bindings are made purely for `UChar`, which makes me wonder whether a C binding would be more direct, given Ruby's Code Set Independent model for strings.
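
For reference, the extra work on the Ruby side could look roughly like the sketch below. It only checks the integer read back from `pErrorCode` after the call and does not touch ICU itself. The `IcuStatus` module and its `check!` method are hypothetical names for this illustration, not ffi-icu's API; the numeric values are the ones listed in ICU's `utypes.h`.

```ruby
# Hypothetical module, for illustration only.
module IcuStatus
  class Error < StandardError; end

  U_ZERO_ERROR = 0

  # A few UErrorCode values from ICU's utypes.h.
  NAMES = {
    10 => 'U_INVALID_CHAR_FOUND',
    15 => 'U_BUFFER_OVERFLOW_ERROR'
  }.freeze

  # Mirrors ICU's U_FAILURE macro: only values above U_ZERO_ERROR are
  # failures; zero is success and negative values are warnings.
  def self.check!(code)
    return code if code <= U_ZERO_ERROR

    raise Error, NAMES.fetch(code, "ICU error code #{code}")
  end
end

IcuStatus.check!(0)      # success, returns 0
IcuStatus.check!(-124)   # a warning value, returned as-is
```

A caller would read the status out of the error pointer after `u_strFromUTF8` returns and pass it to `IcuStatus.check!` before using the destination buffer.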