apache / fury

A blazingly fast multi-language serialization framework powered by JIT and zero-copy.
https://fury.apache.org/
Apache License 2.0

[Go] Support convert utf16 encoded string to utf8 string #1545

Open chaokunyang opened 6 months ago

chaokunyang commented 6 months ago

Is your feature request related to a problem? Please describe.

Currently, Fury xlang serialization uses utf8 for string encoding, which is inefficient in languages whose native strings are not utf8 encoded.

We introduced utf16 in https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#string . But Go strings are conventionally utf8 rather than utf16, so fury go should transcode utf16 encoded strings to utf8 strings during deserialization.

Describe the solution you'd like

Implement utf16 to utf8 transcoding in fury go. The implementation should use SIMD for speed.
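For reference, a minimal scalar sketch of the transcoding step, using only the Go standard library. It is a baseline and not the SIMD version requested above. The helper name `utf16LEToString` is hypothetical and not part of Fury's API, and little-endian byte order for the payload is an assumption here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// utf16LEToString transcodes UTF-16 bytes into a Go (UTF-8) string.
// Hypothetical helper for illustration; little-endian order is assumed.
func utf16LEToString(data []byte) (string, error) {
	if len(data)%2 != 0 {
		return "", fmt.Errorf("utf16 payload has odd length %d", len(data))
	}
	// Reassemble the 16-bit code units from the raw bytes.
	units := make([]uint16, len(data)/2)
	for i := range units {
		units[i] = binary.LittleEndian.Uint16(data[2*i:])
	}
	// utf16.Decode combines surrogate pairs into single runes and maps
	// unpaired surrogates to U+FFFD; string(...) then encodes as UTF-8.
	return string(utf16.Decode(units)), nil
}

func main() {
	// 'h', 'é' (U+00E9), then U+1F600 as the surrogate pair D83D DE00.
	data := []byte{0x68, 0x00, 0xE9, 0x00, 0x3D, 0xD8, 0x00, 0xDE}
	s, err := utf16LEToString(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(s, len(s)) // len counts UTF-8 bytes, not characters
}
```

A SIMD implementation would have to produce the same output as this baseline, so it can double as a correctness oracle in tests.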

Additional context

#1413

LiangliangSui commented 6 months ago

Hi @chaokunyang, have you started implementing this feature? If not, I can take over and implement it.

chaokunyang commented 6 months ago

@LiangliangSui I haven't, feel free to take it over.

LiangliangSui commented 6 months ago

Okay, I will do this.

LiangliangSui commented 6 months ago

@chaokunyang We currently use UTF8 for cross-language serialization, and only Java (not cross-language) uses Latin1/UTF16.

  public void writeString(MemoryBuffer buffer, String value) {
    if (isJava) {
      writeJavaString(buffer, value);
    } else {
      writeUTF8String(buffer, value);
    }
  }

Will we use UTF16 as the default cross-language String encoding in the future?

I see that the cross-language format currently designed in fury_xlang_serialization_spec still uses UTF8 as the default.

chaokunyang commented 6 months ago

It depends on the language and the string. Go strings are already utf-8 encoded, so fury go will write the data as a utf8 string with a plain copy. But java/javascript/python may encode a string as latin1 or utf16 and send it to furygo, so we need to support utf16 too. And depending on the peer language, we may configure furygo to use latin1/utf16 by default too.
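A rough sketch of what the read side could look like once a peer may send any of the three encodings. The `stringEncoding` type, its constant values, and `decodeString` are hypothetical names for illustration; the actual tag values and header layout are defined by the xlang spec and are not reproduced here. Little-endian utf16 is again an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// stringEncoding is a hypothetical tag; real values come from the xlang spec.
type stringEncoding byte

const (
	encLatin1 stringEncoding = iota
	encUTF16
	encUTF8
)

// decodeString turns an encoded payload into a Go (UTF-8) string.
func decodeString(enc stringEncoding, payload []byte) (string, error) {
	switch enc {
	case encUTF8:
		// Already the native Go encoding: a single copy is enough.
		return string(payload), nil
	case encLatin1:
		// Each latin1 byte is the code point U+0000..U+00FF.
		runes := make([]rune, len(payload))
		for i, b := range payload {
			runes[i] = rune(b)
		}
		return string(runes), nil
	case encUTF16:
		if len(payload)%2 != 0 {
			return "", fmt.Errorf("utf16 payload has odd length %d", len(payload))
		}
		units := make([]uint16, len(payload)/2)
		for i := range units {
			units[i] = binary.LittleEndian.Uint16(payload[2*i:])
		}
		return string(utf16.Decode(units)), nil
	default:
		return "", fmt.Errorf("unknown string encoding %d", enc)
	}
}

func main() {
	s, err := decodeString(encLatin1, []byte{'c', 'a', 'f', 0xE9})
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // the latin1 byte 0xE9 becomes the two-byte UTF-8 form of 'é'
}
```

Only the utf8 branch avoids transcoding, which is why the choice of default encoding depends on which languages sit on each side of the connection.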

LiangliangSui commented 6 months ago

"But java/javascript/python may encode string as latin1 or utf16 and send to furygo."

Latin1/UTF16 is only used in Language.JAVA and will not be sent to furygo.

LiangliangSui commented 6 months ago

Okay, I got it.

chaokunyang commented 6 months ago

In the future, java/javascript/python may all encode strings as latin1/utf16 and send them to furygo.