apache / fury

A blazingly fast multi-language serialization framework powered by JIT and zero-copy.
https://fury.apache.org/
Apache License 2.0

[Java] implement fast utf16 to utf8 conversion #1754

Open chaokunyang opened 1 month ago

chaokunyang commented 1 month ago

Is your feature request related to a problem? Please describe.

Currently Fury uses java.lang.StringCoding#encode(java.nio.charset.Charset, char[], int, int) to convert UTF-16 to UTF-8.

    static byte[] encode(Charset cs, char[] ca, int off, int len) {
        CharsetEncoder ce = cs.newEncoder();
        int en = scale(len, ce.maxBytesPerChar());
        byte[] ba = new byte[en];
        if (len == 0)
            return ba;
        boolean isTrusted = false;
        if (System.getSecurityManager() != null) {
            if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
                ca =  Arrays.copyOfRange(ca, off, off + len);
                off = 0;
            }
        }
        ce.onMalformedInput(CodingErrorAction.REPLACE)
          .onUnmappableCharacter(CodingErrorAction.REPLACE)
          .reset();
        if (ce instanceof ArrayEncoder) {
            int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
            return safeTrim(ba, blen, cs, isTrusted);
        } else {
            ByteBuffer bb = ByteBuffer.wrap(ba);
            CharBuffer cb = CharBuffer.wrap(ca, off, len);
            try {
                CoderResult cr = ce.encode(cb, bb, true);
                if (!cr.isUnderflow())
                    cr.throwException();
                cr = ce.flush(bb);
                if (!cr.isUnderflow())
                    cr.throwException();
            } catch (CharacterCodingException x) {
                throw new Error(x);
            }
            return safeTrim(ba, bb.position(), cs, isTrusted);
        }
    }

This invokes sun.nio.cs.UTF_8.Encoder#encode:

        public int encode(char[] sa, int sp, int len, byte[] da) {
            int sl = sp + len;
            int dp = 0;
            int dlASCII = dp + Math.min(len, da.length);

            // ASCII only optimized loop
            while (dp < dlASCII && sa[sp] < '\u0080')
                da[dp++] = (byte) sa[sp++];

            while (sp < sl) {
                char c = sa[sp++];
                if (c < 0x80) {
                    // Have at most seven bits
                    da[dp++] = (byte)c;
                } else if (c < 0x800) {
                    // 2 bytes, 11 bits
                    da[dp++] = (byte)(0xc0 | (c >> 6));
                    da[dp++] = (byte)(0x80 | (c & 0x3f));
                } else if (Character.isSurrogate(c)) {
                    if (sgp == null)
                        sgp = new Surrogate.Parser();
                    int uc = sgp.parse(c, sa, sp - 1, sl);
                    if (uc < 0) {
                        if (malformedInputAction() != CodingErrorAction.REPLACE)
                            return -1;
                        da[dp++] = repl;
                    } else {
                        da[dp++] = (byte)(0xf0 | ((uc >> 18)));
                        da[dp++] = (byte)(0x80 | ((uc >> 12) & 0x3f));
                        da[dp++] = (byte)(0x80 | ((uc >>  6) & 0x3f));
                        da[dp++] = (byte)(0x80 | (uc & 0x3f));
                        sp++;  // 2 chars
                    }
                } else {
                    // 3 bytes, 16 bits
                    da[dp++] = (byte)(0xe0 | ((c >> 12)));
                    da[dp++] = (byte)(0x80 | ((c >>  6) & 0x3f));
                    da[dp++] = (byte)(0x80 | (c & 0x3f));
                }
            }
            return dp;
        }

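This implementation is not efficient enough: after the initial ASCII-only loop stops at the first non-ASCII char, every remaining char goes through the per-char branch chain. We need a faster one.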
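As a rough illustration of one possible direction (not Fury's actual implementation), the sketch below re-enters an ASCII fast path after every non-ASCII char and tests four chars with a single branch. The class name `Utf16ToUtf8Sketch` and its `encode` method are hypothetical and exist only for this example. It also assumes that unpaired surrogates are replaced with `?`, and that the worst-case `len * 3` buffer does not overflow `int`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not part of Fury or the JDK.
final class Utf16ToUtf8Sketch {
  private Utf16ToUtf8Sketch() {}

  /** Encodes src[off, off + len) as UTF-8. Unpaired surrogates become '?'. */
  static byte[] encode(char[] src, int off, int len) {
    // Worst case is 3 bytes per char (a surrogate pair is 2 chars -> 4 bytes).
    // Assumption: len is small enough that len * 3 does not overflow.
    byte[] dst = new byte[len * 3];
    int sp = off;
    int end = off + len;
    int dp = 0;
    while (sp < end) {
      // Fast path: check four chars at once; all are ASCII if no bit >= 0x80 is set.
      while (sp + 4 <= end) {
        char c0 = src[sp], c1 = src[sp + 1], c2 = src[sp + 2], c3 = src[sp + 3];
        if (((c0 | c1 | c2 | c3) & 0xFF80) != 0) {
          break;
        }
        dst[dp] = (byte) c0;
        dst[dp + 1] = (byte) c1;
        dst[dp + 2] = (byte) c2;
        dst[dp + 3] = (byte) c3;
        dp += 4;
        sp += 4;
      }
      if (sp >= end) {
        break;
      }
      // Slow path: encode one char (or one surrogate pair), then retry the fast path.
      char c = src[sp++];
      if (c < 0x80) {
        dst[dp++] = (byte) c;
      } else if (c < 0x800) {
        dst[dp++] = (byte) (0xc0 | (c >> 6));
        dst[dp++] = (byte) (0x80 | (c & 0x3f));
      } else if (Character.isSurrogate(c)) {
        if (Character.isHighSurrogate(c) && sp < end && Character.isLowSurrogate(src[sp])) {
          int cp = Character.toCodePoint(c, src[sp++]);
          dst[dp++] = (byte) (0xf0 | (cp >> 18));
          dst[dp++] = (byte) (0x80 | ((cp >> 12) & 0x3f));
          dst[dp++] = (byte) (0x80 | ((cp >> 6) & 0x3f));
          dst[dp++] = (byte) (0x80 | (cp & 0x3f));
        } else {
          dst[dp++] = (byte) '?'; // assumed replacement for malformed input
        }
      } else {
        dst[dp++] = (byte) (0xe0 | (c >> 12));
        dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
        dst[dp++] = (byte) (0x80 | (c & 0x3f));
      }
    }
    return Arrays.copyOf(dst, dp);
  }

  public static void main(String[] args) {
    String s = "hello, h\u00e9llo, \u4e16\u754c, \uD83D\uDE00";
    byte[] expected = s.getBytes(StandardCharsets.UTF_8);
    byte[] actual = encode(s.toCharArray(), 0, s.length());
    System.out.println(Arrays.equals(expected, actual));
  }
}
```

Whether this actually beats the JDK path would need to be measured with a benchmark. A real implementation would probably also write straight into the serialization buffer rather than allocating and trimming a temporary array.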

Describe the solution you'd like

Additional context

manojks1999 commented 1 month ago

@chaokunyang, can I work on this?

chaokunyang commented 1 month ago

@chaokunyang, can I work on this?

Of course, feel free to take it over.

FormerKinG commented 3 weeks ago

Can I work on this issue? If yes, I need the class name. I'm new to contributing and looking forward to working on Fury.

chaokunyang commented 2 weeks ago

Hi @FormerKinG, thanks for your willingness to contribute to Apache Fury. You can take org.apache.fury.serializer.StringSerializer as the starting point.