FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
316 stars 136 forks source link

Long strings encoded using streaming form #166

Open clstrfsck opened 5 years ago

clstrfsck commented 5 years ago

In CBORGenerator.java, there is a limit to the length of the string that can be output before the generator will use the indefinite length encoding. The reason behind this appears to be the complexity of determining the length of the encoded byte stream when it does not fit into an internal buffer, given that the length prefix needs to be output before the data.

For now, we have changed the implementation of _writeChunkedString(...) to the following:

    protected final void writeLongString(char[] text, int offset, int len) throws IOException {
        CharBuffer charBuffer = CharBuffer.wrap(text, offset, len);
        ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(charBuffer);
        int byteLen = byteBuffer.remaining();
        _writeLengthMarker(PREFIX_TYPE_TEXT, byteLen);
        _writeBytes(byteBuffer.array(), byteBuffer.position(), byteLen);
    }

It should also be possible to use the existing UTF-8 encoder to write the encoding out in pieces with the addition of a int encodedLength(...) method to precompute and output the length.

Also, thanks for the awesome library!

cowtowncoder commented 5 years ago

@mpsandiford Just to make sure I understand this correctly: I think you are suggesting addition of a CBORGenerator.Feature that would force use of length-prefix for all Strings. Is that correct? That sounds reasonable, I just want to make sure I understand this correctly.

clstrfsck commented 5 years ago

@cowtowncoder — I hadn't thought of it that way, but yes, that sounds like a great way to do it. That way there is an option for "always length prefix" or the more performant "indefinite" encoding.

cowtowncoder commented 5 years ago

Ok cool. As I said, it may take a while to get this done if I do it (would be in 2.10.0 regardless, being new feature) just due to piling up of issues. But I can help with a PR if that is something you could help with? (I try to prioritize helping others above my own work)

clstrfsck commented 5 years ago

Yep, sure, sounds like a good approach. Give me a couple of days and I'll submit a PR with the feature and we can take it from there.