clj-commons / byte-streams

A Rosetta stone for JVM byte representations

Fix issue #30 Unicode decoding in conversion to CharSequence #32

Closed gsnewmark closed 7 years ago

gsnewmark commented 7 years ago

Issue #30 affects us too, so I've looked into it a bit. CharsetDecoder's JavaDoc is somewhat vague, but it states that:

In any case, if this method [decode] is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation.

It looks like, on underflow during a decode operation, CharsetDecoder leaves any bytes that do not yet form a complete character in the input buffer, and expects the next decode call to receive those bytes together with the additional ones that complete the character. So I've added merging of the remaining extra-bytes with the new in to the underflow branch of the decoding. This fixes the issue, but I'm not that experienced with byte fiddling, so there may be a more efficient way to do it.
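The carry-over described above can be sketched in plain Java. This is a minimal illustration, not the actual patch: the class `ChunkedDecoder`, the method `decodeChunks`, and the `leftover` array are hypothetical names, and the input is assumed to be UTF-8 delivered in fixed-size chunks. Only standard `java.nio` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ChunkedDecoder {

    // Appends whatever the decoder wrote into `chars` to `out`, then resets `chars`.
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    // Decodes UTF-8 `data` as if it arrived in chunks of `chunkSize` bytes
    // (hypothetical helper; assumes chunkSize > 0).
    static String decodeChunks(byte[] data, int chunkSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        StringBuilder out = new StringBuilder();
        CharBuffer chars = CharBuffer.allocate(64);
        byte[] leftover = new byte[0]; // bytes of a not-yet-complete character

        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);

            // Merge the bytes left over from the previous call with the new chunk.
            ByteBuffer in = ByteBuffer.allocate(leftover.length + len);
            in.put(leftover);
            in.put(data, off, len);
            in.flip();

            while (true) {
                CoderResult result = decoder.decode(in, chars, false);
                drain(chars, out);
                if (result.isError()) {
                    throw new IllegalArgumentException("Undecodable input: " + result);
                }
                if (result.isUnderflow()) {
                    break; // the decoder needs more input
                }
                // Otherwise overflow: `chars` was full, so decode again.
            }

            // On underflow, unconsumed bytes stay in `in`; keep them for the next round.
            leftover = new byte[in.remaining()];
            in.get(leftover);
        }

        // Signal end of input; bytes still left over now count as malformed.
        CoderResult end = decoder.decode(ByteBuffer.wrap(leftover), chars, true);
        if (end.isError()) {
            throw new IllegalArgumentException("Truncated input: " + end);
        }
        decoder.flush(chars);
        drain(chars, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo, \u4e16\u754c";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 1 splits every multi-byte character across calls.
        System.out.println(decodeChunks(bytes, 1).equals(text));
    }
}
```

Without the `leftover` merge, any character whose bytes straddle a chunk boundary would be lost or reported as malformed. Allocating a fresh buffer per chunk keeps the sketch short; a reusable buffer with `compact()` would avoid the extra copies.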

In case compatibility with Clojure versions before 1.5 is needed, I can remove the use of some-> (the same goes for some?, which was added in Clojure 1.6).

A test can be found in pull request #31.

ztellman commented 7 years ago

Thank you, I've been traveling and hadn't been able to look at this. I'll merge this, and make any performance tweaks myself.

gsnewmark commented 7 years ago

Thanks!