Bytes are read from stream to String incorrectly

clj-commons / byte-streams

A Rosetta stone for JVM byte representations


Bytes are read from stream to String incorrectly #30

Closed: joelittlejohn closed this issue 6 years ago

joelittlejohn commented 7 years ago

I have a ByteArrayInputStream b that contains Chinese characters in UTF-8. I've found that I get bad data from byte-streams when I do this:

(byte-streams/convert b String)

I get a string in which some characters are corrupt (2 out of a few thousand).

I've found that I get good data when I do this:

(byte-streams/convert (byte-streams/to-byte-array b) String)

I've compared the bytes I get in each of the string results above, and I note that in the first example a tiny handful of bytes (6 out of 17,300) appear to be different to the bytes in the original input stream. In the latter example the bytes are identical to the input (hence no corrupt chars).

What could cause convert to treat an input stream of bytes differently to an array of bytes when converting to a string?

ztellman commented 7 years ago

Well, they use different code paths (the String constructor for byte arrays, and CharsetDecoder for streams), but obviously I'd expect them to be equivalent. Can you provide a failing test case?
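For readers who want to see the two JVM code paths mentioned above side by side, here is a minimal sketch using plain Java interop rather than byte-streams itself. The sample text and the var names are assumptions for illustration. When the whole input is decoded in one call, both paths should agree.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.charset StandardCharsets))

;; Assumed sample input: two Chinese characters (U+4F60, U+597D), UTF-8 encoded.
(def sample-bytes (.getBytes "\u4f60\u597d" StandardCharsets/UTF_8))

;; Path 1: the String constructor that takes a byte array and a charset.
(def via-constructor
  (String. ^bytes sample-bytes StandardCharsets/UTF_8))

;; Path 2: a CharsetDecoder applied to the entire input in a single call.
(def via-decoder
  (-> (.newDecoder StandardCharsets/UTF_8)
      (.decode (ByteBuffer/wrap ^bytes sample-bytes))
      str))

(= via-constructor via-decoder)
```

This only shows that the two APIs are equivalent on complete input. It says nothing yet about what happens when the input is fed to a decoder piece by piece.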

joelittlejohn commented 7 years ago

I'm afraid I'm struggling to create a minimal example here. I have 4k of text that demonstrates the problem, but I can't share it. So far I have failed to minimize it further or to find a good random string that demonstrates the same problem.

Could you list the conversion steps from ByteArrayInputStream to String? I've been step debugging byte-streams, but I'm having a hard time understanding which high-level conversions are made through the conversion graph. Maybe I can take my private example through these steps manually to see where the errors arise.

ztellman commented 7 years ago

The InputStream is turned into a ByteSource [1], then the ByteSource is turned into a CharSequence [2]. Hope that helps, let me know if you have any other questions.

[1] https://github.com/ztellman/byte-streams/blob/master/src/byte_streams.clj#L526
[2] https://github.com/ztellman/byte-streams/blob/master/src/byte_streams/char_sequence.clj#L81

joelittlejohn commented 7 years ago

My hunch is that this happens when a single character spans a chunk-size boundary in byte-streams.char-sequence/lazy-char-buffer-sequence, which causes the decoder to emit a Unicode replacement character.
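The failure mode described in this hunch can be sketched outside byte-streams. The helper below is hypothetical and is not the library's code: it decodes each slice of a byte array on its own, with no state carried between slices, so a multi-byte character split across two slices cannot be reassembled.

```clojure
(import '(java.nio.charset StandardCharsets)
        '(java.util Arrays))

;; U+4E2D encodes to three bytes in UTF-8 (E4 B8 AD).
(def three-byte-char-bytes (.getBytes "\u4e2d" StandardCharsets/UTF_8))

(defn decode-chunks-independently
  "Hypothetical helper: decodes each chunk-size slice of bs separately and
  concatenates the results. Nothing is carried over between slices."
  [^bytes bs chunk-size]
  (apply str
         (for [start (range 0 (alength bs) chunk-size)]
           (let [end (min (alength bs) (+ start chunk-size))]
             ;; The String constructor substitutes U+FFFD for malformed input.
             (String. (Arrays/copyOfRange bs (int start) (int end))
                      StandardCharsets/UTF_8)))))

;; Chunk size 3 keeps the character whole; chunk size 2 splits it.
(decode-chunks-independently three-byte-char-bytes 3)
(decode-chunks-independently three-byte-char-bytes 2)
```

With a chunk size of 3 the character survives. With a chunk size of 2 the result contains U+FFFD replacement characters instead, which matches the kind of corruption reported in this issue.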

ztellman commented 7 years ago

My understanding is that the CharsetDecoder should handle that properly. If your hunch is right, though, we'd expect the malformed characters to show up at the 4096th byte, since that's the default chunk size. To make a more minimal test case, maybe you can try specifying {:chunk-size 16} or something similar in the convert call?
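As a sketch of what "handling that properly" can look like at the JVM level: a single CharsetDecoder, told that more input is coming, leaves the bytes of an incomplete character unconsumed in the input buffer, so they can be decoded together with the next chunk. The helper below is hypothetical and simulates chunked arrival by raising the limit of one ByteBuffer. It assumes valid UTF-8 input and a positive chunk size, and it is not the byte-streams implementation.

```clojure
(import '(java.nio ByteBuffer CharBuffer)
        '(java.nio.charset StandardCharsets))

(defn decode-in-chunks
  "Hypothetical helper: decodes bs with one CharsetDecoder, exposing the input
  chunk-size bytes at a time. Bytes of an incomplete character stay in the
  input buffer and are decoded once the following chunk becomes visible."
  [^bytes bs chunk-size]
  (let [decoder (.newDecoder StandardCharsets/UTF_8)
        total   (alength bs)
        in      (ByteBuffer/wrap bs)
        ;; UTF-8 never decodes to more chars than it has bytes.
        out     (CharBuffer/allocate (alength bs))]
    (.limit in 0)
    (loop []
      (let [new-limit (min total (+ (.limit in) chunk-size))
            last?     (= new-limit total)]
        ;; Reveal the next chunk; undecoded bytes before it are preserved.
        (.limit in (int new-limit))
        (let [result (.decode decoder in out last?)]
          (when (.isError result)
            (.throwException result)))
        (if last?
          (do (.flush decoder out)
              (str (.flip out)))
          (recur))))))

;; Assumed sample input: one hundred 3-byte characters.
(def sample-text (apply str (repeat 100 "\u4e2d")))
(def sample-bytes (.getBytes ^String sample-text StandardCharsets/UTF_8))

(= sample-text (decode-in-chunks sample-bytes 16))
```

The key detail is the third argument to decode: passing false for every chunk except the last tells the decoder that a trailing partial character is not an error.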

joelittlejohn commented 7 years ago

I need to sleep now, but yes, I can easily create a minimal test case, as I'm pretty certain that the problem is as described above. I confirmed this using the method you described: if I create a string consisting only of 3-byte chars, I see the error at the 4096th byte boundary, and if I reduce the chunk size I see a lot more errors.
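The arithmetic behind this observation can be checked directly. Since 4096 is not a multiple of 3, a run of 3-byte characters must have one character whose bytes fall on both sides of the first chunk boundary. The helper below is a hypothetical illustration, and the character count of 2000 is an assumed input size.

```clojure
(defn straddling-chars
  "Hypothetical helper: returns the indices of the characters, out of
  char-count characters of bytes-per-char bytes each, whose first and last
  bytes land in different chunks of chunk-size bytes."
  [char-count bytes-per-char chunk-size]
  (for [i (range char-count)
        :let [first-byte (* i bytes-per-char)
              last-byte  (+ first-byte (dec bytes-per-char))]
        :when (not= (quot first-byte chunk-size)
                    (quot last-byte chunk-size))]
    i))

;; Default chunk size: only character 1365 (bytes 4095 to 4097) is split.
(straddling-chars 2000 3 4096) ;; => (1365)

;; Smaller chunk size: many more characters are split.
(count (straddling-chars 2000 3 16)) ;; => 250
```

This is consistent with seeing a single error at the 4096th byte boundary with the default chunk size and many more errors once the chunk size is reduced.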

ztellman commented 7 years ago

Okay, I'll see if I can track down what's happening.

joelittlejohn commented 6 years ago

Fixed by 29f50f7