Closed — joelittlejohn closed this issue 6 years ago
Well, they use different code paths (the `String` constructor for byte arrays, and `CharsetDecoder` for streams), but obviously I'd expect them to be equivalent. Can you provide a failing test case?
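(An aside, not from the thread: a minimal sketch of the two JDK decode paths mentioned above. When the `CharsetDecoder` sees the complete input at once, it agrees with the `String` constructor; the divergence only appears once the input is fed to the decoder in chunks. The class name `DecodePaths` and the sample text are illustrative.)

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodePaths {
    public static void main(String[] args) throws Exception {
        byte[] bytes = "汉字文本".getBytes(StandardCharsets.UTF_8);

        // Path 1: the String constructor, used for byte arrays.
        String viaConstructor = new String(bytes, StandardCharsets.UTF_8);

        // Path 2: a CharsetDecoder, as used for streams.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = decoder.decode(ByteBuffer.wrap(bytes));
        String viaDecoder = out.toString();

        // Given the whole input in one buffer, the two paths agree.
        System.out.println(viaConstructor.equals(viaDecoder)); // true
    }
}
```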
I'm afraid I'm struggling to create a minimal example here. I have 4k of text that demonstrates the problem, but I can't share it. So far I have failed to minimize it further or to find a random string that triggers the same problem.
Could you list the conversion steps from `ByteArrayInputStream` to `String`? I'm having a hard time understanding the high-level conversions made via the conversion graph by step-debugging byte-streams. Maybe I can take my private example through these steps manually to see where the errors arise.
The `InputStream` is turned into a `ByteSource` [1], then the `ByteSource` is turned into a `CharSequence` [2]. Hope that helps, let me know if you have any other questions.

[1] https://github.com/ztellman/byte-streams/blob/master/src/byte_streams.clj#L526
[2] https://github.com/ztellman/byte-streams/blob/master/src/byte_streams/char_sequence.clj#L81
My hunch is that this is an issue of single characters spanning a chunk-size boundary in `byte-streams.char-sequence/lazy-char-buffer-sequence`, causing a Unicode replacement character to be emitted by the decoder.
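(A sketch of that hunch in plain JDK terms, not code from byte-streams: a three-byte UTF-8 character split across two chunks is destroyed if each chunk is decoded independently, but survives when one `CharsetDecoder` is kept alive across chunks and unconsumed bytes are carried over with `compact()`. The class name `ChunkBoundary` is illustrative.)

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ChunkBoundary {
    public static void main(String[] args) {
        // "中" is three bytes in UTF-8 (E4 B8 AD); split them across two chunks.
        byte[] all = "中".getBytes(StandardCharsets.UTF_8);
        byte[] chunk1 = {all[0], all[1]};
        byte[] chunk2 = {all[2]};

        // Naive: decode each chunk independently. The truncated sequence
        // becomes U+FFFD and the character is lost.
        String naive = new String(chunk1, StandardCharsets.UTF_8)
                     + new String(chunk2, StandardCharsets.UTF_8);
        System.out.println(naive.contains("\uFFFD")); // true

        // Correct: keep one decoder alive across chunks. decode(..., false)
        // leaves the incomplete trailing bytes unconsumed in `in`, and
        // compact() preserves them at the front for the next chunk.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(8);
        CharBuffer out = CharBuffer.allocate(8);

        in.put(chunk1).flip();
        decoder.decode(in, out, false); // underflow: leftover bytes remain
        in.compact();

        in.put(chunk2).flip();
        decoder.decode(in, out, true);  // endOfInput = true
        decoder.flush(out);

        out.flip();
        System.out.println(out.toString()); // 中
    }
}
```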
My understanding is that the `CharsetDecoder` should handle that properly, but if so then we'd expect the malformed characters to show up at the 4096th byte, since that's the default chunk size. Maybe to make a more minimal test case you can try specifying `{:chunk-size 16}` or something in the `convert` call?
I need to sleep now, but yes, I can easily create a minimal test case now, as I'm pretty certain the problem is as described above. I confirmed this using the method you described: if I simply create a string of 3-byte chars, I see the error at the 4096th-byte boundary. If I reduce the chunk size I see many more errors.
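(A self-contained reproduction of that observation, again using only JDK APIs rather than byte-streams itself; the class name `Repro` and the helper names are illustrative. Decoding a long run of 3-byte characters in fixed-size chunks corrupts a character at every chunk boundary, and shrinking the chunk size multiplies the corruption, just as described above.)

```java
import java.nio.charset.StandardCharsets;

public class Repro {
    // Deliberately buggy: decodes each fixed-size chunk on its own,
    // like a chunked reader that does not carry leftover bytes over.
    static String naiveChunkedDecode(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < bytes.length; off += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - off);
            sb.append(new String(bytes, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    static long countReplacements(String s) {
        return s.chars().filter(c -> c == 0xFFFD).count();
    }

    public static void main(String[] args) {
        // A few thousand 3-byte characters; 4096 is not a multiple of 3,
        // so every 4096-byte boundary splits a character.
        String original = "汉".repeat(4000);
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8); // 12,000 bytes

        // A handful of replacement chars at the 4096-byte boundaries...
        System.out.println(countReplacements(naiveChunkedDecode(bytes, 4096)));
        // ...and many more once the chunk size shrinks to 16.
        System.out.println(countReplacements(naiveChunkedDecode(bytes, 16)));
    }
}
```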
Okay, I'll see if I can track down what's happening.
Fixed by 29f50f7
I have a `ByteArrayInputStream` `b` that contains Chinese characters in UTF-8. I've found that I get bad data from byte-streams when I do this: I get a string in which some characters are corrupt (2 out of a few thousand).
I've found that I get good data when I do this:
I've compared the bytes I get in each of the string results above, and I note that in the first example a tiny handful of bytes (6 out of 17,300) appear to be different to the bytes in the original input stream. In the latter example the bytes are identical to the input (hence no corrupt chars).
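(A quick consistency check, not from the thread: U+FFFD, the replacement character, encodes to three bytes in UTF-8 (EF BF BD), the same width as a three-byte Chinese character, so two corrupt characters would show up as exactly the six differing bytes reported above.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementSize {
    public static void main(String[] args) {
        // U+FFFD is the same UTF-8 width as a 3-byte CJK character,
        // so substituting it changes the bytes but not the byte count.
        byte[] fffd = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(fffd.length);           // 3
        System.out.println(Arrays.toString(fffd)); // [-17, -65, -67] = EF BF BD
    }
}
```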
What could cause `convert` to treat an input stream of bytes differently to an array of bytes when converting to a string?