Bottled Water's current JSON encoding (--output-format=json) is as per the Avro JSON encoding spec, which encodes binary data as a JSON string by mapping bytes 1-1 to Unicode codepoints in the range 0-255 (i.e. \u0000 - \u00ff in JSON string syntax).
This produces a valid JSON string, since JSON strings may contain arbitrary Unicode codepoints provided they are escaped in that manner. Note however that this is actually quite different from normal Unicode encoding, which encodes strings (i.e. sequences of codepoints) as sequences of bytes; this is encoding a byte sequence as a string.
This has a couple of downsides. The first is that it's pretty confusing - consumers parsing the message and trying to interpret the string as a normal Unicode string will get something nonsensical. To retrieve the original bytes, you need to look at the codepoints of the resulting string, not its bytes as you might intuitively guess. e.g. in Ruby:
```ruby
require 'json'

# original bytes were 0xbeef00, aka [190, 239, 0]
beef = JSON.parse('{"bytes":"\u00BE\u00EF\u0000"}')['bytes']
# => "¾ï\u0000"
beef.bytes
# => [194, 190, 195, 175, 0] # not what we started with!
beef.codepoints
# => [190, 239, 0] # that's better
```
Some languages refuse to even parse such JSON if it contains codepoints not usually allowed in a string, even though strictly speaking it is valid JSON. e.g. in Node:
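(The snippet below is a reconstruction rather than a verbatim transcript; the exact error message varies by Node version.)

```js
// reconstructed example - not a verbatim transcript
> JSON.parse('{"bytes":"\u00be\u00ef\u0000"}')
SyntaxError: Unexpected token
```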
Strictly I believe this is a bug in Node's JSON parser, but it's still going to trip people up.
Secondly, this encoding is very inefficient: every byte becomes six bytes (\u00xx). JSON already isn't the most compact encoding in the world, but a 6x blowup is pretty bad.
We should probably just base64 the binary data and store that in a string instead, e.g. {"base64":"vu8A"}.
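For reference, that example value is the same 0xbeef00 from the Ruby snippet above; a quick round trip using only the Ruby standard library:

```ruby
require 'base64'
require 'json'

# the same bytes as before: 0xbeef00, aka [190, 239, 0]
Base64.strict_encode64("\xBE\xEF\x00".b)
# => "vu8A"

Base64.decode64(JSON.parse('{"base64":"vu8A"}')['base64']).bytes
# => [190, 239, 0] # round-trips cleanly
```

Three bytes become four base64 characters, so the overhead drops from 6x to about 1.33x.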
Unfortunately, we probably can't just change avro-c's encoding to do this, since its JSON encoding is defined in the Avro spec (although maybe that should change too?). Instead we will probably need to stop leaning on avro-c to generate our JSON for us.
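As a very rough sketch of what that might look like - this is not Bottled Water's actual code, and b64_encode() is a hypothetical helper - we could read the value via avro-c and build the JSON ourselves with jansson, special-casing bytes:

```c
#include <avro.h>
#include <jansson.h>
#include <stdlib.h>

/* hypothetical helper: returns a malloc'd, NUL-terminated base64 string */
extern char *b64_encode(const void *buf, size_t len);

/* Sketch: encode an Avro bytes value as {"base64": "..."} instead of
 * letting avro-c's JSON encoder escape each byte as \u00xx. */
static json_t *bytes_to_json(avro_value_t *value)
{
    const void *buf;
    size_t size;
    if (avro_value_get_bytes(value, &buf, &size) != 0) {
        return NULL;
    }

    char *b64 = b64_encode(buf, size);
    if (b64 == NULL) {
        return NULL;
    }

    json_t *obj = json_object();
    json_object_set_new(obj, "base64", json_string(b64));
    free(b64);
    return obj;
}
```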