confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0

JSON encoding for binary types is confusing and inefficient #69

Open samstokes opened 8 years ago

samstokes commented 8 years ago

Bottled Water's current JSON encoding (--output-format=json) follows the Avro JSON encoding spec, which encodes binary data as a JSON string by mapping each byte 1-to-1 to a Unicode codepoint in the range 0-255 (i.e. \u0000 - \u00ff in JSON string escape syntax).

This produces a valid JSON string, since JSON strings may contain arbitrary Unicode codepoints provided they are escaped in that manner. Note, however, that this is quite different from normal Unicode encoding: a Unicode encoding turns a string (a sequence of codepoints) into a sequence of bytes, whereas this turns a byte sequence into a string.
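To make the mapping concrete, here is a minimal Ruby sketch of the encoding side (hand-rolled purely for illustration; it is not the code path Bottled Water actually uses, which goes through avro-c):

# Map each byte to a \u00XX escape, i.e. to a codepoint in 0-255
bytes = [0xbe, 0xef, 0x00]
escaped = bytes.map { |b| format('\u%04x', b) }.join
# escaped now holds the text \u00be\u00ef\u0000
# embedded in a message it would appear as {"bytes":"\u00be\u00ef\u0000"}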

This has a couple of downsides. The first is that it's pretty confusing: consumers that parse the message and try to interpret the string as a normal Unicode string will get something nonsensical. To retrieve the original bytes, you need to look at the codepoints of the resulting string, not (as you might intuitively guess) its bytes. e.g. in Ruby:

# original bytes were 0xbeef00, aka [190, 239, 0]
beef = JSON.parse('{"bytes":"\u00BE\u00EF\u0000"}')['bytes']
# => "¾ï\u0000"
beef.bytes
# => [194, 190, 195, 175, 0] # not what we started with!
beef.codepoints
# => [190, 239, 0] # that's better

Some languages refuse even to parse such JSON if it contains codepoints not normally allowed in a string, even though strictly speaking it is valid JSON. e.g. in Node:

beef = JSON.parse('{"bytes":"\u00BE\u00EF\u0000"}')
// SyntaxError: Unexpected token
//    at Object.parse (native)

Strictly speaking I believe this is a bug in Node's JSON parser, but either way it's going to trip people up.

Secondly, this encoding is very inefficient: every byte becomes six characters (\u00XX). JSON encoding already isn't the most space-efficient in the world, but a 6x blowup is pretty bad.
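A quick check of that arithmetic in Ruby:

format('\u%04x', 0xbe).length  # => 6, one byte becomes six characters
'\u00be\u00ef\u0000'.length    # => 18 characters for 3 bytes of payload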

We should probably just base64-encode the binary data and store that in a string instead, e.g. {"base64":"vu8A"}.
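For illustration, a minimal Ruby sketch of that proposed representation (the base64 field name is taken from the example above; the exact envelope would be a design decision):

require 'base64'
require 'json'

raw = [0xbe, 0xef, 0x00].pack('C*')  # the original bytes
json = JSON.generate('base64' => Base64.strict_encode64(raw))
# => {"base64":"vu8A"} (4 characters of payload instead of 18)

Base64.strict_decode64(JSON.parse(json)['base64']).bytes
# => [190, 239, 0], round-tripping to the original bytes

That's roughly 33% overhead (4 characters per 3 bytes), versus the 6x blowup of the codepoint escaping.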

Unfortunately, we probably can't just change avro-c's encoding to do this, since its JSON encoding is defined in the Avro spec (although maybe that should change too?). Instead we will probably need to stop leaning on avro-c to generate our JSON for us.

phiresky commented 6 years ago

Node parses that JSON fine, by the way. You just need to escape the backslashes in the JavaScript string literal, so that JSON.parse receives the \u escape sequences rather than the raw control characters the literal would otherwise contain:

JSON.parse('{"bytes":"\\u00BE\\u00EF\\u0000"}')