UrbanOS-Public / smartcitiesdata

The core microservices of UrbanOS, organized as an umbrella project with component documentation
Apache License 2.0

Should we compress messages to and from Kafka? #186

Closed: jeffgrunewald closed this issue 5 years ago

jeffgrunewald commented 5 years ago

Can we improve resource utilization and reduce the risk that large datasets pose to the system by compressing messages when they are written to Kafka topics?
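One option worth noting: Kafka clients can apply compression at the producer level without touching application code. A minimal sketch assuming the brod client (the client id, broker endpoint, and topic name below are placeholders, not our actual config):

```elixir
# Hedged sketch: enable snappy compression on a brod producer.
# Client id, broker endpoint, and topic are placeholders.
:ok = :brod.start_client([{'localhost', 9092}], :example_client)

:ok =
  :brod.start_producer(:example_client, "example-topic", [
    # brod also accepts :gzip or :no_compression here
    {:compression, :snappy}
  ])
```

With producer-side compression, the broker and consumers decompress transparently, so this is an alternative to compressing individual message payloads ourselves.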

Questions:

Tech Note:

ACs

LtChae commented 5 years ago

Results:

*** &Xip.base/1 ***
1.1 sec    65K iterations   16.83 μs/op

*** &Xip.zlib/1 ***
1.3 sec    32K iterations   40.09 μs/op

*** &Xip.snappy/1 ***
1.3 sec    65K iterations   20.42 μs/op

*** &Xip.protobuf/1 ***
1.0 sec    32K iterations   30.75 μs/op

*** &Xip.lz4/1 ***
1.2 sec    65K iterations   19.26 μs/op

[byte_size: [base: 400, snappy: 370, zlib: 293, protobuf: 290, lz4: 367]]

Here we see that zlib is by far the slowest: it only achieves roughly the same compression as protobuf while taking about 10 μs/op longer. Zlib's gzip format, however, is more portable.
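For context, the zlib path amounts to gzip-compressing the encoded JSON with the module that ships in Erlang/OTP; the payload below is illustrative:

```elixir
# :zlib.gzip/1 and :zlib.gunzip/1 are built into Erlang/OTP, so no extra
# dependency is needed and the output is readable by any gzip-aware consumer.
json = Jason.encode!(%{dataset_id: "example", payload: %{temp: 72}})
compressed = :zlib.gzip(json)
^json = :zlib.gunzip(compressed)
```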

Snappy performs well overall, tied with lz4 for the fastest speed. However, its compression of a single message isn't much better than the uncompressed JSON.

Protobuf is a clear winner for speed and compression ratio, but it ties our messages to a specific schema that we have to distribute and version. Some additional performance might be gained by making our SmartCity structs protobuf(able) structs, saving the extra step of deconstructing them before restructuring them into Protobuf-generated structs.
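As a rough illustration of what a protobuf(able) struct could look like with the Elixir protobuf library (the module and field names here are hypothetical, not the real SmartCity schema):

```elixir
# Hypothetical message definition; fields are illustrative only and do not
# reflect the actual SmartCity.Data struct or its schema versioning.
defmodule SmartCity.Data.Proto do
  use Protobuf, syntax: :proto3

  field :dataset_id, 1, type: :string
  field :payload, 2, type: :string
end

# Encoding would then go straight from struct to wire bytes:
#   SmartCity.Data.Proto.new(dataset_id: "example", payload: json)
#   |> SmartCity.Data.Proto.encode()
```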

Lz4 is generally considered a superior compression scheme to snappy, but the Erlang NIF implementation only barely edges snappy out here.
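For completeness, the snappy path boils down to a single NIF round trip; a sketch assuming the snappyer library (the same NIF brod uses), with the lz4 NIF following an analogous compress/decompress pattern:

```elixir
# Hedged sketch: snappy compression via the snappyer NIF; the benchmark's
# actual library choice may differ.
json = Jason.encode!(%{dataset_id: "example", payload: %{temp: 72}})
{:ok, compressed} = :snappyer.compress(json)
{:ok, ^json} = :snappyer.decompress(compressed)
```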

LtChae commented 5 years ago

Methodology: https://github.com/SmartColumbusOS/Xip
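The harness in that repo defines the five functions above; as a rough sketch of how such a comparison could be reproduced with Benchee (the payload and job names below are assumptions, not the original code):

```elixir
# Hedged reproduction sketch using Benchee; the real harness lives in the Xip repo
# and this payload is a stand-in, not the dataset behind the numbers above.
message = Jason.encode!(%{id: "sample", payload: List.duplicate(%{temp: 72}, 10)})

Benchee.run(%{
  "base" => fn -> message end,
  "zlib" => fn -> :zlib.gzip(message) end
  # snappy, lz4, and protobuf jobs would call their respective libraries the same way
})

IO.inspect(byte_size: [base: byte_size(message), zlib: byte_size(:zlib.gzip(message))])
```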