As a developer with limited network resources and CPU time to spare, I want to compress large payloads before they go out on the network.
For the most part, this only applies to Flight/Barrage payloads; our other gRPC messages tend to be small enough that compression is not worth considering. That said, while Flight has its own compression options, we should also consider gRPC's compression tools.
Arrow
Arrow buffers in RecordBatches can be compressed; see https://github.com/apache/arrow/blob/9287bd7eca701b7cd3fd3e3b8bb30d82fcaea396/format/Message.fbs#L45-L79
Arrow Flight's compression tools appear to be difficult to work with. When initiating a DoGet call, there is no metadata or message field that can be populated to indicate that the resulting RecordBatches can/should be compressed, or which compression formats the receiver supports. Likewise, when sending a DoPut, while the server will acknowledge the incoming stream in some way, there is no negotiation for "can I compress these buffers before sending them?". Without some kind of signaling to indicate support, our server can't send compressed data without foreknowledge that all clients will support receiving it, etc.
Java
The Java FlightClient appears capable of reading any compressed RecordBatches, but not writing them: FlightClient.startPut instantiates the package-protected PutObserver, whose (also package-protected) superclass OutboundStreamListenerImpl instantiates a VectorUnloader using the constructor that disables compression.
Python (and probably C++?)
The pyarrow client appears less limited: the IpcWriteOptions type carries compression information, an instance of which can be wrapped in a FlightCallOptions and passed to the do_put method to compress buffers as they are serialized. I would assume that reading via do_get decompresses in the same way.
JavaScript
Since we rolled our own Flight library, we need to provide our own decompression implementations. Browsers now ship with decompression tools, but at this time they are limited to deflate and gzip, while the only compression codecs Flight supports are LZ4 and Zstd. This means we'll need to add an external dependency to implement this.
That said, our current JS use cases will be more sensitive to latency and CPU time than to bandwidth: large payloads are limited by the time we spend blocking the UI while reading them, and small payloads likely won't be big enough to be worth compressing.
gRPC
gRPC messages can be compressed, sent by the client, the server, or both (it need not be symmetrical). Client requests may specify metadata (grpc-accept-encoding) signaling supported compression, and the server will respond with what it supports of that list. Signaling that a given stream will be compressed is a different header (grpc-encoding). This does leave a window where the client might have already sent data that the server can't read; in that case, the server will respond with an error. The converse is not true: the server will already know which compression codecs the client supports before beginning to write anything.
https://grpc.io/docs/guides/compression/
https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md
https://github.com/grpc/grpc/blob/master/doc/compression_cookbook.md
https://github.com/grpc/grpc/blob/master/doc/compression.md
We could make the assumption that the supported server codecs can be negotiated on the first request/response (along with authentication, etc.), and from there the client knows what compression it can use.
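The selection step can be sketched as a plain function (a simplified behavioral model of the grpc-accept-encoding handshake only, not part of any gRPC library; real servers advertise their full supported list back rather than committing to one codec):

```python
def negotiate_encoding(grpc_accept_encoding: str, server_codecs: set) -> str:
    """Pick the first client-advertised codec this peer also supports.

    grpc-accept-encoding is a comma-separated list of codec names;
    "identity" means no compression and is always safe to fall back to.
    """
    for codec in (c.strip() for c in grpc_accept_encoding.split(",")):
        if codec in server_codecs:
            return codec
    return "identity"  # no mutually supported codec: send uncompressed
```

For example, `negotiate_encoding("gzip, deflate", {"deflate", "identity"})` selects `"deflate"`, while a client advertising only `"zstd"` against a gzip-only server falls back to `"identity"`.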
At a glance, it appears that the C++, Python, and Java gRPC clients should all support this feature, though I'm not sure yet how to turn it on, or how to handle the "check on the first request" signaling. The Java client, however, only supports gzip out of the box; Python and C++ support deflate as well.
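For the Python client at least, turning it on appears to be a channel-level or per-call option (a sketch using grpcio's public API; the address and the stub in the comment are placeholders):

```python
import grpc

# Channel-wide default compression; channel creation is lazy, so this
# does not require a reachable server.
channel = grpc.insecure_channel(
    "localhost:50051",  # placeholder address
    compression=grpc.Compression.Gzip,
)

# Per-call override on a (hypothetical) stub:
# response = stub.SomeMethod(request, compression=grpc.Compression.NoCompression)

channel.close()
```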
Our JS gRPC client is not homegrown, but it also does not support this feature. However, deflate and gzip are both supported by the browser, so we could fork the gRPC client; supporting this would not add additional library size costs (though it might still be less helpful than in other clients).
HTTP
HTTP also has its own compression options, but I'm unsure yet whether standard gRPC clients will respect it. The headers are very similar to gRPC's (accept-encoding and content-encoding, respectively).
https://developer.mozilla.org/en-US/docs/Web/HTTP/Compression
Technically, proxies may strip off or add this layer of compression. I'm unclear whether proxies would be expected to do the same for gRPC compression.
Gzip, deflate, and compress (LZW) are well supported (and included in the HTTP/1.1 spec), though brotli and zstd are becoming more popular too.
Security considerations
The CRIME and BREACH attacks make it possible for an attacker to read secret information. They apply in cases where the attacker can observe the encrypted (TLS, etc.) data and can inject some part of the plaintext to be compressed in the same stream as confidential data. For this reason, any compression must be able to be enabled/disabled by each endpoint, subject to its own threat model and use case. We should default to disabling compression to be sure we "fail safe".
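The compression length side-channel at the core of these attacks is easy to demonstrate with plain zlib (an illustration only; the secret and inputs here are made up):

```python
import zlib

SECRET = b"sessionid=7f3a9c1d2e"

def observed_length(attacker_input: bytes) -> int:
    # The attacker can only see the length of the encrypted stream, but
    # compression happens before encryption, so length still leaks.
    return len(zlib.compress(SECRET + b";echo=" + attacker_input))

correct = observed_length(b"sessionid=7f3a9c1d2e")  # guess matches the secret
wrong = observed_length(b"sessionid=x9q2w8e7r1")    # guess does not match
# The matching guess deduplicates against the secret and compresses smaller,
# so comparing lengths tells the attacker whether the guess was right.
```

Repeating this byte-by-byte lets an attacker recover the secret incrementally, which is why compression of attacker-influenced plaintext alongside secrets must be something each endpoint can switch off.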