feature request: blob level compression

mostynb commented 4 years ago

In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs for the languages I checked require implementors to provide data in unserialized and uncompressed form.

This has two (related) practical downsides for REAPI caches: 1) Cache servers who want to support compression will end up recompressing the same data many times. 2) Cache servers who choose to store blobs in compressed form will also end up re-decompressing the same data many times.

Contrast this with Bazel's more primitive HTTP remote cache protocol, where it is trivial for cache servers to store and serve compressed blobs by inspecting the request's Accept-Encoding header.

I think it might be worth investigating adding optional blob level compression to REAPIv3, to avoid these downsides.

This might involve: 1) Adding a new optional metadata field to the byte stream API's resource name to specify the compression format for uploads 2) Adding a new optional metadata field to the byte stream API's resource name to specify a set of acceptable compression formats for downloads. 3) Updating the Capabilities service for servers to announce supported compression formats. 4) Updating the ActionCache and ContentAddressableStorage services to allow clients to selectively upload and request for download inlined blobs in particular compressed formats.

If we decide to implement encryption and authenticity checks (#133), then we may need to consider these two features at the same time during the design phase.

edbaunton commented 4 years ago

Huge +1 from me. I was talking about this just today! We want to compress data in our in-memory CAS implementation. We have been experimenting with using compression at the gRPC layer but that is, as you said, completely undone when you put it in the CAS.

My immediate thought was to model these the same way we model the different digesting algorithms: via capabilities. As I understand things today, you cannot switch between digesting algorithms without restarting everything from scratch (including the CAS, although that can support multiple digests simultaneously per capabilities), so, if we want to support it in v2, we could just add a new capability called Compression. That could be dictated by the server just like digesting function is today. Apart from the inelegance, what would the downsides be there that I have missed?

EricBurnett commented 4 years ago

Convenient timing - we've just been working through trying to leverage gRPC transport-level compression ourselves, and finding it very difficult to capitalize on in practice, to the point we're giving up on it for now. So +1 to exploring API-level compression instead.

What we've gotten to in our exploration thus far is:

The ideal design (from our perspective) keeps blobs addressed by the digest of their uncompressed payload, even when transported with compression. (I think that's consistent with Mostyn's suggestion, and not Ed's?) This stems from a few motivations:
- Not having to pay the cost of compression when making existence checks. (Right now, digest the file or get the digest from some other layer (e.g. fuse). Undesired: have to compress and then digest the file, as even the fastest compression is slower than SHA-256 today, and SHA-256 has optimized instructions already landing in the most recent hardware.)
- Keeping interop with clients who don't understand compression.
- Not wanting the CAS to be strongly coupled with one exact compression format (what we use for storage is not exactly what we want to use for transport, and I could imagine wanting to support other compression algorithms later as they're invented).
- zstd-1 is fast and cheap enough to compress and decompress data on the fly. Much better than all other options we evaluated, and in particular, gzip materially worse.
- There are a few wrinkles w.r.t ByteStream, in particular related to offset reads and writes. If you don't support offsets, ignore this). I generally think offset and read_limit should represent offsets into the uncompressed bytes, which means you have to be willing to recompress on the fly. Otherwise offsets are only usable for resuming broken reads/writes with a continuation of the exact byte sequence, but not for any other use-cases (pagination, appending chunks to a log, etc); read_limit is semantically nonsense (truncated compressed stream?) unless referring to the uncompressed bytes. (Some of this doesn't apply to the CAS as usually used, but e.g. we use ByteStream for log streams (see other bugs/threads), and I'd like consistent semantics across these.)

Mostyn, to your ideas specifically:

Agreed
Allowing multiple optional compression algorithms for response has a wrinkle that's messy - how does the server communicate which compression the response is compressed with for a ByteStream response? And it's mostly obviated by #3, where you can learn the supported protocols directly. So I'd prefer to keep this one simple and just say "specify the compression format for downloads", with a format the server has declared support for in Capabilities.
Agreed.
Note that this one will have diminishing returns - ByteStream will very likely remain the majority of bytes moved and so represents the majority of potential impact, and for small blobs there's a threshold below which it's not actually worth compressing the bytes anyways (takes too long vs the bytes saved). I suspect for ContentAddressibleStorage it might be worth compressing the blobs as a group but not individually, or compressing some blobs but not others (and yes, doing so on the fly either way); for ActionCache probably not worth it at all for inlined blobs? But I'd actually suggest we start with neither, for simplicity, and just focus on ByteStream initially.

mostynb commented 4 years ago

My immediate thought was to model these the same way we model the different digesting algorithms: via capabilities. As I understand things today, you cannot switch between digesting algorithms without restarting everything from scratch (including the CAS, although that can support multiple digests simultaneously per capabilities), so, if we want to support it in v2, we could just add a new capability called Compression. That could be dictated by the server just like digesting function is today. Apart from the inelegance, what would the downsides be there that I have missed?

One problem with the Capabilities API is that it only describes the server side- the server can't use it to check capabilites of the client. The client would need to specify its preferences/capabilites separately, eg in the resource_name for the byte stream API, or maybe via new fields in the other protos.

(This isn't a problem for digests because they have mostly non-overlapping key sizes, but I would prefer a nicer fix for this in v3 too, see eg #136.)

mostynb commented 4 years ago

Convenient timing - we've just been working through trying to leverage gRPC transport-level compression ourselves, and finding it very difficult to capitalize on in practice, to the point we're giving up on it for now. So +1 to exploring API-level compression instead.

What we've gotten to in our exploration thus far is:

The ideal design (from our perspective) keeps blobs addressed by the digest of their uncompressed payload, even when transported with compression. (I think that's consistent with Mostyn's suggestion, and not Ed's?) This stems from a few motivations:

+1, I think we should always offer uncompressed blobs, since the cache is oblivious to what is being stored. Only clients have the information about which blobs are likely to benefit from compression, so IMO it should be opt-in from the client side.

There are a few wrinkles w.r.t ByteStream, in particular related to offset reads and writes. If you don't support offsets, ignore this). I generally think offset and read_limit should represent offsets into the uncompressed bytes, which means you have to be willing to recompress on the fly. Otherwise offsets are only usable for resuming broken reads/writes with a continuation of the exact byte sequence, but not for any other use-cases (pagination, appending chunks to a log, etc); read_limit is semantically nonsense (truncated compressed stream?) unless referring to the uncompressed bytes. (Some of this doesn't apply to the CAS as usually used, but e.g. we use ByteStream for log streams (see other bugs/threads), and I'd like consistent semantics across these.)

I'm also interested in exploring other potential solutions, eg: introduce a new Blob proto which contains a list of one or more Digests with corresponding encodings. That might make seekable compressed byte streams easy, since we could address compressed blobs directly.

Mostyn, to your ideas specifically:

Agreed

Allowing multiple optional compression algorithms for response has a wrinkle that's messy - how does the server communicate which compression the response is compressed with for a ByteStream response? And it's mostly obviated by #3, where you can learn the supported protocols directly. So I'd prefer to keep this one simple and just say "specify the compression format for downloads", with a format the server has declared support for in Capabilities.

Hmm, right. It would be good to avoid needing to look for magic bytes in the response.

Agreed.

Note that this one will have diminishing returns - ByteStream will very likely remain the majority of bytes moved and so represents the majority of potential impact, and for small blobs there's a threshold below which it's not actually worth compressing the bytes anyways (takes too long vs the bytes saved). I suspect for ContentAddressibleStorage it might be worth compressing the blobs as a group but not individually, or compressing some blobs but not others (and yes, doing so on the fly either way); for ActionCache probably not worth it at all for inlined blobs? But I'd actually suggest we start with neither, for simplicity, and just focus on ByteStream initially.

+1 to focusing on ByteStream first, and worrying about the other services later.

erikmav commented 4 years ago

A note for implementers: With this change you're likely to need to keep a map from:

(uncompressed digest) -> list-of (compressed digest reference in service CAS, compression algo)

The reason being, you're likely sooner or later to want to cache computation by re-compressing files into the algorithm(s) that (a) provide the best compression ratio, and (b) are most provided in the accepted content format list from clients.

Kind of like if you're streaming online video to a 4K TV and a phone with low res, the back-end might dynamically recode for the small screen on first call to provide fewer bits on the wire, but keep the result (or kick off a parallel job to create) and store a low res version of the high res content.

johnterickson commented 4 years ago

Hi! @erikma pointed me to this thread.

I'm the one responsible for the weird VSO0 hash. The reason that the VSO0 has is weird is that S3/Blob support the concept of uploading large objects in blocks. The VSO0 hash allows the server to verify the hash of each of these pieces independently. i.e. if you have a 10GB file, at no point does the server need to stream the data to a temporary location as it hashes it and then only copy it to a final location once the hash is verified.

This gets tricky with compression. Some options:

Compress the whole file first, then upload blocks of the compressed file. As @EricBurnett points out, this means you have to compress to do an existence check - yuck.
Compress each block individually. This could be ok, but it means the downloader would need to understand that the content is not an uncompressed stream, nor a compressed stream, but is a concatenation of multiple compressed streams.
Split files into pieces and store the pieces in the CAS. Each piece is small enough that the it can be hash-validated in memory. This is what we do for Pipeline Artifacts.

To be clear, I'm not saying "don't do compression because of VSO0" 😊 - just that it wouldn't make sense for that particular hashing/upload mechanism.

erikmav commented 4 years ago

(VSO0 == VsoHash in the digest algorithm list in the RE protobuf)

EricBurnett commented 4 years ago

Compress each block individually. This could be ok, but it means the downloader would need to understand that the content is not an uncompressed stream, nor a compressed stream, but is a concatenation of multiple compressed streams.

Many compression algorithms, including zstd (my recommendation for folk to choose), are concatenatable: you can pipe a sequence of individual compressed streams to a single decompressor. This gives you the option of uploading or storing individual chunks if you wish, but streaming them out to a single client in sequence without intermediate recompression.

at no point does the server need to stream the data to a temporary location as it hashes it and then only copy it to a final location once the hash is verified.

Assuming you can rename a file in the destination system, this is not really a bad thing imho, and frequently desired anyways - streaming to the final location directly causes problems with concurrent uploads, slow uploads, partial uploads, etc. If you're storing in independent chunks with a small enough chunk size this is maybe not a problem (10s of kilobytes say), but that's then bad for disk IO and volume of metadata since you have highly fragmented files. But even at 1MB chunks I'd still recommend a temp-file-and-rename pattern. (Though storage systems differ wildly; your mileage may vary).

That said, there are definitely valid reasons to chunk on upload, including getting single-file parallelism if you can upload multiple chunks faster than streaming a whole file in sequence. Not currently supported in the CAS API directly, but if you want to side-channel data into your storage through some other API and expose it through the CAS for subsequent interactions, should work well!

erikmav commented 4 years ago

For single-file parallelism: We should consider a ByteStream change that specifies start offset and size within an uploading file. I have a note in our implementation of ByteStream to go experiment with custom headers for that, as we're often uploading one large file and overlapping chunks already works well for saturating the pipe for downloads (which do support reading from offsets).

mostynb commented 4 years ago

A note for implementers: With this change you're likely to need to keep a map from:
(uncompressed digest) -> list-of (compressed digest reference in service CAS, compression algo)

If cache implementations were to expose that to clients somehow, then compressed blobs would be directly addressable and we would not need any changes to the download part of the bytestream API, and I suspect VSO0 would not pose any trouble. If so, then I wonder if we can find a reasonable way to do this in v2?

A new rpc in the ContentAddressableStorage service could do this at the cost of an extra roundtrip. Otherwise we could add new fields to existing messages like OutputFile.

johnterickson commented 4 years ago

Assuming you can rename a file in the destination system

Unfortunately neither Azure Blob nor S3 offer object rename 😕

They do offer partial uploads that handle concurrency well: "When you upload a block to a blob in your storage account, it is associated with the specified block blob, but it does not become part of the blob until you commit a list of blocks that includes the new block's ID"

https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs

EricBurnett commented 4 years ago

John: ahh right, didn't realize that. I took a quick look at the APIs: for S3 you'd start (potentially multiple independent) multi-part uploads for a blob, upload bytes as they come hashing along the way, and then when complete and valid one writer would 'commit' the blob making it visible. That's similar to the concept of temp file + rename, you just don't have to pick your own id for the temp file (it's provided by S3 in the form of the upload ID). For Azure it looks like you would have multiple writers writing (potentially redundant) blocks to the same logical blob under different block IDs, and when one of them was done, it'd commit the block list with all the blocks it uploaded and knew were correct, implicitly discarding the rest. So you're right, no 'rename', but the same abstract idea of "stream bytes in without making them visible, hashing as you go, and then commit the finalized blob when it's present and valid" should work fine.

Erik: IIRC, VSO0 is a hash-of-hashes? For single-file parallelism, using ByteStream to upload non-contiguous chunks of the same file independently seems tricky. Independent uploaders uploading independent chunks of the same file would be tricky, as the server couldn't validate anything about them till all chunks were seen, and it might have seen multiple different candidates for any given byte range (since it does not yet know what digests to expect for chunks). I guess the server would keep track of what chunks of the whole had been uploaded as candidates, and when it thought the whole file was covered, it could try using them to complete the top-level hash and make the whole blob visible?

I might suggest a different pattern for the same - upload each of those N chunks as different blobs, then use another (not currently present) API to synthesize one blob out of multiple - akin to Azure's use of uploading a chunk list to create a "final" blob. In your case you could have hashed each of those N blobs pre-validated server-side with the appropriate algorithm on upload, such that creating an N-blob synthesis with VSO0 hash would be quick to stitch (validate all referenced blobs are actually present; hash the hashes of them to ensure it produces the right overall hash, then store metadata reflecting that this blob is actually composed of

). On Wed, Jul 15, 2020 at 8:59 PM John Erickson wrote: > Assuming you can rename a file in the destination system > > Unfortunately neither Azure Blob nor S3 offer object rename 😕 > > They do offer partial uploads that handle concurrency well: "When you > upload a block to a blob in your storage account, it is associated with the > specified block blob, but it does not become part of the blob until you > commit a list of blocks that includes the new block's ID" > > > https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > , > or unsubscribe > > . >

johnterickson commented 4 years ago

multi-part uploads for a blob, upload bytes as they come hashing along the way, and then when complete and valid one writer would 'commit' the blob making it visible.

@EricBurnett This is exactly what my service (https://azure.microsoft.com/en-us/services/devops/artifacts/) does 👍

independent uploaders uploading independent chunks of the same file would be tricky, as the server couldn't validate anything about them till all chunks were seen, and it might have seen multiple different candidates for any given byte range

This works in our service as Azure Blob lets you name each of the chunks - so we name each chunk by the hash of the chunk. The overall flow:

Client breaks file into 2MB blocks and hashes each block.
Client takes a hash of the block hashes - this is the content id.
Client tells server "please add a reference to this content id". If the service says "yes" then it's done!
Otherwise, for each block: a. The client does a parallel POST of the block to /[content id]/[hash of block]. b. The server validates that [hash of block] from the URL matches the hash of the body and then writes the block with name [hash of block] to Azure Storage to the blob with the name [content id]. c. Concurrent uploads from multiple clients can safely race because the server is always verifying the content of each block and thus a block of a given name (it's hash) will - at worst - be overwritten with the same content.
The client "seals" the blob by doing a PUT of the list of block hashes to /[content id]. The server validates that the hash of the block hashes in the body equals [content id]. If so, it seals the Azure Blob with that list of blocks. Again safe to race on this because everyone is performing the same operation

mostynb commented 4 years ago

Here's an initial proposal, which makes it possible to address blobs by their compressed digests, so read/write offsets still work: https://docs.google.com/document/d/1diqigT1gJyIkiVjvayvCU2GDbJq3nvDZFEfO7EZPPwQ/edit?usp=sharing

EricBurnett commented 4 years ago

Thanks Mostyn! I'm quite happy with how the proposal is shaping up, personally - does anyone else have comments on Mostyn's current proposal? (Note that it has been materially revised since first linked 2w ago).

I'll note the current proposal does allow offset reads still (a request of @erikma ), but otherwise doesn't specifically consider block algorithms, and doesn't do anything specific around block uploads or concurrent partial uploads (as @johnterickson floated). I'm personally fine with this - it's consistent with the existing upload API, and (as I noted above) my suggested approach to block uploads would be to upload independent chunks as independent blobs and provide an additional API to stitch them. In that sense I think chunk uploads are out of scope for this Compression discussion, except to the extent anyone feels this proposal precludes features they want to propose.

peterebden commented 4 years ago

I think the proposal sounds good in regard to integration with the bytestream APIs, and I like the idea overall and think it can be useful.

I'm a little concerned about the interaction with other CAS RPCs which would presumably require the server to decompress before serving them (although presumably this is the same for a client that requests a blob via the uncompressed bytestream paths, or a compressed one that isn't actually stored compressed). In general (as I noted on the doc) it gives the decision-making power to the client, but the server has better info to make that decision.

I also have a bit of a concern on focusing just on bytestream; our artifacts consist mostly of Go objects (which are nearly all compressible and mostly small enough to be batched) or Python and Java (which are often incompressible and large enough to require streaming). I worry a bit that we'd not get a lot of benefit given that profile. Maybe others are in a different situation though?

EricBurnett commented 4 years ago

Thanks for reviewing Peter.

it gives the decision-making power to the client, but the server has better info to make that decision.

I hope we can distill the decision making to roughly "prefer compressed", making that a no-op. E.g. if the server is going to store data internally compressed anyways, it's a choice between "server decompresses and then sends" vs "server sends, client decompresses", where the latter is ~always better. Mostyn and I also discussed incompressible data, but we felt it was sufficient to use a faster compressor (e.g. one of the zstd negative levels) and/or a framing-only "compression" (e.g. gzip level 0 if you support gzip, which just wraps the uncompressed data in gzip headers) rather than to try to push knowledge of compressibility into the client.

I also have a bit of a concern on focusing just on bytestream

That's a fair point, if your use-case is biased more towards smaller batch reads than bytestream reads, you won't benefit as much right away. We're definitely not closing the door to adding a compressed version of the batch APIs, simply not starting there. Could I ask you to gather stats on how many bytes you move by ByteStream vs batch? I'll be interested to know how much of a concern it is for you in practice. On my server it's ~14:1 for reads, but you may have materially different numbers.

peterebden commented 4 years ago

Sure, I can have a look at that.

And good point on the fast or framing-only compression levels, that makes sense. Thanks.

mostynb commented 4 years ago

In general (as I noted on the doc) it gives the decision-making power to the client, but the server has better info to make that decision.

IMO the client has more context to decide, eg:

Action/rule type.
Filename/extension.
Local configuration, such as marking particular actions or files as "do not compress".
(For uploads) uncompressed data, + compressed data (if willing to compress before sending).

Whereas cache servers have at most compressed and uncompressed data sizes.

I also have a bit of a concern on focusing just on bytestream

That's a fair point, if your use-case is biased more towards smaller batch reads than bytestream reads, you won't benefit as much right away. We're definitely not closing the door to adding a compressed version of the batch APIs, simply not starting there.

If it looks like it's worth the effort, we could add a new compression_type field to the OutputFile message which refers to the contents field, and an acceptable_compression field to GetActionResultRequest.

peterebden commented 4 years ago

OK so I have a couple of days worth of metrics now and it's approximately 3.5x in favour of streams for reads (more for writes). I don't have a lot of insight into compressibility of them but that definitely supports the idea that streams are indeed more valuable as you say.

Thanks for the discussion - this SGTM then!

EricBurnett commented 4 years ago

Great, thanks for the data Peter!

I don't have a lot of insight into compressibility of them

I don't know what they'll be for you, but on my side I observe compression ratios around 0.4 for writes and 0.333 for reads. If that held ballpark true for your data, and if your streams and batch reads were equally compressible, you'd expect an ideal reduction of about 67% bytes read, and an effective reduction of about 52% with streams alone.

That suggests it's definitely worth starting with streams to get the most impact with the least effort, but that you'll probably still desire to follow on with batch APIs too, as you'd expect an incremental 30% reduction going from streams-only compression to streams-and-batch compression.

bergsieker commented 4 years ago

It sounds like we're generally in agreement on Mostyn's proposal as written. @mostynb, can you turn your doc into a PR?

mostynb commented 4 years ago

Will do.

mostynb commented 3 years ago

This feature was added some time ago now.

bazelbuild / remote-apis

feature request: blob level compression #147