bazelbuild / remote-apis

An API for caching and execution of actions on a remote system.
Apache License 2.0
332 stars 117 forks source link

V3 idea: Let Digest.hash use 'bytes' instead of 'string' #135

Open EdSchouten opened 4 years ago

EdSchouten commented 4 years ago

Right now the Digest message uses a string to store the checksum of the object. This is wasteful, because for the hashing algorithms that we use, the output is always a sequence of bytes. Using a bytes field cuts down the storage space by a half (for that field; not necessarily for the full object).

EricBurnett commented 4 years ago

Hi Ed! Thanks for filing this.

We've discussed this a couple times within Google in 2017/2018, so copying out the conclusions we came to from one of those evaluations as a starting point. I think we may also have discussed it in the community in the old BuildFarm group as well, but I couldn't find a handy reference for that discussion. Regardless, what we came up with then was:

Arguments in favor of representing hashes as bytes when sent over the network:

Arguments against:

At the time, the deciding factor seemed to come down to whether the difference in bandwidth was worth the effort of changing and the increase in complexity in dealing with two formats. (Noting that for backend storage you can use whatever you want, so it's only the API that strictly must be defined one way or another). This could be argued on the grounds that either that the cost of the bandwidth is material, or that builds are measurably slowed (wallclock) because of the increase in time required to move the bytes around.

Ultimately, we didn't find this to be a particular bottleneck, and so the value of changing it seemed insufficiently high to justify a change. Of course, time has passed, Bazel Without the Bytes is now live and used in more contexts (and other clients have similar download-avoiding optimizations), etc, so - if you (or anyone else) wants to do a fresh analysis to argue that it's now sufficiently material to warrant the change, please feel free! But understand that a data-driven argument is going to be required here, as the effort in revisiting this (and implementing the change if that's what we come to) is significant.

Cheers, Eric

erikmav commented 4 years ago

Microsoft AnyBuild team is in favor of this change, our caching implementation internally uses only bytes so strings are an impedance mismatch except for text logging. Plus for equality comparisons a case-insensitive string comparison is orders of magnitude slower (more CPU expensive) than an optimized byte comparison implementation.

ulfjack commented 4 years ago

I'm also in favor of this change. We're using Digest as a key in a few in-memory data structures, and the 2x has shown up as a significant contributor to memory consumption for us. We're also using Digest in some cases for machine-to-machine communication, so we currently have to convert back and forth with two separate classes for digests. On top of that, hex digests also require an additional consistency check: Are all characters valid hex digits?

I think the memory consumption / conversion cost may also apply to Bazel - IIRC, it stores digests as binary to avoid the 2x memory blowup, but then it has to convert every time it talks to the network. The alternative would seem to be preferable. Though note that this isn't the biggest problem in Bazel right now - there are still a bunch of things that can be improved before this would even be on the radar.