bazelbuild / remote-apis

An API for caching and execution of actions on a remote system.
Apache License 2.0
332 stars 117 forks source link

V3 idea: No longer allow Digest.size_bytes <= 0 #134

Open EdSchouten opened 4 years ago

EdSchouten commented 4 years ago

Right now it is allowed to create Digest messages that have size_bytes == 0, referring to the empty blob. In #131 we're extending the protocol to require that the empty blob is always present, because it can be derived trivially. I personally find this a bit problematic:

I would like to suggest that we simply deny the existence of Digest messages with size_bytes <= 0. Any field where the empty blob needs to be referenced, we should use null. This means that the optimization that Bazel performs of not loading the empty blob becomes the norm, as there is no longer any way to even address the empty blob.

juergbi commented 4 years ago
* For example, what is FindMissingBlobs() on `{hash: "e984d2bdd07318c4e29f7a2ceea4a9e4569e2d8e695a953a4e2df6f69fbdec95", size_bytes: 0}` supposed to do? Report existence, because it has size zero? Or should it report absence, because the empty blob actually has SHA-256 sum `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`?

As it's not the correct empty blob, it should report absence.

* It makes the protocol less regular and consistent.

I'm not sure either optimization variant is really better with regards to regularity/consistency. Also with your suggested approach, special cases have to be defined. E.g., what happens if a client attempts to write a blob with a Digest with size_bytes == 0? I'd guess something like INVALID_ARGUMENT.

However, as this approach is overall more efficient and the implementation complexity doesn't seem higher to an optimized v2 implementation, I think it's a sensible plan.

There may be code in some clients and/or servers that currently associates a different meaning to null Digests in some contexts (e.g., Digest not yet calculated). However, I don't think this should be a reason against this idea. Assuming there is no ambiguity anywhere in the protocol itself.

EdSchouten commented 4 years ago

E.g., what happens if a client attempts to write a blob with a Digest with size_bytes == 0? I'd guess something like INVALID_ARGUMENT.

Note that implementations likely already have tests like these for the size_bytes < 0 case. They now only need to be updated to size_bytes <= 0.

EricBurnett commented 4 years ago

One thing to watch out for is type confusion between "unpopulated proto" and "legitimately empty digest" that can cause trouble and make validation more difficult. (See e.g. https://github.com/bazelbuild/remote-apis/issues/6 for the reverse of this, where it's desired that the unpopulated ActionResult protos never be interpreted as a valid message).

If we go down the road of making the empty blob not be a "normal" blob (no digest), would it make sense to have a separate field like empty that must be specified instead? I don't think that makes the special-casing any worse than checking hash==nil && size==0, and it only requires storing 1 extra byte or so per "empty" Digest proto.

erikmav commented 4 years ago

FYI this proposal conflicts with the proposed dynamic execution and two-phase caching in #156 (PR #155) where an empty digest is intended to mean a filesystem entry presented in the worker's filesystem view but not predicted as being needed by the action. See the commentary on the File and Directory message doc comments in the PR.