REv3: Reduce asymmetry between O(n) output files and O(1) output directories

EdSchouten commented 10 months ago

I've noticed that people sometimes craft rules (e.g., packaging rules) that have many, many, many output files. This is problematic, for the reason that it makes ActionResult very large. So large that it can't be sent back to the client. ActionResults are not stored in the CAS. This means that clients can't download them in a streaming manner. They are returned as part of the gRPC response, which can generally only be up to 4 MB in size. To prevent Buildbarn from generating ActionResults that are this big, I generally tell my users to configure their clusters to only allow processing of Command messages up to 1-2 MB. As the size of ActionResult is generally proportional to that of Command (1-2x as big), that tends to work.

A solution I often give to my users when they hit these limits is that they should use output directories instead of plain output files. In that case ActionResult remains small. It will only contain a small number of output_directories containing references to Tree objects. These can be streamed from the CAS. There are also a couple of advantages on top of that:

If all paths share long pathname prefixes, Tree objects can become more compact than listing all pathnames explicitly. So a net reduction in network traffic.
In case a repeated invocation of an action yields the same output files, you end up with two large and mostly identical ActionResults. When using output directories, both ActionResults will share the same Tree object. This means that if a client is somewhat smart about caching results, incremental builds consume less traffic.

Though I fully understand where the asymmetry comes from, I do think it's hard to sell to our users. Why is there a difference? From their perspective it's 'tomato tomato'.

I think that as part of REv3 we should investigate whether we can reduce the noticeable differences between using O(n) output files and O(1) output directories. For example, what if every action returns exactly 1 directory hierarchy of outputs, and Command's output_paths merely acts as a filter for what needs to be captured as part of that directory hierarchy?

(Relatedly, is there anything we can do to reduce the size of Command's output_paths by preventing repetition of leading pathnames?)

sluongng commented 10 months ago

By reducing the output_paths down to something like output_tree_digest, we would be able to achieve a less variable size ActionResult? But the tradeoffs seems to be that the client will have to make subsequent RPC calls to get the content.

Combining with the discussion in https://github.com/bazelbuild/remote-apis/pull/258, its screaming out to me that we need a separation between the actual cache content (ActionResult blob that is kept on the server side) vs the presentation layer that the client would download. So some clients could opt to get an output tree digest that would make the result relatively smaller, vs having all the paths laid out in the response. Or perhaps some clients would only be interested in directories.

(Relatedly, is there anything we can do to reduce the size of Command's output_paths by preventing repetition of leading pathnames?)

Isn’t this what compression solve out of the box? Or are you concern about the decompressed size?

EdSchouten commented 10 months ago

Hi @sluongng,

By reducing the output_paths down to something like output_tree_digest, we would be able to achieve a less variable size ActionResult? But the tradeoffs seems to be that the client will have to make subsequent RPC calls to get the content.

Note that this issue is written in a very open ended manner. I am not suggesting that we should literally replace all outputs by a single Tree message, or fully decomposed into separate Directory messages. All I'm stating is that we should investigate whether we can eliminate the asymmetry between output files and directories.

One could consider encoding the hierarchy of output files using some kind of adaptive data structure, such as B-trees/B+-trees. Or its Merkle tree variant: Prolly trees. If those are sufficiently small, they can be embedded into ActionResult directly. But if they get too large, they can be carved up into additional pieces. Such an approach would actually end up reducing round trips for actions yielding small/shallow output directories compared to what we have now.

Combining with the discussion in https://github.com/bazelbuild/remote-apis/pull/258, its screaming out to me that we need a separation between the actual cache content (ActionResult blob that is kept on the server side) vs the presentation layer that the client would download. So some clients could opt to get an output tree digest that would make the result relatively smaller, vs having all the paths laid out in the response. Or perhaps some clients would only be interested in directories.

To me it's screaming out that as part of REv3 we should do a better job at designing a more efficient representation, so that there's no need for clients to choose between one or the other. As part of REv2 I think #258 strikes a decent balance between impact and ease of shoehorning it into the existing protocol, but I don't consider it to be ideal.

(Relatedly, is there anything we can do to reduce the size of Command's output_paths by preventing repetition of leading pathnames?)

Isn’t this what compression solve out of the box? Or are you concern about the decompressed size?

Compression doesn't solve this if you ask me. The working group agreed on letting compression essentially be an overlay. There is no such thing as a canonical representation of a compressed object. This means that if we start to rely on enforcing maximum limits based on the compressed size of objects, it becomes non-deterministic whether a build is allowed to take place or not.

Furthermore, Protobuf libraries don't really support parsing Protobuf messages in a streaming manner. This means that even if you were to compress a massive Protobuf message containing many redundant strings into something small, you need to pay the cost at some point decompressing it and storing it in the worker's memory.

With regards to expressing output paths, instead of using the simple repeated string output_paths like we have right now, one could do something like this:

message OutputPathsSelection {
  // If empty, indicate that the current file or directory needs to be captured.
  // If non-empty, don't capture the current file or directory. Instead, capture
  // files or directories at or below the provided filenames.
  map<string, OutputPathsSelection> children = 1;
}

message Command {
  ....
  OutputPathsSelection output_paths = 123;
}

So for example, if you wanted to capture these two paths:

src/foo.d
src/foo.o

You would provide this command:

{
  "outputPaths": { "children": {
    "src": { "children": {
      "foo.d": {},
      "foo.o": {}
    } }
  } }
}

This might look convoluted in JSON, but at the Protobuf wire format level it should be fairly compact. Such a representation has a couple of advantaged over the existing list of repeated strings:

For workers it's easy to combine this approach with file system traversal. One can write a recursive method that traverses both the OutputPathsSelection message and the file system at once.
By only storing filenames instead of full pathnames, we eliminate a place where the protocol currently assumes forward slashes, thereby making it more friendly to implement on systems like Windows. It also means we don't need to explicitly state as many corner cases (e.g., such as output paths not being allowed to contain trailing slashes).

I tested this representation for some random Bazel test that gets built on my cluster. It looks like this test currently has 2201 bytes of output paths (newline delimited, so larger when stored in a repeated string. Using the Protobuf message above, I was able to compress it down to 550 bytes.

bazelbuild / remote-apis

REv3: Reduce asymmetry between O(n) output files and O(1) output directories #281