Vary unused_inputs_list behavior based on which action produces the file

DavidANeil commented 2 years ago

Status Quo

unused_inputs_list allows inputs to be trimmed after an action executes, so that the same action is not re-executed when only the listed unused inputs change. In some cases it is possible to determine this list of unused inputs before running the action; this is known as "input discovery" or "input pruning". While many rulesets could take advantage of this, it is not exposed to Starlark rules. The builtin C++ rules do take advantage of it, and there is discussion about allowing a C++-specific version of input pruning for the Starlark version of the rules (see #13871).

Description of the Feature Request

As described by @bjacklyn in https://github.com/bazelbuild/bazel/pull/13871#issuecomment-948051404, and discussed in the Q&A of the BazelCon21 stream: it seems that unused_inputs_list could be extended in a meaningful way to give all Starlark rules access to input discovery. The behavior would depend on where the File referenced by the unused_inputs_list attribute appears in the action:

If the File is also listed in the outputs of that action, then the current behavior is maintained: the inputs are trimmed after the action executes.

If the File is listed under inputs, then the inputs are trimmed before the action is scheduled for execution, which includes leaving them out of the lookup key for the action cache. If the action does execute, then the listed unused inputs will not be included in the sandbox.

If the File is listed under neither inputs nor outputs, then the build should fail.
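
The three cases above can be modelled in a few lines. This is a minimal Python sketch of the proposed semantics only, not Bazel code: the names TrimMode, trim_mode and effective_inputs are hypothetical, files are represented as plain path strings, and whether the unused_inputs_list file itself stays in the trimmed input set is left open by the proposal (here it stays unless it lists itself).

```python
from enum import Enum


class TrimMode(Enum):
    AFTER_EXECUTION = "after_execution"    # current behavior
    BEFORE_EXECUTION = "before_execution"  # proposed input discovery


def trim_mode(inputs, outputs, unused_inputs_list):
    """Decide when an action's inputs would be trimmed under the proposal."""
    if unused_inputs_list is None:
        return None  # the action does not use the feature at all
    if unused_inputs_list in outputs:
        return TrimMode.AFTER_EXECUTION
    if unused_inputs_list in inputs:
        return TrimMode.BEFORE_EXECUTION
    # Listed under neither inputs nor outputs: the build should fail.
    raise ValueError("unused_inputs_list must be an input or an output")


def effective_inputs(inputs, unused_paths):
    """Inputs that would feed the cache key and the sandbox after early trimming."""
    unused = set(unused_paths)
    return [f for f in inputs if f not in unused]
```

For example, trim_mode(["a.h", "unused.txt"], ["a.o"], "unused.txt") selects early trimming, and effective_inputs(["a.h", "b.h"], ["b.h"]) leaves only "a.h" for the cache key and sandbox.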

Example Usage

ctx.actions.run(
    inputs = source_inputs,
    outputs = [unused_inputs_file],
    arguments = [discover_args],
    executable = ctx.executable.input_discoverer,
    unused_inputs_list = unused_inputs_file,
    mnemonic = "DiscoverInputs",
)

ctx.actions.run(
    inputs = depset([unused_inputs_file], transitive = [source_inputs]),
    outputs = outputs,
    arguments = [args],
    executable = ctx.executable.compiler,
    unused_inputs_list = unused_inputs_file,
    mnemonic = "Compile",
)

In this example, both actions use unused_inputs_list. The action that produces it uses it to trim its inputs after execution. The Compile action uses it to trim its inputs before execution, as a form of input discovery.
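
To make the DiscoverInputs step more concrete, here is a hedged Python sketch of what the core of such a discoverer tool might do. The function write_unused_inputs is hypothetical, how the tool decides which inputs are "used" is deliberately left out, and the one-path-per-line file format is an assumption about what unused_inputs_list expects.

```python
def write_unused_inputs(all_inputs, used_inputs, out_path):
    """Write every input that is not in used_inputs to out_path, one per line."""
    used = set(used_inputs)
    # Preserve the order of all_inputs so the output is deterministic.
    unused = [path for path in all_inputs if path not in used]
    with open(out_path, "w") as f:
        for path in unused:
            f.write(path + "\n")
    return unused
```

A deterministic output matters here, because under the proposal the file becomes an input of the Compile action and would otherwise cause spurious re-execution.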

mdempsky commented 1 year ago

What's the likelihood of this being implemented in the nearish future?

The Go compiler would like to be able to rely on it to address build scalability issues.

DavidANeil commented 1 year ago

I spent a few hours poking around trying to get it to work, but my unfamiliarity with Java and the Bazel architecture made my experiment unsuccessful. I'd still love to see this feature. I think it wouldn't be all too difficult to implement, and could vastly improve build times and cache hit rates in some common cases.

mdempsky commented 1 year ago

Actually, thinking about this some more, I think it makes sense to have an orthogonal feature for early pruning of the inputs list, if at all.

Suppose dependencies X->Y->Z, where a priori the build system can't tell whether X depends on outputs from Z.

If, when compiling Y, we're able to determine that Z will never be needed by X, that's useful information: X's cache key doesn't need to include Z, and Z doesn't even need to be available to X's compile action (e.g., to reduce network traffic in the case of remote execution).

But separately there's the possibility that users of Y may need Z, yet X is just a particular target that doesn't. It's still useful in this case to know that changes to Z are irrelevant to X, to improve incremental rebuild times.

So they're complementary, not competing, features.
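
The cache-key effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: action_cache_key is a hypothetical function, not Bazel's actual action-cache key computation, and input digests are passed in as a plain dict from path to digest string.

```python
import hashlib


def action_cache_key(mnemonic, input_digests, unused=()):
    """Hash the mnemonic plus the digests of the inputs that survive pruning."""
    h = hashlib.sha256()
    h.update(mnemonic.encode())
    for path in sorted(input_digests):
        if path in unused:
            continue  # pruned inputs do not contribute to the key
        h.update(b"\0" + path.encode() + b"\0" + input_digests[path].encode())
    return h.hexdigest()


# X compiles against Y and Z, but Z is known to be unused by X.
before = action_cache_key("Compile", {"Y": "d1", "Z": "d2"}, unused=("Z",))
after = action_cache_key("Compile", {"Y": "d1", "Z": "d3"}, unused=("Z",))
unchanged = before == after  # a change to Z leaves X's key the same
```

Without the unused argument, the same change to Z's digest would produce a different key and force X to be rebuilt.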

comius commented 1 year ago

cc @lberki, could use some guidance on whether these ideas are viable / should be triaged to P3

lberki commented 1 year ago

My line in the sand is that if we implement this feature, it should be possible to implement C++ include scanning behind it because I don't want to support two independent mechanisms for input discovery indefinitely.

This has a number of implications:

It's probably not feasible to relegate the creation of an "input unused_inputs_list" to a separate action, because it would mean that every C++ compilation action would come with a separate input discovery action, which would very probably be an unacceptable amount of memory overhead (I'd be happy to be proven wrong on this one, though).
The performance requirements are pretty tough. In particular, a fork/exec per input discovery is probably too expensive.
It should be thought through whether there are any issues with metadata handling: C++ input discovery is very special in that when Bazel reads a file, its metadata is not necessarily in InputMetadataProvider yet; it's a wart, yes, but it's proven pretty tough to fix without incurring a performance hit.
