bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.98k stars 4.03k forks source link

Rule finalizers #23160

Open tetromino opened 1 month ago

tetromino commented 1 month ago

We wish to add rule finalizers - special symbolic macros (see #19922) which - regardless of lexical position in a BUILD file - are evaluated in the final stage of loading a package, after all non-finalizer targets have been defined. Unlike ordinary symbolic macros, a finalizer can call native.existing_rules(), where it behaves slightly differently than in legacy macros: it only returns the set of non-finalizer rule targets. The finalizer may assert on the state of that set or define new targets.

Evaluating a rule finalizer necessarily requires loading the entire package (except for any other finalizers), preventing lazy symbolic macro evaluation. To maximize the benefit of lazy evaluation, we will need a flag to control whether to force finalizers to be evaluated.

Motivation

Many teams at Google have project- or framework-specific "epilogue macros", which must be invoked at the end of a BUILD file, and which use native.existing_rules() to do some of the following:

I suspect that the same holds in other sufficiently large Bazel (mono)repos. Even rule set repos sometimes declare epilogue macros - for example, flags.FLAGS() in rules_android.

Use of epilogue macros has the following drawbacks:

Concrete proposal

Consider a theoretical finalizer:

# .bzl file

def _my_finalizer_impl(*, name, attr1, **kwargs):
    manual_tests = []
    for rule in native.existing_rules().values():
        if rule["kind"] == "foo_library" and not rule.get("testonly", False):
            # All new targets must be namespaced under the finalizer's name prefix
            test_name = "%s_%s_manual_conformance_test" % (name, rule["name"])
            manual_tests.append(test_name)
            _conformance_test(
                name = test_name,
                deps = [":" + rule["name"]],
                tags = ["manual"],
                attr1 = attr1,
                ...
            )

    # Work around inability to select manual tests into a test suite
    test_suite(
        name = name + "_manual_conformance_tests",
        tests = manual_tests,
    )

my_finalizer = macro(
    rule_finalizer = True,
    doc = "...",
    implementation = _my_finalizer_impl,
    attrs = {
        "attr1": ...,
    },
)
# BUILD file

load("//x/y:my_finalizer.bzl", "my_finalizer")

# Can be in any position in the BUILD file
my_finalizer(
    name = "my_finalizer",
    attr1 = ...,
}

The native.existing_rules() call in the context of a finalizer will always return a deeply immutable map which contains all non-finalizer-defined rule targets, and does not contain any targets declared in any finalizer (including the currently running one!), ensuring that (a) we can cache the existing_rules map and reuse it for multiple finalizers and (b) finalizer order does not matter.

Implementation in Bazel

A finalizer is a special kind of macro, which can be implemented as a subclass of MacroFunction or as a MacroFunction with a special flag. In particular, lazy/optional evaluation of finalizers (when allowed by a Bazel flag) is taken care of by lazy evaluation of macros.

Package.Builder already has some support for multi-stage building (see buildPartial()); this proposal would add a finalizer stage, the first step of which (if any finalizers are found in the package) would be to save a snapshot of the existing_rules snapshottable bimap.

Interaction with symbolic macros

Finalizers may not be called in a symbolic macro; but finalizers may call symbolic macros and other finalizers (and, of course, legacy macros).

The existing_rules map accessible to a finalizer contains targets regardless of their visibility or which macro they were declared in (note that symbolic macros may declare a target which is not visible to the package where it lives) - a finalizer might assert on a rule target's attributes rather than use it as a dependency, and hiding targets from the finalizer when they are visible to a genquery would be an odd choice.

Lazy evaluation

To evaluate a finalizer, Bazel must load the complete set of the package's non-finalizer targets - we don't want the finalizer's behavior to vary unpredictably depending on which part of the package was evaluated. In this, a finalizer is no different from an epilogue macro. But the advantage of the finalizer is that it's self-contained, intentional, isolated from other finalizers, and there is a way to tell that we don't need to evaluate it:

if we don't need need any targets declared under the finalizer's namespace if a user doesn't care about running finalizer checks (as expressed by a Bazel flag)

In situations where a repo's policies need to be enforced - for example, in CI/CD - Bazel would need to run with a flag that forces evaluation of all macros and finalizers in the package.

By contrast, in situations where users need low latency of loading - for example, in an IDE when opening a BUILD file - one would certainly want to make evaluation lazy (and therefore not evaluate finalizers) unless a target under the finalizer's namespace was requested.

matts1 commented 1 month ago

This seems like a great idea, but more commonly than not, these "finalizers" don't actually run at package level granularity. They run on whole trees of files.

For example, I've previously worked on a project where we required every build file to have a conformance_checks() rule at the bottom of a file. The problem was, this was kind of a pain to put at the bottom of every build file - It was very easily forgotten, and very frequently would result in having to re-upload several minutes later after your presubmits failed.

I'd propose that finalizers may optionally be transitive. That is, if //foo:my_finalizer is a transitive finalizer, native.existing_rules(transitive=True) would return //foo/bar/baz:* in addition to //foo:*. This means that you could have a single finalizer for your whole project, instead of having to specify it in every single build file.

The only concern I have with such an approach would be the potential longer loading phase time increase when loading top-level build files, but I think that in practice, most of the time when you load top-level build files, they depend on the whole tree under them anyway.

Also, side question: How willl finalizers interact with export_files and non-exported files?

tetromino commented 1 month ago

@matts1 - the idea is to run finalizers in the loading phase. Loading phase means each package is in a separate Starlark (and Java) thread, and the packages cannot see each others' content. See https://bazel.build/extending/concepts#evaluation-model

But what we could do instead is register default finalizers.

This default finalizer registration a part of my proposal internally, but we decided to shelve it for now because of worries about performance.

matts1 commented 1 month ago

That sounds like an interesting idea.

What were the specific performance concerns? Was it hundreds of matchers which you have to match every package against being a problem in a large monorepo?

I would have imagined performance concerns could be mitigated with sub-matchers, but maybe I'm misunderstanding the problem

tetromino commented 1 month ago

@matts1 - the performance problem is that in a monorepo, a change to REPO.bazel or WORKSPACE invalidates all loading/analysis caches company-wide, including the CI/CD systems :)

For the non-monorepo case, default registration could probably be useful - but were weren't sure how useful, how many people would be interested.

matts1 commented 1 month ago

Could that potentially be solved by moving the finalizers outside of REPO.bazel, with REPO.bazel only referencing them? This seems like a nice solution, but I don't know how feasible interleaving the loading and analysis phase would be, which might be required for this. That being said, my design doc regular rules in module extensions was approved, and does require solving that problem.

# REPO.bazel
register_default_finalizers("//:default_finalizers")
# //BUILD
default_finalizer_group(
    name = "default_finalizers",
    finalizers = [
        "//foo/bar:default_finalizer",
    ]
)
# //foo/bar/BUILD
finalizer_rule(
    name = "default_finalizer",
    default_finalizer = True,
    ...
)
tetromino commented 1 month ago

@matts1 - if package //foo can theoretically get a default finalizer registered in by a specification defined in \<bar>, then a change to bar must invalidate the loading and analysis cache for //foo. It does not matter how \<bar> is structured - whether it's a single file or a group of files, whether it's REPO.bazel or some .bzl file referenced by REPO.bazel, or whatever other mechanism. The cache will be invalidated. And if all packages in a repo can theoretically get a default finalizer registered in them by a specification defined in \<bar>, then any change to bar will invalidate all caches for that repo.

The solution for big monorepos might be along the lines of sharding the repo into components or projects which have independent config files. It will take some designing.

matts1 commented 1 month ago

Ah, didn't realize that. Adding to your sharding idea, we could do something like the following, where we only allow a finalizer to come from parent packages:

# REPO.bazel
register_default_finalizers("//:default_finalizers")
# //BUILD
default_finalizer_group(
    name = "default_finalizers",
    finalizers = [
        "//foo:default_finalizers",
        # We *may* want to disallow this, to ensure that only changes to the direct parent c
    ]
)
# //foo/BUILD
default_finalizer_group(
    name = "default_finalizers",
    finalizers = [
        "//foo/bar:default_finalizers",
       # This would be disallowed because //baz is not a subpackage of //foo.
        "//baz:default_finalizers",
    ]
)
# //foo/bar/BUILD
finalizer_rule(
    name = "default_finalizer",
    default_finalizer = True,
    ...
)

Though thinking about it now, this seems very similar to my original proposal, with the only real distinction being that if you don't configure finalizers, the parent BUILD files don't invalidate the child ones.

matts1 commented 1 month ago

I might be misunderstanding how things work, but if, for example, you had a global configuration file:

register_default_finalizer("//foo/bar/...", "foobar_finalizer")
register_default_finalizer("//foo/...", "foo_finalizer")

Then it could evaluate to:

{
   SkyKey("all_finalizers"): SkyValue(["//foo", "//foobar"])
   SkyKey("default_finalizer://foo/bar"): SkyValue("foobar_finalizer"),
   SkyKey("default_finalizer://foo"): SkyValue("foo_finalizer")
}

Then the SkyFunction for the package "//foo/bar/baz" would take in the SkyValues for the default finalizers for ["//", "//foo", "//foo/bar", //foo/bar/baz"]. This seems like it'd prevent a cache invalidation without needing to shard, since the SkyValues for those things wouldn't change unless they'd actually been changed in the config file, instead of if they could have theoretically been changed.

That being said, this might need a bit of a rework to skyframe itself to allow you to map SkyKey(default_finalizer://unknown_key) to SkyValue(None) by checking if it's contained within "all_finalizers". I don't have enough experience with skyframe to know whether this proposal is viable or not.