Allow repository_macro to reference local_repository labels

bazelbuild / bazel-gazelle

Gazelle is a Bazel build file generator for Bazel projects. It natively supports Go and protobuf, and it may be extended to support new languages and custom rule sets.

Apache License 2.0

1.15k stars 369 forks source link

Allow repository_macro to reference local_repository labels #704

Open naturalwarren opened 4 years ago

naturalwarren commented 4 years ago

Versions

gazelle: 0.19.1 rules_go: 0.20.3 bazel: 0.29.1 OS: Mac OS Catalina Processor Architecture: x86

Reproduces with the latest versions of all releases.

What did you do?

I use Gazelle in a monorepo where we have a common workspace (/vendor/go) that all other workspaces rely on to import 3rd party dependencies. This is done to keep dependency versions the same across all projects.

/vendor/go contains:

go.mod lists all vendor dependencies used in the monorepo
repositories.bzl generated by running update-repos

Workspaces that use third party dependencies import repositories.bzl and invoke the generated macro:

# Load vendor dependencies.
load("@vendor//:repositories.bzl", "go_repositories")

# gazelle:repository_macro /Users/naturalwarren/monorepo/vendor/go/repositories.bzl%go_repositories
go_repositories()

What did you see?

ERROR: An error occurred during the fetch of repository 'bazel_gazelle_go_repository_config':
   Traceback (most recent call last):
    File "/private/var/tmp/_bazel_warren/a4cf7d94b7ff7afb4d2537dddc045b0e/external/bazel_gazelle/internal/go_repository_config.bzl", line 26
        _find_macro_file_labels(ctx, ctx.attr.config)
    File "/private/var/tmp/_bazel_warren/a4cf7d94b7ff7afb4d2537dddc045b0e/external/bazel_gazelle/internal/go_repository_config.bzl", line 101, in _find_macro_file_labels
        Label(((("@" + label.workspace_name) +...))
Illegal absolute label syntax: @//:/Users/naturalwarren/monorepo/vendor/go/repositories.bzl

Without the directive the visibility of internal targets get generated incorrectly. Upon consuming a top level generated target I see:

ERROR: /private/var/tmp/_bazel_warren/a4cf7d94b7ff7afb4d2537dddc045b0e/external/com_google_cloud_go_storage/BUILD.bazel:3:1: in go_library rule @com_google_cloud_go_storage//:go_default_library: target '@com_google_cloud_go//internal:go_default_library' is not visible from target '@com_google_cloud_go_storage//:go_default_library'. Check the visibility declaration of the former target if you think the dependency is legitimate

Can repository_macro support referencing Bazel labels?

jayconrod commented 4 years ago

Marking as a bug because there should be a better error message here.

I don't think Gazelle can support this though. We can't load files from external workspaces without some assurance that the repository rule for that workspace has been executed and is up-to-date. Some variant of bazel query or bazel fetch would be the best way to do that, but Gazelle does not invoke Bazel commands. If it did, it wouldn't be safe to use Gazelle in repository rules (go_repository).

It might be possible to add special cases to make this work specifically for local_repository, but that seems too specialized to me. In general, local_repository has a lot of problems that I don't have bandwidth to support (overlapping workspaces, symbolic link cycles, and so on).

naturalwarren commented 4 years ago

Thanks @jayconrod.

I don't think Gazelle can support this though. We can't load files from external workspaces without some assurance that the repository rule for that workspace has been executed and is up-to-date.

I'm not sure I follow here. The repository rule definition is provided by the calling WORKSPACE. When the repositories.bzl#go_repositories is invoked it is using rule references from the caller.

To expand on our setup a little more, we code generate the WORKSPACE of the repository that performed the generation of repositories.bzl and we code generation the WORKSPACE of the caller. We have the guarantee that the rule sets are identical. I'm not sure this matters though because in both our setup and a "standard setup" someone could forget to run update-repos and regenerate the repositories.bzl after bumping their rulesets.

It might be possible to add special cases to make this work specifically for local_repository, but that seems too specialized to me. In general, local_repository has a lot of problems that I don't have bandwidth to support (overlapping workspaces, symbolic link cycles, and so on).

Makes sense that you don't want to open up the local_repository can of worms. Would it be possible (and valid) to allow an arbitrary file path to be passed to point to a macro file that is not within the workspace?

jayconrod commented 4 years ago

I'm not sure I follow here. The repository rule definition is provided by the calling WORKSPACE. When the repositories.bzl#go_repositories is invoked it is using rule references from the caller.

Gazelle doesn't evaluate WORKSPACE though. Gazelle can be run as a separate tool, so we can't even assume Bazel evaluated WORKSPACE before invoking Gazelle. And # gazelle:repository_macro could refer to a function that isn't actually called from WORKSPACE or is only called under certain conditions. There are too many weird cases to guard against, so I don't want to make assumptions.

To expand on our setup a little more, we code generate the WORKSPACE of the repository that performed the generation of repositories.bzl and we code generation the WORKSPACE of the caller. We have the guarantee that the rule sets are identical. I'm not sure this matters though because in both our setup and a "standard setup" someone could forget to run update-repos and regenerate the repositories.bzl after bumping their rulesets.

Are all the workspaces in the same monorepo? Does anything outside the monorepo depend on workspaces inside the monorepo? Why have separate workspaces instead of one big one that matches the monorepo?

Makes sense that you don't want to open up the local_repository can of worms. Would it be possible (and valid) to allow an arbitrary file path to be passed to point to a macro file that is not within the workspace?

I'd rather not support that. If I understand correctly, you have a monorepo with multiple workspaces (not in the root directory), so this would be fine for you, but I think most people would use this to refer to local files not present on other dev machines. I'd rather encourage workspaces to be more hermetic.

naturalwarren commented 4 years ago

Gazelle doesn't evaluate WORKSPACE though. Gazelle can be run as a separate tool, so we can't even assume Bazel evaluated WORKSPACE before invoking Gazelle. And # gazelle:repository_macro could refer to a function that isn't actually called from WORKSPACE or is only called under certain conditions. There are too many weird cases to guard against, so I don't want to make assumptions.

Fair enough.

Are all the workspaces in the same monorepo?

Yes they are.

Does anything outside the monorepo depend on workspaces inside the monorepo?

No, you have to be in the monorepo to depend on other projects inside the monorepo.

Why have separate workspaces instead of one big one that matches the monorepo?

Our repository is large enough that loading all the rulesets and attempting to bring the whole monorepo into an IDE on a laptop computer is not possible.

I'd rather not support that. If I understand correctly, you have a monorepo with multiple workspaces (not in the root directory), so this would be fine for you, but I think most people would use this to refer to local files not present on other dev machines. I'd rather encourage workspaces to be more hermetic.

My understanding was the mechanism that facilitates hermicity is the sandbox. The file would be copied into the sandbox and a checksum would be computed to determine if the input ever changed.

Zooming out, we support all vendor dependencies across languages in the same fashion as google/bazel-common. With Gazelle not being able to reference macros from another workspace, it can't be used within this model of dependency management.

jayconrod commented 4 years ago

Our repository is large enough that loading all the rulesets and attempting to bring the whole monorepo into an IDE on a laptop computer is not possible.

I'm guessing this means the monorepo too large to check out onto disk and load everything into RAM? I can't speak for IDEs, but Bazel itself should scale well in any size repository, as long as it doesn't have to load everything. Google has a monorepo that is nowhere close to fitting on a workstation, and Bazel (Blaze) handles it fine as one workspace (files are typically accessed through a custom file system). blaze build //... obviously doesn't work, but building any individually interesting set of targets works well.

My understanding was the mechanism that facilitates hermicity is the sandbox. The file would be copied into the sandbox and a checksum would be computed to determine if the input ever changed.

Sandboxing is important for hermeticity, but it's not everything. Repository rules aren't sandboxed at all, and they can access files and execute commands on the host system, so it's the responsibility of rule authors (and users to some degree) to make sure they're written and used in a hermetic way.

My concern with letting # gazelle:repository_macro point outside the workspace is that people will use it to point outside the repository. That means a build will work for one person but not someone else who doesn't have the right file in the right place. Restricting it to be a label is a little more straightforward, but it still seems like there are complicated failure modes.

naturalwarren commented 4 years ago

Thanks @jayconrod. Added some more details about our setup below.

I'm guessing this means the monorepo too large to check out onto disk and load everything into RAM? I can't speak for IDEs, but Bazel itself should scale well in any size repository, as long as it doesn't have to load everything.

Yes, it's too big to be loaded into RAM.

Google has a monorepo that is nowhere close to fitting on a workstation, and Bazel (Blaze) handles it fine as one workspace (files are typically accessed through a custom file system)

We don't have a custom file system at this time and the whole repository can be loaded onto a hard drive. A "project" has a 1-to-1 relationship with a WORKSPACE. We have some tooling that generates the WORKSPACE for the project based on the kind of language the project is developed in. You can add capabilities (ie protobuf, gRPC) to your project and that determines what additional rulesets get added onto a base set (ie go_library, go_test).

My concern with letting # gazelle:repository_macro point outside the workspace is that people will use it to point outside the repository. That means a build will work for one person but not someone else who doesn't have the right file in the right place. Restricting it to be a label is a little more straightforward, but it still seems like there are complicated failure modes.

Makes sense. Even with restricting directives to a label (as opposed to a file path) folks could still point to a WORKSPACE outside the repository.

Boiling down the conversation here it looks like there's a trade off to be made between extensibility (being able to support arbitrary WORKSPACE setups) vs. coercing users into what we know is a safe use of the library 100% of the time. Making the assumption that library consumers are only going to point # gazelle:repository_macro to a file checked into a VCS system sounds very reasonable to me.

jayconrod commented 4 years ago

Boiling down the conversation here it looks like there's a trade off to be made between extensibility (being able to support arbitrary WORKSPACE setups) vs. coercing users into what we know is a safe use of the library 100% of the time.

That about sums it up. Though I would probably say "encouraging" instead of "coercing". :)

Making the assumption that library consumers are only going to point # gazelle:repository_macro to a file checked into a VCS system sounds very reasonable to me.

I think I'd accept a PR for this as long as the file is restricted to being a label (no arbitrary absolute or relative paths). update-repos -to_macro can't support this, since we should never write file in another workspace. If the file can't be read, anything that expects it to be there should fail with an error explaining the repo needs to be fetched first. Note that repo.FindExternalRepo can be used to locate workspace directories, but it will only work after those workspaces are evaluated, even for local_repository.

dragonsinth commented 3 years ago

@jayconrod I had a question about some of your comments in this thread--

It seems like gazelle actually runs in a few very different modes...

fix-update (normally manual)
update-repos (normally manual)
go_repository rule invocation

For that third one, I completely understand the idea of not invoking bazel or bazel query, since you're in the context of a bazel run already. But for the first two, is it that hard-and-fast of a requirement? (technically, it seems like the third case could even be a separate command binary if you wanted)

jayconrod commented 3 years ago

Yes, please consider that a firm design requirement.

In the early days, it was a technical requirement: bazel run held a lock on the output directory, so recursive invocations of Bazel would deadlock. That's no longer the case. However, I still think it's an important constraint.

If Gazelle interprets Starlark or invokes Bazel, it will be much more difficult to understand what's it's doing. There are cases where Gazelle could be a bit smarter, but I don't think that would deliver a better user experience, and it would definitely be harder to maintain. There are already some cases in dependency resolution where I think it would be better to fail quickly instead of going out to the network to look something up. That would make it easier for people to model what Gazelle's doing and make sure the right information is available.