bazelbuild / bazel

Support (restrictive) load in MODULE.bazel #17880

Closed joca-bt closed 7 months ago

joca-bt commented 1 year ago

Description of the feature request:

Currently, the MODULE.bazel file must contain everything the project needs inline; it cannot use load statements to split the configuration up and make it easier to navigate.

One situation in which it would be helpful to use load is when configuring external package managers, e.g. like with rules_jvm_external. On a Java project it is not unusual to depend on several external jars. Putting that configuration in MODULE.bazel can pollute the file and make it more difficult to maintain.

This was originally opened in https://github.com/bazelbuild/rules_jvm_external/issues/854 and that issue provides further details on a specific example with rules_jvm_external.

Having the ability to use load to load constants or macros from other files would help us maintain a more organized MODULE.bazel.
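
For illustration, the kind of usage we have in mind might look like this (the helper file and symbol names here are hypothetical):

# maven_deps.bzl (hypothetical helper file)
MAVEN_ARTIFACTS = [
    "com.google.guava:guava:32.1.2-jre",
    "org.junit.jupiter:junit-jupiter:5.10.0",
]

# MODULE.bazel
load("//:maven_deps.bzl", "MAVEN_ARTIFACTS")

maven = use_extension("@rules_jvm_external//:extensions.bzl", "maven")
maven.install(artifacts = MAVEN_ARTIFACTS)
use_repo(maven, "maven")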

What underlying problem are you trying to solve with this feature?

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

6.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum commented 1 year ago

I don't think I'm exaggerating that much with the following claim: The availability of and reliance on load in WORKSPACE was a major contributor to its subpar performance.

Module extensions can accept label attributes on their tags and collect dependency information from the files pointed to by these labels. Without any changes to Bazel itself, rules_jvm_external could allow for the following usage where artifacts.json is some other file in your repo with e.g. an array of Maven coordinates:

maven = use_extension("@rules_jvm_external//:extensions.bzl", "maven")
maven.install(artifacts = "//:artifacts.json")

This would keep your MODULE.bazel file clean and your external Java dependencies separated from all the others while still allowing for fast discovery of the entire dependency tree.
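
For instance, artifacts.json could be as simple as an array of coordinate strings (hypothetical contents):

[
    "com.google.guava:guava:32.1.2-jre",
    "org.junit.jupiter:junit-jupiter:5.10.0"
]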

Wyverald commented 1 year ago

What @fmeum said; although we might explore allowing load from the current repository for the root module only, to better support large monorepos that will never be used as a dependency.

shs96c commented 1 year ago

Would there be an option of using load in a local repository, but when we upload to something like the BCR, having the load statements inlined so there's just the single MODULE.bazel file without any external dependencies?

Wyverald commented 1 year ago

See this comment in #14632 for a previous abandoned attempt.

We don't currently have the cycles to work on this. But if anyone is interested in picking up the work and writing a design for it (addressing the concerns in the comment above), we'd be happy to review PRs.

fmeum commented 1 year ago

We could:

The last restriction makes it less likely for two different loaded files to define conflicting bazel_deps. Typical usages could look like this:

load("//:go.bzl", "declare_go_overrides", "GO_REPOS")

go_deps = use_extension(...)
go_deps.from_file(go_mod = "//:go.mod")
declare_go_overrides(go_deps)
use_repo(go_deps, *GO_REPOS)
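
Here, go.bzl would contain only data and side-effect-free helpers, e.g. (a sketch; the gazelle_override tag is just one example):

# go.bzl (sketch)
GO_REPOS = [
    "com_github_foo_bar",
    "org_golang_x_sys",
]

def declare_go_overrides(go_deps):
    # Only calls tag classes on the extension proxy it is handed;
    # no top-level side effects of its own.
    go_deps.gazelle_override(
        path = "github.com/foo/bar",
        directives = ["gazelle:proto disable"],
    )
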
sluongng commented 1 year ago

One situation in which it would be helpful to use load is when configuring external package managers, e.g. like with rules_jvm_external. On a Java project it is not unusual to depend on several external jars. Putting that configuration in MODULE.bazel can pollute the file and make it more difficult to maintain.

The pain point of maintaining a large MODULE.bazel file is completely valid. But the solution of allowing load statements in there could be the wrong route to take.

Looking into the current Go ecosystem, I don't see anyone asking for the ability to "load / import" within go.mod / go.sum files. The reason is that go mod tidy provides everything you need to manage a go.mod file in a Go project: populating the content of the file, sorting, formatting entries, etc. An average Go developer would rarely even look into a go.mod file, as the tool does a great job of managing it. As a result, it's very rare to hear that the file is too big to manage.

So is it possible for us to solve this problem with better automation tooling than relying on dynamic template loading? For example: perhaps a bazel/gazelle mod tidy could help us populate the MODULE.bazel file automatically by scanning through the existing project. The convention of how to arrange and sort the content of MODULE.bazel could also be enforced there to avoid style-related arguments.

ashi009 commented 1 year ago

The pain point of maintaining a large MODULE.bazel file is completely valid. But the solution of allowing load statements in there could be the wrong route to take.

Looking into the current Go ecosystem, I don't see anyone asking for the ability to "load / import" within go.mod / go.sum files. The reason is that go mod tidy provides everything you need to manage a go.mod file in a Go project: populating the content of the file, sorting, formatting entries, etc. An average Go developer would rarely even look into a go.mod file, as the tool does a great job of managing it. As a result, it's very rare to hear that the file is too big to manage.

So is it possible for us to solve this problem with better automation tooling than relying on dynamic template loading? For example: perhaps a bazel/gazelle mod tidy could help us populate the MODULE.bazel file automatically by scanning through the existing project. The convention of how to arrange and sort the content of MODULE.bazel could also be enforced there to avoid style-related arguments.

The reason why Go can do this is that the source of truth for dependencies resides in the source tree, so it's okay to let the tool generate the file.

But for MODULE.bazel, it's already the source of truth, and there are lots of human-written parts in it, e.g. repo override configurations. Automation is not entirely feasible.

sluongng commented 1 year ago

@ashi009 In my mind, a command such as bazel mod tidy should solve several things:

  1. Update MODULE.bazel with appropriate configs using the current source tree. It's possible that somewhere in the source tree, a user would add to their BUILD file a load statement referring to an external rule. bazel mod tidy should be able to detect such usage and automatically add it to MODULE.bazel.

  2. Consistent formatting of MODULE.bazel. There should be an extensible set of config groupings in MODULE.bazel, one for each popular language. The order and format of these groups within the MODULE.bazel file should be deterministic so that we can run bazel mod tidy to format it over time.

  3. Similar to go.mod's replace directive, there should be an override section for folks to customize. However, I think the goal should be to keep this section small and limited. More extensive re-configuration and patching could/should be done by using a different Bzlmod registry instead of the MODULE.bazel file. The custom registry could be an in-tree directory managed by source control, so the setup could be relatively simple. You could also use multiple registries overlaying each other, so re-using the Bazel Central Registry is possible.

I think, at minimum, we should defer the addition of load / import within MODULE.bazel until BzlMod has been out for a while. It would let us direct the solutions toward (1) better automation tooling to manage the file and (2) alternative registries as much as possible. After the initial GA of BzlMod, we should be able to identify problems that could not be solved using either of the above approaches and add load / import if needed.

The reverse, however, is not feasible. Once load / import is added, we will have folks depending on that behavior, and taking it away will cause heavy churn/migration.

ashi009 commented 1 year ago

@sluongng Totally agree on the first two. However, that's not true for the third, as most of the time we are overriding the extension-generated repos.

In a relatively large monorepo, the number of repos that need a second touch is large. Our current non-Bzlmod rules_go-generated .bzl file, with lots of patches and custom Gazelle directives, is over 17k lines of code.

I don't see how this is going to play out in a single MODULE.bazel file with the help of any tool. Even if that is feasible, reviewing that file is also challenging -- how do we delegate one portion of the file to some reviewers while delegating other portions to others?

Also, a note on mirroring the BCR: it's not a valid solution for most users, as it adds too much toil for whoever maintains that mirror -- just think about pulling upstream releases and resolving merge conflicts.

Wyverald commented 1 year ago

Update MODULE.bazel with appropriate configs using the current source tree. It's possible that somewhere in the source tree, a user would add to their BUILD file a load statement referring to an external rule. bazel mod tidy should be able to detect such usage and automatically add it to MODULE.bazel.

I'm not sure this is feasible. Unlike Go where the import path fully identifies the module, in Bazel/Bzlmod you just see people using @rules_foo//:something and you've no idea whether that's actually the rules_foo you have on the BCR. Maybe they intended to import the module company_rules_foo as rules_foo, for example.

This is even less tractable if you consider module extensions and repos generated from them.

cgrindel commented 1 year ago

@ashi009 Just a quick clarification. With the custom registry solution, you don't need to mirror the entire BCR. The custom registry only needs to contain the customized modules.

sluongng commented 1 year ago

Also, a note on mirroring the BCR: it's not a valid solution for most users, as it adds too much toil for whoever maintains that mirror -- just think about pulling upstream releases and resolving merge conflicts.

@ashi009 I could be wrong, so please do correct me if I am. Here, I am not suggesting mirroring. I am suggesting using multiple registries on top of each other:

# .bazelrc
build --registry=./internal/registry/
build --registry=https://bcr.bazel.build

So you would only need to maintain your internal module customizations.

I'm not sure this is feasible. Unlike Go where the import path fully identifies the module, in Bazel/Bzlmod you just see people using @rules_foo//:something and you've no idea whether that's actually the rules_foo you have on the BCR. Maybe they intended to import the module company_rules_foo as rules_foo, for example.

This is even less tractable if you consider module extensions and repos generated from them.

@Wyverald I think a best-effort lookup from the existing registries should be sufficient. In case a repo is missing, simply provide the users with a warning/error message. go mod tidy does the same when a referenced module could not be found. This is also why enterprises often operate their own internal Go Proxy pointing to internal modules.

For module extensions, we could use the approach that exists in Gazelle, which is to provide directives in the form of comments. These directives would help guide the automation tool to map a repo name to a group/path inside MODULE.bazel. After a while, we could identify some common directives and promote support for them to the registry level.

cgrindel commented 1 year ago

The custom registry could be an in-tree directory managed by source control so the setup could be relatively simple.

I think @sluongng's idea for an in-repo registry could be the answer. It satisfies two important criteria in my mind:

  1. Keeps the customizations for the external dependencies in the repo. (Spreading them out to different repos does not help the problem.)
  2. Compartmentalizes the changes by external dependency. (Keeps things clean.)

@Wyverald I have not dug into the Bazel code for the registry yet. At first blush, it seems that if the on-disk custom registry had the correct layout, it would just be a matter of saying to Bazel "look over here" instead of "download from there". Am I thinking about this correctly?

Wyverald commented 1 year ago

I have not dug into the Bazel code for the registry yet. At first blush, it seems that if the on-disk custom registry had the correct layout, it would just be a matter of saying to Bazel "look over here" instead of "download from there". Am I thinking about this correctly?

Indeed. The only caveat is that you'd need to use a local_path type for your source.json files: https://bazel.build/external/registry#:~:text=The%20type%20can%20be%20changed%20to%20use%20a%20local%20path
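
For reference, such a source.json is tiny; it looks roughly like this (the path is resolved relative to the registry):

{
    "type": "local_path",
    "path": "../my_module"
}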

sluongng commented 1 year ago

@Wyverald The doc is giving me the impression that

When source.json type is local_path, then the registry must be hosted locally.

It does not imply the reverse which I think you are implying though:

When a registry is hosted locally, all the source.json in it must have type local_path

Am I understanding this wrongly? I ask because I was expecting the archive type to still work with a local registry.

Wyverald commented 1 year ago

You're right -- I didn't try to imply that you couldn't use the archive type with local registries. I just had the impression that Chuck was asking for a no-download, just-use-this-directory registry, but looking back now I might have misinterpreted his comment.

cgrindel commented 1 year ago

I just had the impression that Chuck was asking for a no-download, just-use-this-directory registry

This is what I am asking. Thanks. 🙂

cgrindel commented 1 year ago

I wonder if there might be two use cases that deserve their own solutions:

  1. A repo that uses a ruleset to access third-party dependencies (e.g. rules_jvm_external, rules_go, rules_swift_package_manager).
  2. A repo with many dependencies.

Scenario 1: A repo that uses a ruleset to access third-party dependencies

There are two parts to this scenario.

  1. There is the declaration of the language-specific dependencies. This seems to be well handled by the ruleset providing an extension tag to load the file.
  2. Declaring the repositories generated by the extension. (e.g. use_repo(go_deps, ...))

The second part is a bit of a pain today. I provided a rudimentary way to generate this for clients in rules_swift_package_manager. I believe that a more universal solution is being worked on.

Here are two thoughts on how to eliminate listing the declared repositories in the use_repo(...) declaration:

  1. Allow extensions to return a provider with the repositories that they provide.
  2. Update use_repo to accept a file that lists the repos (e.g. use_repo(go_deps, from_file = "go_repos.json")). A benefit to this approach is that the ruleset maintainers can provide tooling to generate the file for the client.

Scenario 2: A repo with many dependencies.

If the desire is to separate MODULE.bazel into different files so that it can be organized, that does seem reasonable. However, as has been mentioned previously, the traditional load() is problematic. Let's call it something completely different, with a name that does not imply anything but what it does (e.g. import_module_bazel()). There would be no change to the semantics of how the MODULE.bazel is processed.

# File: MODULE.bazel
import_module_bazel(
    "MODULE.runtime.bazel",
    "MODULE.dev.bazel",
)

# File: MODULE.runtime.bazel
bazel_dep(name = "rules_jvm_external", version = "1.0.0")
import_module_bazel("MODULE.jvm.bazel")

# File: MODULE.jvm.bazel
jvm_deps = use_extension(...)
jvm_deps.from_file("jvm_deps.json")
use_repo(
    jvm_deps, 
    # List of repositories declared by jvm_deps from jvm_deps.json
)

# File: MODULE.dev.bazel
bazel_dep(
    name = "gazelle",
    version = "0.29.0",
    dev_dependency = True,
)

The import of the files would work similarly to .bazelrc imports. In addition, it would be an error to import the same file multiple times.

With this approach, we do not introduce the baggage that load() might bring while allowing maintainers to organize their external dependency declarations as they wish.

shs96c commented 1 year ago

One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel), and getting reviews for changes to these files can take prohibitively long or be complicated.

With a WORKSPACE file, the answer is simple: segment using load and structure the source tree on a per-language basis. This broadens the pool of people who can review specific parts of the WORKSPACE file. Without a load option, there'd be no way of doing this using Bzlmod.
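
For example, a typical per-language segmentation (paths hypothetical):

# WORKSPACE
load("//third_party/java:deps.bzl", "java_deps")
java_deps()

load("//third_party/go:deps.bzl", "go_deps")
go_deps()

Each deps.bzl can then be owned and reviewed by the relevant language team.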

One thing I'd suggest would be to allow load statements in MODULE.bazel files, but when uploading to a registry, have the load statements "unrolled" so we end up with a single file, as we have now. This allows people the flexibility to organise their source tree how they want, and also retains the simplicity of the existing model (it's similar to how search engines' spiders run with JS enabled but allow only a very limited number of JS evaluations in order to get meaningful results).

We already have tooling for calculating the hashes in BCR, so performing some more work when adding or updating a module doesn't seem unreasonable.

If this approach is taken, then it would be necessary for overridden modules to also use load statements.

This doesn't preclude us providing better tooling for handling third party deps.

sluongng commented 1 year ago

@shs96c what do you think about using a local registry exposing custom modules+extensions to solve that problem?

For example: in a monorepo with both Go and Java, the local registry could expose Go dependencies via 1 custom module+extension and Java in another. On disk, these will be 2 different directories under /modules/{internal_go,internal_java} and thus you could apply CODEOWNERS / OWNERS control appropriately.

Then, the MODULE.bazel file will effectively be

bazel_dep(name = "internal_java", version = "1.0")
maven = use_extension("@internal_java//:extensions.bzl", "maven")
maven.install()

bazel_dep(name = "internal_go", version = "1.0")
go_mod = use_extension("@internal_go//:extensions.bzl", "go_mod")
go_mod.install()

with the .bazelrc file being

common --registry=./local_registry
common --registry=https://bcr.bazel.build

and you would not have to edit it as much.

fmeum commented 1 year ago

Just want to clarify some points that have been brought up so far:

There is the declaration of the language-specific dependencies. This seems to be well handled by the ruleset providing an extension tag to load the file.

Keep in mind that some languages lack a well-established format for dependency specifications. While Go, Python and Node have this, Java doesn't. Of course such a format could be invented, but that leads to the question "Why not just Starlark?"

Allow extensions to return a provider with the repositories that they provide.

This exists in the form of module_ctx.extension_metadata, but it only provides a warning with fixup commands; it can't outright replace a use_repo call unless we accept the cost of eagerly evaluating the extension for every build. When the lock file support matures, that may even become an option.
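
For context, an extension opts into this roughly as follows (a sketch; the repo names are made up):

def _my_ext_impl(module_ctx):
    # ... declare repos based on the collected tags ...
    return module_ctx.extension_metadata(
        root_module_direct_deps = ["repo_a", "repo_b"],
        root_module_direct_dev_deps = [],
    )

my_ext = module_extension(implementation = _my_ext_impl)

Bazel then compares this metadata against the root module's use_repo calls and prints buildozer commands to fix up any mismatch.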

Update use_repo to accept a file that lists the repos (e.g. use_repo(go_deps, from_file = "go_repos.json")). A benefit to this approach is that the ruleset maintainers can provide tooling to generate the file for the client.

At least for the root module this could work, but I wonder whether this is really better than load. Starlark supports comments and trailing commas while being slightly less verbose than JSON. Rulesets could still generate the Starlark file, although I think module_ctx.extension_metadata is the way to do this. We could (and e.g. @tyler-french has already asked about this) consider having it output a file in addition to printing fixup commands to the console.

import_module_bazel (or any other .bazelrc style import mechanism)

This suffers from the issues explained in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595. A huge problem with WORKSPACE was its iterative nature, where any transitively called macro could have the side effect of registering a dependency. With Bzlmod, this would result in a hard error instead of a silent change of the dependency graph, which is better, but would still not result in strict decoupling.

If we restrict load in a way similar to https://github.com/bazelbuild/bazel/issues/17880#issuecomment-1574144457, we can ensure that statements with side effects can only be executed at the top level, that is, right in the MODULE.bazel file. Essentially, the loaded files will only provide data, not code.

unrolling load

Since MODULE.bazel doesn't rely on the "export hack" where the name of a variable is attached to the object it references, this may actually be feasible with a lot of name mangling. It would probably best live in buildozer. @vladmos Has the idea of inlining loads already been brought up at some point? Do you think that this would be feasible to implement for .bzl dialects that do not rely on the export hack?

custom registries per language

@sluongng How do you imagine the code you posted above to work with strict deps?

bazel_dep(name = "internal_go", version = "1.0")
go_mod = use_extension("@internal_go//:extensions.bzl", "go_mod")
go_mod.install()

internal_go would presumably use go_deps under the hood to declare the actual Go dependencies, but that would still require the root module to use_repo them. Since module extensions can't provide tags to other module extensions, I don't see what the go_mod extension would do in this example.

A custom registry is a great way to reduce the amount of *_overrides in the MODULE.bazel file, but I am not sure it can provide the same benefit for extension-managed non-Bazel-aware dependencies.

sluongng commented 1 year ago

How do you imagine the code you posted above to work with strict deps?

internal_go would presumably use go_deps under the hood to declare the actual Go dependencies, but that would still require the root module to use_repo them. Since module extensions can't provide tags to other modules, I don't see what the go_mod extension would do in this example.

I think I misunderstood and assumed that extensions could call use_repo, hence my suggestion above.

I will have to think about this a bit more. But if use_repo is the only blocker, then perhaps we should just solve it there. For example, allowing something like use_repo(go_mod, use_all = True) to load all declarations in mctx.extension_metadata eagerly with use_all = False by default to load things lazily instead. It's applicable especially for custom local modules, where you already have fine-grained control over what's going to be included within the module's extension logic.

Ideally, I think we should leverage a custom local registry as the go-to setup, which would enable large monorepos to have a sane third-party dependency management system. We simply should not invent a second "load" system where half of the third-party deps config is managed in an internal registry (local or remote) and half is managed in a bazel/build_rules/override dir somewhere in a monorepo. It's not ergonomic, and it's unfriendly to newer users. So I would favor approaches that would add more features/functionalities to the existing registry setup over repeating the load approach of WORKSPACE.

fmeum commented 1 year ago

For example, allowing something like use_repo(go_mod, use_all = True) to load all declarations in mctx.extension_metadata eagerly with use_all = False by default to load things lazily instead.

This could work (except not with that syntax since use_repo already accepts arbitrary keyword arguments). Its performance impact could be softened by Bzlmod lock files.

But if use_repo is the only blocker, then perhaps we should just solve it there.

I am not sure that it is though. @ashi009 brought up overrides of deps fetched by module extensions and @shs96c brought up rules_jvm_external's specifications of Java deps as other cases that aren't that usable yet. If anyone has other cases, please add them so that we know the full extent of the problem we need to solve.

So I would favor approaches that would add more features/functionalities to the existing registry setup over repeating the load approach of WORKSPACE.

I fully agree that we should never bring WORKSPACE's load semantics to Bzlmod. But loads that are free of side effects would already be much better than that.

cgrindel commented 1 year ago

This suffers from the issues explained in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595. A huge problem with WORKSPACE was its iterative nature, where any transitively called macro could have the side effect of registering a dependency. With Bzlmod, this would result in a hard error instead of a silent change of the dependency graph, which is better, but would still not result in strict decoupling.

@Wyverald @fmeum I am not convinced this is as hard to maintain as is suggested in the comment. Also, I don't think that the goal is strict decoupling. The goal is to organize the dependencies in a way that makes sense for the client. We should just focus on how to help the client identify conflicts and not shoot themselves in the foot.

For example, if two segments both declare a bazel_dep on the same module, we'd just error out since a module can't depend on the same module twice (without a multiple-version override, that is).

Failing is good. The trick is to be clear why it is failing and where the conflicting declaration resides.

One might be tempted to say "just pick the highest version"

I do not think that we should do this. A well-crafted error feels like the right result to me.

What if segment A declares bazel_dep(X@1.0, repo_name="foo"), and segment B declares bazel_dep(Y@1.0, repo_name="foo")?

This could happen in a large single MODULE.bazel file. We should provide an error that helps the client find the conflict.

There is indeed a parallel with import in .bazelrc files, but that statement is usually used to import user-defined RC files, instead of segmenting a long config file. There's much less of a need to import a "user-defined segment of MODULE.bazel directives".

I disagree. I do use multiple bazelrc files to organize the configuration and share it with child workspaces. However, I think that the real point is that the "import" concept would help clients organize their files while still maintaining the strict standards that already exist with Bazel modules.

One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel), and getting reviews for changes to these files can take prohibitively long or be complicated.

With regard to @shs96c's point: in my experience with monorepos, the segmentation of external dependencies is almost always by language. If that is the primary case, there is little opportunity for code owners of different languages to conflict over repository names.

With regard to using load() vs a new "import" function, I think that we would be best served not to special-case load() for MODULE.bazel files. The load() concept means something in Starlark and that is not what is being suggested by introducing "import" into MODULE.bazel files. The end result of an "import" is a single contiguous file with declarations.

jmillikin commented 1 year ago

Is it worth separating the "reusable module" and "big monorepo workspace" into separate files?

It seems to me that the tension in trying to have a single place to configure both is what's making it difficult to decide on a direction.

You could imagine having a WORKSPACE file where the workspace() call at the top has some extra parameter indicating that a semantic halfway between "load anything anywhere" and "pre-declare everything" must be used. This would forbid directly using load() from WORKSPACE, but would replace it with some new functions that let the author more directly express what they want.

Rough sketch with placeholder names:

# WORKSPACE.bazel
workspace(
    name = "my_monorepo",
    only_module_deps = True,
)

# Load a file from the local workspace that contains a "workspace rule", which can
# be used for setting up the workspace's module hierarchy.
#
# The `use_repos` call is like bzlmod `use_repo`, but since enumerating all the
# external repo names would defeat the purpose, it just takes a single workspace
# rule to load externals for.
#
third_party = load_workspace_rule("//third_party:third_party.bzl", "third_party")
use_repos(third_party)

# alternatively, maybe just call it? I don't know which API the Bazel folks would prefer here.
third_party()

# third_party/third_party.bzl
load("//third_party/github.com/google/googletest:googletest.bzl", "googletest")

def _third_party(workspace_ctx):
    googletest(workspace_ctx)
    # ...

# `workspace_rule` is a bit like `module_extension`, allowing controlled definition of external repos.
#
# Since the `WORKSPACE` file is only evaluated in the root module, this functionality cannot leak
# into BCR modules.
third_party = workspace_rule(_third_party)

# third_party/github.com/google/googletest/googletest.bzl

def googletest(workspace_ctx):
    # works like `MODULE.bazel` global `bazel_dep` -- other module functions can also be provided
    # so that a multi-level dependency graph (rules_go -> gazelle -> Go third-party libs) can be expressed.
    workspace_ctx.bazel_dep(
        name = "googletest",
        version = "1.12.1",
        dev_dependency = True,
        repo_name = "com_google_googletest",
)

aherrmann commented 1 year ago

My understanding is that extension_metadata together with buildozer automation as described in the proposal would solve the issue of maintaining large use_repo stanzas for third party dependencies like Go modules or Maven dependencies. This approach keeps MODULE.bazel files declarative and avoids the kind of statefulness that WORKSPACE files have.

A use_all = True-like concept, in conjunction with lock files, could be a natural extension of this.

One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel), and getting reviews for changes to these files can take prohibitively long or be complicated.

With a WORKSPACE file, the answer is simple: segment using load and structure the source tree on a per-language basis.

I see the concern. On the surface, segmentation as suggested in #14632 sounds like a good solution. However, I can understand that there may be significant complications when it comes to implementation. Though, as I understand the proposal, the idea is not for the segments to be independent. It's more about being able to assign clear ownership to sub-sections of MODULE.bazel, and that is easiest to do if they can be in separate files. So perhaps the obstacles are not that big.

I would imagine that a prototype of this approach could already be implemented today with a bit of automation to generate the composed MODULE.bazel file. Essentially, perform the suggested unrolling locally. Perhaps this could help identify if the approach works in practice or if it fails on the empty promise highlighted in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595.

cgrindel commented 1 year ago

I would imagine that a prototype of this approach could already be implemented today with a bit of automation to generate the composed MODULE.bazel file. Essentially, perform the suggested unrolling locally. Perhaps this could help identify if the approach works in practice or if it fails on the empty promise highlighted in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595.

This is a great point. We could even provide some rules/macros to make this pretty seamless.

joca-bt commented 1 year ago

Did we agree on any way to move forward?

rickeylev commented 1 year ago

Fabian pointed me over here from a rules_python PR where we're trying to figure out how to let users express patching. E.g., given a transitive closure of stuff from PyPI, we generate a bunch of stuff and extract various downloaded things, and users want to patch different parts of the whole process and results.

The comments on our PR are long, so I'll try to summarize:

A large crux of the issue is how quickly a MODULE file can grow in order to specify what to patch. A simple tag_class such as:

pip.patch(
  package = "numpy",
  version = "13.12.1",
  python_version = "3.10",
  platform = "darwin",
  ...
  patches = [...],
)

Quickly turns into a very large MODULE config. For Google, for example, we'd easily have hundreds of patches to apply. Any monorepo is probably going to have similar problems; patching external dependencies for Bazel is very common. For a non-monorepo situation, I would advise looking at TensorFlow: they have a workspace sharded into 4 sub-workspace .bzl files.

Like others mentioned, the main issue, to me at least, is that managing it all in one file presents challenges for efficient and safe reviewing. A simple textual include would go a long way (as unappealing as that sounds).

Wyverald commented 1 year ago

Re-reading this thread, it seems that most people are in favor of an import-like statement that allows segmentation of the MODULE.bazel file without introducing any "resolution" logic between the segments. This statement would also only be allowed in the root module (so no dependencies can have a MODULE.bazel with imports).

This is essentially a revival of @andyrinne12's original design, with the explicit decision to throw an error when two segments conflict in any way (multiple bazel_deps on the same module, same repo name used multiple times, etc).

It sounds workable to me, but I wanted to explicitly point out the difficulties Andrei encountered when implementing the "merging the segments into one MODULE.bazel file" helper tool. As the MODULE.bazel file can contain many name bindings (e.g. maven = use_extension("...", "maven")), merging the files isn't as straightforward as a concatenation. In the design linked above, Andrei suggested that we could rename symbols to avoid conflicts (e.g. the second maven would be renamed to maven2), but that's IMO getting way too complicated.

So we could go ahead with this, but not promise the automatic merger tool, just to solve the monorepo use case. I also wonder, given that we're mostly addressing the monorepo use case, how feasible it is for people to try to have a pre-build step tool that they run themselves to assemble a MODULE.bazel file from multiple segments (just simple textual inclusion). Presumably updates to the MODULE.bazel file are infrequent enough that this shouldn't be too annoying?


To briefly address the other approach, which involves allowing a load that essentially only allows loading POJOs and "helper macros":

I'm interested to hear more thoughts about this approach too.

fmeum commented 1 year ago

it's even harder (maybe actually impossible) to provide an automatic merger tool

With both the load and the import implementations, an automatic merger tool could borrow the IIFE (Immediately Invoked Function Expression) trick from JavaScript bundlers, assuming we allow def usage in the combined MODULE.bazel. If we wanted to, we could allow def only in module files that begin with a magic comment such as # Automatically generated. Do not edit.

Example for load; import would look similar, just without the returns:

# lib.bzl
MY_VERSION = "1.2.3"

def add_foo_deps(foo):
    foo.tag()
    foo.other_tag()

# MODULE.bazel
load(":lib.bzl", "add_foo_deps", RENAMED_MY_VERSION = "MY_VERSION")
module(name = "my_module", version = RENAMED_MY_VERSION)

foo = use_extension("@foo2//:ext.bzl")
add_foo_deps(foo)
use_repo(foo, ...)

# MODULE.bazel.combined
def lib_func():
    MY_VERSION = "1.2.3"

    def add_foo_deps(foo):
        foo.tag()
        foo.other_tag()

    return struct(
        MY_VERSION = MY_VERSION,
        add_foo_deps = add_foo_deps,
    )
lib_exports = lib_func()

module(name = "my_module", version = lib_exports.MY_VERSION)

foo = use_extension("@foo2//:ext.bzl")
lib_exports.add_foo_deps(foo)
use_repo(foo, ...)

Using the buildtools Go package to perform this on the AST rather than at the raw-text level should make this both quite doable and reliable.

So we could go ahead with this, but not promise the automatic merger tool, just to solve the monorepo use case. I also wonder, given that we're mostly addressing the monorepo use case, how feasible it is for people to try to have a pre-build step tool that they run themselves to assemble a MODULE.bazel file from multiple segments (just simple textual inclusion). Presumably updates to the MODULE.bazel file are infrequent enough that this shouldn't be too annoying?

If a project such as TensorFlow, which should at some point become a BCR module, already reaches the scale where segmentation would help, then we should either have a merge tool from day 1 or at least have a somewhat concrete plan in mind for how to write one if it becomes necessary. Textual inclusion is very brittle, and letting each user figure this out by themselves sounds wasteful to me. It would probably also result in cases where issues with custom tooling are attributed to Bazel itself.

it doesn't allow separating the use_extension calls into a different file

I see that as both a pro and a con: You know all the types of deps that your project relies on just by looking at the root MODULE.bazel file. With import, we would need to extend tooling such as Gazelle and buildozer to traverse the full import tree when searching for or modifying a bazel_dep or use_repo. But this is definitely doable, so it's not a blocker, just something to keep in mind.

fmeum commented 1 year ago

Adding to my points above, I think that the biggest drawback of the load approach is that it can't deal with use_repos well:

If we don't allow use_repo in loaded files and also don't allow positional argument expansion, then adding a new module extension-managed dependency still requires a change to MODULE.bazel, which can be a challenge in monorepos with strong code owner protection.

If we do allow use_repo in loaded files or allow positional argument expansion, then the use_repo call itself or the repo list can be extracted into a Starlark constant and loaded from a separate file, but we lose the automatic use_repo fixup capability.

I'm thus also leaning towards import now.

Wyverald commented 1 year ago

If we wanted to, we could allow def only in module files that begin with a magic comment such as # Automatically generated. Do not edit.

Even if we were to allow this (I'm not convinced), this is only the beginning of our troubles. Loaded .bzl files can load further .bzls, and theoretically you could be "including" a .bzl file multiple times with your proposed solution. (Consider a MODULE.bazel that loads from a.bzl and b.bzl, both of which load from c.bzl.) Not to mention the resulting merged MODULE.bazel file is pretty hard to read, with magical re-exports and what not.

For a non-monorepo situation, I would advise looking at TensorFlow: they have a workspace sharded into 4 sub-workspace .bzl files.

If a project such as TensorFlow, which should at some point become a BCR module, already reaches the scale where segmentation would help, [...]

IIUC TF needs to segment their WORKSPACE file because, well, it's a WORKSPACE file -- you need to call a macro that instantiates group A of repos --> load from group A --> call macros in group A that instantiate group B --> load from group B --> rinse and repeat. IOW the 4-part sharding is done due to technical limitations, not (just) for ease of management.

I'm not (yet) convinced that TF's prospective MODULE.bazel file would be unmanageable unless sharded.

rickeylev commented 1 year ago

For TF, it's the top-level part that caught my attention; it appears to be nothing more than sharding 1000+ lines into 4 separate files for the A->B->C->D dependency graph height/depth.

https://github.com/tensorflow/tensorflow/blob/master/WORKSPACE#L74

IOW the 4-part sharding is done due to technical limitations, not (just) for ease of management.

As I understand WORKSPACE, the sharding shouldn't affect the semantic behavior? I.e., you should be able to remove the sharding and inline it into WORKSPACE directly, so long as the particular order of loads/statements is the same. The result would be about +1000 lines in the main WORKSPACE file, but I don't see why it wouldn't work.

I'm not (yet) convinced that TF's prospective MODULE.bazel file would be unmanageable unless sharded.

Maybe sketch out what a translation to MODULE.bazel would look like? (Maybe you already have?) workspace2.bzl seems to be the big one. I don't necessarily disagree with you; I could see it going either way. Superficially, most of the TF workspace config looks like it'd be simple bazel_dep() lines. But there are also a lot of little things in there: almost everything has a custom mirror URL, there are 40+ patches, and they don't use plain http_file but a custom wrapper with some extra args that do IDK what. And we know that some "design patterns" for WORKSPACE macros didn't have a clean, direct translation to Bzlmod.

Wyverald commented 1 year ago

As I understand WORKSPACE, the sharding shouldn't affect the semantic behavior? I.e., you should be able to remove the sharding and inline it into WORKSPACE directly, so long as the particular order of loads/statements is the same. The result would be about +1000 lines in the main WORKSPACE file, but I don't see why it wouldn't work.

Well, yes. My assumption was that those 4 macros are what you need to put in your own WORKSPACE if you're a project with a direct dep on TensorFlow, though that assumption may not be accurate, since a brief search of their docs didn't yield any instructions on "how to use TF as a Bazel project dependency". Maybe they never intended TF to be used like that? In which case our whole debate is moot, since it'd be a monorepo use case. But anyway, that assumption is why I said the 4-part sharding is due to technical limitations (for example, it cannot be just 1 shard).

Wyverald commented 1 year ago

Following up on this: we briefly discussed this during the BoF session at BazelCon, and the general sentiment was that this was still desired functionality, but mostly for top-level projects (that is, projects not intended to be depended upon by others; also big monorepos at companies). So we'll likely be doing what this comment says (import statements for MODULE.bazel files that are illegal for non-root modules). Tentatively targeting 7.1 or 7.2.

tyler-french commented 1 year ago

Following up on this: we briefly discussed this during the BoF session at BazelCon, and the general sentiment was that this was still desired functionality, but mostly for top-level projects (that is, projects not intended to be depended upon by others; also big monorepos at companies). So we'll likely be doing what this comment says (import statements for MODULE.bazel files that are illegal for non-root modules). Tentatively targeting 7.1 or 7.2.

This sounds great! I want to emphasize a couple of things that we discussed at BazelCon:

Thanks again! cc @linzhp

iancha1992 commented 1 year ago

@bazel-io fork 7.1.0

iancha1992 commented 6 months ago

A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!
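
For those landing here later: the functionality shipped as an include() statement that only the root module's MODULE.bazel may use. Assuming the released syntax (segment files live in the main repository and are named *.MODULE.bazel), usage looks roughly like:

# MODULE.bazel
module(name = "my_monorepo")

include("//java:java.MODULE.bazel")
include("//go:go.MODULE.bazel")

# java/java.MODULE.bazel (included as if written inline)
bazel_dep(name = "rules_jvm_external", version = "6.0")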