I don't think I'm exaggerating that much with the following claim: the availability of and reliance on load in WORKSPACE was a major contributor to its subpar performance.
Module extensions can accept label attributes on their tags and collect dependency information from the files pointed to by these labels. Without any changes to Bazel itself, rules_jvm_external could allow for the following usage, where artifacts.json is some other file in your repo with e.g. an array of Maven coordinates:
maven = use_extension("@rules_jvm_external//:extensions.bzl", "maven")
maven.install(artifacts = "//:artifacts.json")
This would keep your MODULE.bazel file clean and your external Java dependencies separated from all the others while still allowing for fast discovery of the entire dependency tree.
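For illustration, here is a minimal sketch of the extension side (the tag class install, the attribute artifacts, and the JSON-array format come from the example above; everything else is assumed and is not rules_jvm_external's actual implementation):

def _maven_impl(module_ctx):
    for mod in module_ctx.modules:
        for install in mod.tags.install:
            # Resolve the label attribute and read the file it points to.
            raw = module_ctx.read(module_ctx.path(install.artifacts))
            # Assumed format: a JSON array of Maven coordinate strings.
            artifacts = json.decode(raw)
            # ... declare repositories for the parsed coordinates here ...

maven = module_extension(
    implementation = _maven_impl,
    tag_classes = {
        "install": tag_class(attrs = {"artifacts": attr.label()}),
    },
)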
What @fmeum said; although, we might explore allowing load from the current repository for the root module only, to better support large monorepos that will never be used as a dependency.
Would there be an option of using load in a local repository, but when we upload to something like the BCR, having the load statements inlined so there's just the single MODULE.bazel file without any external dependencies?
See this comment in #14632 for a previous abandoned attempt.
We don't currently have the cycles to work on this. But if anyone is interested in picking up the work and writing a design for it (addressing the concerns in the comment above), we'd be happy to review PRs.
We could:
- allow load in the root module only;
- restrict load to .bzl files in the root module (including transitive loads);
- not add MODULE.bazel globals to the environment of the loaded .bzl files. Instead users will have to pass extension proxies into macros explicitly or load plain Starlark data. Edit: Possibly even better, prohibit calling the globals (but not proxy tags) in loaded files, which gives better error messages and prevents passing them into macros.

The last restriction makes it less likely for two different loaded files to define conflicting bazel_deps. Typical usages could look like this:
load("//:go.bzl", "declare_go_overrides", "GO_REPOS")
go_deps = use_extension(...)
go_deps.from_file(go_mod = "//:go.mod")
declare_go_overrides(go_deps)
use_repo(go_deps, *GO_REPOS)
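To make the shape concrete, a hedged sketch of what //:go.bzl could contain under these restrictions (repo names and the override are invented; gazelle_override mirrors the tag offered by gazelle's go_deps extension):

# //:go.bzl -- only data and side-effect-free helpers.
GO_REPOS = [
    "com_github_pkg_errors",
    "org_golang_x_sys",
]

def declare_go_overrides(go_deps):
    # The extension proxy is passed in explicitly; under the proposed
    # restrictions this file cannot call use_extension or bazel_dep itself.
    go_deps.gazelle_override(
        path = "github.com/pkg/errors",
        directives = ["gazelle:proto disable"],
    )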
One situation in which it would be helpful to use load is when configuring external package managers, e.g. like with rules_jvm_external. On a Java project it is not unusual to depend on several external jars. Putting that configuration in MODULE.bazel can pollute the file and make it more difficult to maintain.
The pain point of maintaining a large MODULE.bazel file is completely valid. But the solution of allowing load statements in there could be the wrong route to take.

Looking into the current Go ecosystem, I don't see anyone asking for the ability to "load / import" within go.mod / go.sum files. The reason is that go mod tidy provides everything you need to manage a go.mod file in a Go project: populating the content of the file, sorting, formatting entries, etc. An average Go developer would not even look into a go.mod file, as the tool does a great job of managing it. As a result, it's very rare to hear that the file is too big to manage.

So is it possible for us to solve this problem with better automation tooling rather than relying on dynamic template loading? For example: perhaps a bazel/gazelle mod tidy could help us populate the MODULE.bazel file automatically by scanning through the existing project. The convention of how to arrange and sort the content of MODULE.bazel could also be enforced there to avoid style-related arguments.
The reason Go can do this is that the source of truth for dependencies resides in the source tree, so it's okay to let a tool generate the file.

But MODULE.bazel is already the source of truth, and there are lots of human-written parts in it, e.g. repo override configurations. Full automation is not entirely feasible.
@ashi009 In my mind, a command such as bazel mod tidy should solve several things:

1. Update MODULE.bazel with appropriate configs using the current source tree. It's possible that somewhere in the source tree, a user would add to their BUILD file a load statement referring to an external rule. bazel mod tidy should be able to detect such usage and automatically add it to MODULE.bazel.

2. Consistent formatting of MODULE.bazel. There should be an extensible set of config groupings in MODULE.bazel, one for each popular language. The order and format of these groups within the MODULE.bazel file should be deterministic so that we could run bazel mod tidy to format it over time.

3. Similar to go.mod's replace directive, there should be an override section for folks to customize. However, I think the goal should be to keep this section small and limited. More extensive re-configuration and patching could/should be done by using a different Bzlmod registry instead of the MODULE.bazel file. The custom registry could be an in-tree directory managed by source control, so the setup could be relatively simple. You could also use multiple registries overlaying each other, thus re-using the Bazel Central Registry is possible.
I think at minimum, we should defer the addition of load / import within MODULE.bazel until after Bzlmod has been out for a while. It would let us direct the solutions toward (1) better automation tooling to manage the file and (2) alternative registries as much as possible. After the initial GA of Bzlmod, we should be able to identify problems that could not be solved using either of the above approaches and add load / import if needed.

The reverse, however, is not feasible. Once load / import is added, we will have folks depending on that behavior, and taking it away will cause heavy churn/migration.
@sluongng Totally agree on the first two. However, not true for the third, as most of the time we are overriding the extension-generated repos.

In a relatively large monorepo, the number of repos that need a second touch is large. Our current non-Bzlmod rules_go-generated bzl file, with lots of patches and custom gazelle directives, is over 17k LOC.

I don't see how this is going to play out in a single MODULE.bazel file with the help of any tool. Even if that were feasible, reviewing that file would also be challenging -- how do we delegate one portion of a file to some reviewers and other portions to others?

Also a note on mirroring the BCR: it's not a valid solution for most users, as it adds too much toil for whoever maintains that mirror -- just think about pulling upstream releases and resolving merge conflicts.
Update MODULE.bazel with appropriate configs using the current source tree. It's possible that somewhere in the source tree, a user would add to their BUILD file a load statement referring to an external rule. bazel mod tidy should be able to detect such usage and automatically add it to MODULE.bazel.
I'm not sure this is feasible. Unlike Go, where the import path fully identifies the module, in Bazel/Bzlmod you just see people using @rules_foo//:something and you've no idea whether that's actually the rules_foo you have on the BCR. Maybe they intended to import the module company_rules_foo as rules_foo, for example.
This is even less tractable if you consider module extensions and repos generated from them.
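As a small illustration (module names invented), the same BUILD-file usage could correspond to either of these MODULE.bazel declarations:

# Both make @rules_foo//:something resolvable in BUILD files:
bazel_dep(name = "rules_foo", version = "1.0")
# ...or...
bazel_dep(name = "company_rules_foo", version = "2.0", repo_name = "rules_foo")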
@ashi009 Just a quick clarification. With the custom registry solution, you don't need to mirror the entire BCR. The custom registry only needs to contain the customized modules.
Also a note on mirroring the BCR: it's not a valid solution for most users, as it adds too much toil for whoever maintains that mirror -- just think about pulling upstream releases and resolving merge conflicts.
@ashi009 I could be wrong, so please do correct me if I am. Here, I am not suggesting mirroring. I am suggesting using multiple registries on top of each other:
# .bazelrc
build --registry=./internal/registry/
build --registry=https://bcr.bazel.build
So you would only need to maintain your internal module customizations.
I'm not sure this is feasible. Unlike Go where the import path fully identifies the module, in Bazel/Bzlmod you just see people using @rules_foo//:something and you've no idea whether that's actually the rules_foo you have on the BCR. Maybe they intended to import the module company_rules_foo as rules_foo, for example.
This is even less tractable if you consider module extensions and repos generated from them.
@Wyverald I think a best-effort lookup from the existing registries should be sufficient.
In case a repo is missing, simply provide the users with a warning/error message. go mod tidy does the same when a referenced module could not be found. This is also why enterprises often operate their own internal Go proxy pointing to internal modules.
For module extensions, we could use the approach existing in gazelle, which is to provide directives in the form of comments. These directives will help guide the automation tool to map a repo name to a group/path inside MODULE.bazel. After a while, we could identify some common directives and upgrade support for them to the registry level.
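As a rough sketch (the directive below is hypothetical; only the comment-directive style is borrowed from gazelle, which uses e.g. # gazelle:prefix today), such hints could sit next to the declarations they describe:

# MODULE.bazel
# bazel:group java  (hypothetical directive assigning the block below to a group)
maven = use_extension("@rules_jvm_external//:extensions.bzl", "maven")
use_repo(maven, "maven")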
The custom registry could be an in-tree directory managed by source control so the setup could be relatively simple.
I think @sluongng's idea for an in-repo registry could be the answer. It satisfies two important criteria in my mind.
@Wyverald I have not dug into the Bazel code for the registry yet. At first blush, it seems that if the on-disk custom registry had the correct layout, it would just be a matter of saying to Bazel "look over here" instead of "download from there". Am I thinking about this correctly?
Indeed. The only caveat is that you'd need to use a local_path type for your source.json files: https://bazel.build/external/registry#:~:text=The%20type%20can%20be%20changed%20to%20use%20a%20local%20path
@Wyverald The doc is giving me the impression that:

When the source.json type is local_path, then the registry must be hosted locally.

It does not imply the reverse, which I think you are implying though:

When a registry is hosted locally, all the source.json files in it must have type local_path.

Am I understanding this wrongly? Because I was expecting that the archive type would still work with a local registry.
You're right -- I didn't try to imply that you couldn't use the archive type with local registries. I just had the impression that Chuck was asking for a no-download, just-use-this-directory registry, but looking back now I might have misinterpreted his comment.
I just had the impression that Chuck was asking for a no-download, just-use-this-directory registry
This is what I am asking. Thanks. 🙂
I wonder if there might be two use cases that may deserve their own solutions:

- Declaring dependencies that are managed by an external package manager (e.g. rules_jvm_external, rules_go, rules_swift_package_manager). There are two parts to this scenario. There is the declaration of the language-specific dependencies. Then there is listing the repositories declared by the extension (e.g. use_repo(go_deps, ...)). The second part is a bit of a pain, today. I provided a rudimentary way to generate this for clients in rules_swift_package_manager. I believe that a more universal solution is being worked on.
- Separating the MODULE.bazel into different files so that it can be organized.
Here are two thoughts on how to eliminate listing the declared repositories in the use_repo(...) declaration:

- Allow extensions to return a provider with the repositories that they provide.
- Update use_repos to accept a file that lists the repos (e.g. use_repo(go_deps, from_file = "go_repos.json")). A benefit to this approach is that the ruleset maintainers can provide tooling to generate the file for the client.

If the desire is to separate the MODULE.bazel into different files so that it can be organized, that does seem reasonable. However, as has been mentioned previously, the traditional load() is problematic. Let's call it something completely different, with a name that does not imply anything but what it does (e.g. import_module_bazel()). There would be no change to the semantics of how the MODULE.bazel is processed.
# File: MODULE.bazel
import_module_bazel(
    "MODULE.runtime.bazel",
    "MODULE.dev.bazel",
)

# File: MODULE.runtime.bazel
bazel_dep(name = "rules_jvm_external", version = "1.0.0")
import_module_bazel("MODULE.jvm.bazel")

# File: MODULE.jvm.bazel
jvm_deps = use_extension(...)
jvm_deps.from_file("jvm_deps.json")
use_repo(
    jvm_deps,
    # List of repositories declared by jvm_deps from jvm_deps.json
)

# File: MODULE.dev.bazel
bazel_dep(
    name = "gazelle",
    version = "0.29.0",
    dev_dependency = True,
)
The import of the files would work similarly to .bazelrc imports. In addition, it would be an error to import the same file multiple times.
With this approach, we do not introduce the baggage that load() might bring while allowing maintainers to organize their external dependency declarations as they wish.
One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel), and getting reviews for changes in these files can be prohibitively long or complicated.

With a workspace file, the answer is simple: segment using load and structure the source tree initially on a per-language basis. This broadens the pool of people who can review specific parts of the workspace file. Without a load option, there'd be no way of doing this using bzlmod.
One thing I'd suggest would be to allow load statements in MODULE.bazel files, but when uploading to a registry, the load statements are "unrolled" so we end up with a single file, as we have now. This allows people the flexibility to organise their source tree how they want, and also retains the simplicity of the existing model (it's similar to how search engines use spiders with JS enabled, but allow a very limited number of JS evaluations in order to get meaningful results).

We already have tooling for calculating the hashes in the BCR, so performing some more work when adding or updating a module doesn't seem unreasonable.

If this approach is taken, then it would be necessary for overridden modules to also use load statements.

This doesn't preclude us from providing better tooling for handling third-party deps.
@shs96c what do you think about using a local registry exposing custom modules+extensions to solve that problem?
For example: in a monorepo with both Go and Java, the local registry could expose Go dependencies via one custom module+extension and Java via another. On disk, these will be two different directories under /modules/{internal_go,internal_java} and thus you could apply CODEOWNERS / OWNERS controls appropriately.
Then, the MODULE.bazel file will effectively be
bazel_dep(name = "internal_java", version = "1.0")
maven = use_extension("@internal_java//:extensions.bzl", "maven")
maven.install()
bazel_dep(name = "internal_go", version = "1.0")
go_mod = use_extension("@internal_go//:extensions.bzl", "go_mod")
go_mod.install()
with the .bazelrc file being
common --registry=./local_registry
common --registry=https://bcr.bazel.build
and you would not have to edit it as much.
Just want to clarify some points that have been brought up so far:
There is the declaration of the language-specific dependencies. This seems to be well handled by the ruleset providing an extension tag to load the file.
Keep in mind that some languages lack a well-established format for dependency specifications. While Go, Python and Node have this, Java doesn't. Of course such a format could be invented, but that leads to the question "Why not just Starlark"?
Allow extensions to return a provider with the repositories that they provide.
This exists in the form of module_ctx.extension_metadata, but it only provides a warning with fixup commands; it can't straight-up replace a use_repo call unless we accept the cost of eagerly evaluating the extension for every build. When the lock file support matures, that may even be an option.
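For reference, a minimal sketch of that mechanism (module_ctx.extension_metadata and its root_module_direct_deps / root_module_direct_dev_deps parameters are the actual API; the extension body is illustrative):

def _deps_impl(module_ctx):
    # ... declare repositories here ...
    # Report which repos the root module should import; Bazel compares this
    # against the actual use_repo call and prints buildozer fixup commands.
    return module_ctx.extension_metadata(
        root_module_direct_deps = ["repo_a", "repo_b"],
        root_module_direct_dev_deps = [],
    )

deps = module_extension(implementation = _deps_impl)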
Update use_repos to accept a file that lists the repos (e.g. use_repo(go_deps, from_file = "go_repos.json")). A benefit to this approach is that the ruleset maintainers can provide tooling to generate the file for the client.
At least for the root module this could work, but I wonder whether this is really better than load. Starlark supports comments and trailing commas while being slightly less verbose than JSON. Rulesets could still generate the Starlark file, although I think module_ctx.extension_metadata is the way to do this. We could (and e.g. @tyler-french has already asked about this) consider having it output a file in addition to printing fixup commands to the console.
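For instance, a generated Starlark data file could be as simple as the following sketch (file and variable names invented), which is where the comment and trailing-comma support pays off:

# go_repos.bzl -- generated by the ruleset's tooling; comments are preserved.
GO_REPOS = [
    "com_github_pkg_errors",
    "org_golang_x_sys",  # trailing comma is valid Starlark
]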
import_module_bazel (or any other .bazelrc-style import mechanism)
This suffers from the issues explained in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595. A huge problem with WORKSPACE was its iterative nature, where any transitively called macro could have the side effect of registering a dependency. With Bzlmod, this would result in a hard error instead of a silent change of the dependency graph, which is better, but would still not result in strict decoupling.
If we restrict load in a way similar to https://github.com/bazelbuild/bazel/issues/17880#issuecomment-1574144457, we can ensure that statements with side effects can only be executed at the top level, that is, right in the MODULE.bazel file. Essentially, the loaded files will only provide data, not code.
unrolling load

Since MODULE.bazel doesn't rely on the "export hack", where the name of a variable is attached to the object it references, this may actually be feasible with a lot of name mangling. It would probably best live in buildozer. @vladmos Has the idea of inlining loads already been brought up at some point? Do you think that this would be feasible to implement for .bzl dialects that do not rely on the export hack?
custom registries per language
@sluongng How do you imagine the code you posted above to work with strict deps?
bazel_dep(name = "internal_go", version = "1.0")
go_mod = use_extension("@internal_go//:extensions.bzl", "go_mod")
go_mod.install()
internal_go would presumably use go_deps under the hood to declare the actual Go dependencies, but that would still require the root module to use_repo them. Since module extensions can't provide tags to other module extensions, I don't see what the go_mod extension would do in this example.
A custom registry is a great way to reduce the number of *_overrides in the MODULE.bazel file, but I am not sure it can provide the same benefit for extension-managed non-Bazel-aware dependencies.
How do you imagine the code you posted above to work with strict deps?
internal_go would presumably use go_deps under the hood to declare the actual Go dependencies, but that would still require the root module to use_repo them. Since module extensions can't provide tags to other modules, I don't see what the go_mod extension would do in this example.
I think I mistakenly assumed that extensions could call use_repo, hence my suggestion above.
I will have to think about this a bit more. But if use_repo is the only blocker, then perhaps we should just solve it there. For example, allowing something like use_repo(go_mod, use_all = True) to load all declarations in mctx.extension_metadata eagerly, with use_all = False by default to load things lazily instead. This is applicable especially to custom local modules, where you already have fine-grained control over what's going to be included within the module extension's logic.
Ideally, I think we should leverage a custom local registry as the go-to setup, which would enable large monorepos to have a sane third-party dependency management system. We simply should not invent a second "load" system where half of the third-party deps config is managed in an internal registry (local or remote) and half is managed in a bazel/build_rules/override dir somewhere in a monorepo. That is not ergonomic and is unfriendly to newer users. So I would favor approaches that add more features/functionality to the existing registry setup over repeating the load approach of WORKSPACE.
For example, allowing something like use_repo(go_mod, use_all = True) to load all declarations in mctx.extension_metadata eagerly with use_all = False by default to load things lazily instead.
This could work (except not with that syntax, since use_repo already accepts arbitrary keyword arguments). Its performance impact could be softened by Bzlmod lock files.
But if use_repo is the only blocker, then perhaps we should just solve it there.
I am not sure that it is though. @ashi009 brought up overrides of deps fetched by module extensions and @shs96c brought up rules_jvm_external's specifications of Java deps as other cases that aren't that usable yet. If anyone has other cases, please add them so that we know the full extent of the problem we need to solve.
So I would favor approaches that would add more features/functionalities to the existing registry setup over repeating the load approach of WORKSPACE.
I fully agree that we should never bring WORKSPACE's load semantics to Bzlmod. But loads that are free of side effects would already be much better than that.
This suffers from the issues explained in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595. A huge problem with WORKSPACE was its iterative nature, where any transitively called macro could have the side effect of registering a dependency. With Bzlmod, this would result in a hard error instead of a silent change of the dependency graph, which is better, but would still not result in strict decoupling.
@Wyverald @fmeum I am not convinced this is as hard to maintain as is suggested in the comment. Also, I don't think that the goal is strict decoupling. The goal is to organize the dependencies in a way that makes sense for the client. We should just focus on how to help the client identify conflicts and not shoot themselves in the foot.
For example, if two segments both declare a bazel_dep on the same module, we'd just error out since a module can't depend on the same module twice (without a multiple-version override, that is).
Failing is good. The trick is to be clear why it is failing and where the conflicting declaration resides.
One might be tempted to say "just pick the highest version"
I do not think that we should do this. A well-crafted error feels like the right result to me.
What if segment A declares bazel_dep(X@1.0, repo_name="foo"), and segment B declares bazel_dep(Y@1.0, repo_name="foo")?
This could happen in a large single MODULE.bazel file. We should provide an error that helps the client find the conflict.
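To make the desired failure mode concrete, here is a hypothetical pair of segments (file names invented) that should produce exactly such an error:

# File: MODULE.java.bazel
bazel_dep(name = "rules_x", version = "1.0", repo_name = "foo")

# File: MODULE.go.bazel
bazel_dep(name = "rules_y", version = "1.0", repo_name = "foo")

# Desired: an error pointing at both files, never a silent "pick one".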
There is indeed a parallel with import in .bazelrc files, but that statement is usually used to import user-defined RC files, instead of segmenting a long config file. There's much less of a need to import a "user-defined segment of MODULE.bazel directives".
I disagree. I do use multiple bazelrc files to organize the configuration and share it with child workspaces. However, I think that the real point is that the "import" concept would help clients organize their files while still maintaining the strict standards that already exist with Bazel modules.
One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel) and getting reviews for changes in these files can be prohibitively long or complicated.
With regard to @shs96c's point, in my experience with monorepos, the segmentation of external dependencies is almost always by language. If that is the primary case, there is little opportunity for code owners for different languages to conflict with their repository names.
With regard to using load() vs a new "import" function, I think that we would be best served not to special-case load() for MODULE.bazel files. The load() concept means something in Starlark, and that is not what is being suggested by introducing "import" into MODULE.bazel files. The end result of an "import" is a single contiguous file with declarations.
Is it worth separating the "reusable module" and "big monorepo workspace" into separate files?
It seems to me that the tension in trying to have a single place to configure both is what's making it difficult to decide on a direction.
You could imagine having a WORKSPACE file where the workspace() call at the top has some extra parameter indicating that a semantics halfway between "load anything anywhere" and "pre-declare everything" must be used. This would forbid directly using load() from WORKSPACE, but would replace it with some new functions that let the author more directly express what they want.
Rough sketch with placeholder names:
# WORKSPACE.bazel
workspace(
    name = "my_monorepo",
    only_module_deps = True,
)
# Load a file from the local workspace that contains a "workspace rule", which can
# be used for setting up the workspace's module hierarchy.
#
# The `use_repos` call is like bzlmod `use_repo`, but since enumerating all the
# external repo names would defeat the purpose, it just takes a single workspace
# rule to load externals for.
#
third_party = load_workspace_rule("//third_party:third_party.bzl", "third_party")
use_repos(third_party)
# alternatively, maybe just call it? I don't know which API the Bazel folks would prefer here.
third_party()
# third_party/third_party.bzl
load("//third_party/github.com/google/googletest:googletest.bzl", "googletest")
def _third_party(workspace_ctx):
    googletest(workspace_ctx)
    # ...
# `workspace_rule` is a bit like `module_extension`, allowing controlled definition of external repos.
#
# Since the `WORKSPACE` file is only evaluated in the root module, this functionality cannot leak
# into BCR modules.
third_party = workspace_rule(_third_party)
# third_party/github.com/google/googletest/googletest.bzl
def googletest(workspace_ctx):
    # works like `MODULE.bazel` global `bazel_dep` -- other module functions can also be provided
    # so that a multi-level dependency graph (rules_go -> gazelle -> Go third-party libs) can be expressed.
    workspace_ctx.bazel_dep(
        name = "googletest",
        version = "1.12.1",
        dev_dependency = True,
        repo_name = "com_google_googletest",
    )
My understanding is that extension_metadata together with buildozer automation as described in the proposal would solve the issue of maintaining large use_repo stanzas for third-party dependencies like Go modules or Maven dependencies. This approach keeps MODULE.bazel files declarative and avoids the kind of statefulness that WORKSPACE files have.
A use_all = True-like concept, in conjunction with lock files, could be a natural extension of this.
One thing to bear in mind: in large repos, it's not desirable for people to edit files in the top-level directory (such as MODULE.bazel) and getting reviews for changes in these files can be prohibitively long or complicated. With a workspace file, the answer is simple: segment using load and structure the source tree initially on a per-language basis.
I see the concern. At the surface segmentation as suggested in #14632 sounds like a good solution. However, I can understand that there may be significant complications when it comes to implementation. Though, as I understand the proposal, the idea is not for the segments to be independent. It's more about being able to assign clear ownership to sub-sections of MODULE.bazel and that is easiest to do if they can be in separate files. So, perhaps the obstacles are not that big.
I would imagine that a prototype of this approach could already be implemented today with a bit of automation to generate the composed MODULE.bazel file. Essentially, perform the suggested unrolling locally. Perhaps this could help identify if the approach works in practice or if it fails on the empty promise highlighted in https://github.com/bazelbuild/bazel/issues/14632#issuecomment-1230186595.
This is a great point. We could even provide some rules/macros to make this pretty seamless:

- a rule/macro that generates MODULE.bazel from one or more parts;
- a test that compares the checked-in MODULE.bazel and the generated one.

Did we agree on any way to move forward?
Fabian pointed me over here from a rules_python PR where we're trying to figure out how to let users express patching. e.g., given a transitive closure of stuff from PyPI, we generate a bunch of stuff and extract various downloaded things, and users want to patch different parts of the whole process and results.
The comments on our PR are long, so I'll try to summarize:

A central part of the issue is how quickly a MODULE file can grow in order to specify what to patch. A simple tag_class such as:
pip.patch(
    package = "numpy",
    version = "13.12.1",
    python_version = "3.10",
    platform = "darwin",
    ...
    patches = [...],
)
Quickly turns into a very large MODULE config. For Google, for example, we'd easily have hundreds of patches to apply. Any mono-repo is probably going to have similar problems; patching external dependencies for Bazel is very common. For a non-mono-repo situation, I would advise looking at Tensorflow: they have a workspace sharded into 4 sub-workspace bzl files.
Like others mentioned, the main issue, to me at least, is that managing it all in one file presents challenges for efficiently and safely reviewing. A simple textual include would go a long way (as unappealing as that sounds).
Re-reading this thread, it seems that most people are in favor of an import-like statement that allows segmentation of the MODULE.bazel file without introducing any "resolution" logic between the segments. This statement would also only be allowed in the root module (so no dependencies can have a MODULE.bazel with imports).

This is essentially a revival of @andyrinne12's original design, with the explicit decision to throw an error when two segments conflict in any way (multiple bazel_deps on the same module, same repo name used multiple times, etc.).
It sounds workable to me, but I wanted to explicitly point out the difficulties Andrei encountered when implementing the "merging the segments into one MODULE.bazel file" helper tool. As the MODULE.bazel file can contain many name bindings (e.g. maven = use_extension("...", "maven")), merging the files isn't as straightforward as a concatenation. In the design linked above, Andrei suggested that we could rename symbols to avoid conflicts (e.g. the second maven would be renamed to maven2), but that's IMO getting way too complicated.
So we could go ahead with this, but not promise the automatic merger tool, just to solve the monorepo use case. I also wonder, given that we're mostly addressing the monorepo use case, how feasible it is for people to try to have a pre-build step tool that they run themselves to assemble a MODULE.bazel file from multiple segments (just simple textual inclusion). Presumably updates to the MODULE.bazel file are infrequent enough that this shouldn't be too annoying?
To briefly address the other approach, which involves allowing a load that essentially only allows loading POJOs and "helper macros":

[+] it allows sharing plain data (e.g. a VERSION constant)
[-] it doesn't allow separating the use_extension calls into a different file
[-] it's even harder (maybe actually impossible) to provide an automatic merger tool

I'm interested to hear more thoughts about this approach too.
[-] it's even harder (maybe actually impossible) to provide an automatic merger tool
With both the load and the import implementation, an automatic merger tool could borrow the IIFE (Immediately Invoked Function Expression) trick from JavaScript bundlers, assuming we allow def usage in the combined MODULE.bazel. If we wanted to, we could allow def only in module files that begin with a magic comment such as # Automatically generated. Do not edit.
Example for load; import could look similar, just without the returns:
# lib.bzl
MY_VERSION = "1.2.3"

def add_foo_deps(foo):
    foo.tag()
    foo.other(tag)
# MODULE.bazel
load(":lib.bzl", "add_foo_deps", RENAMED_MY_VERSION = "MY_VERSION")
module(name = "my_module", version = RENAMED_MY_VERSION)
foo = use_extension("@foo2//:ext.bzl")
add_foo_deps(foo)
use_repo(foo, ...)
# MODULE.bazel.combined
def lib_func():
    MY_VERSION = "1.2.3"

    def add_foo_deps(foo):
        foo.tag()
        foo.other(tag)

    return struct(
        MY_VERSION = MY_VERSION,
        add_foo_deps = add_foo_deps,
    )
lib_exports = lib_func()
module(name = "my_module", version = lib_exports.MY_VERSION)
foo = use_extension("@foo2//:ext.bzl")
lib_exports.add_foo_deps(foo)
use_repo(foo, ...)
Using the buildtools Go package to perform this at the AST rather than the text-content level should make this both quite doable and reliable.
So we could go ahead with this, but not promise the automatic merger tool, just to solve the monorepo use case. I also wonder, given that we're mostly addressing the monorepo use case, how feasible it is for people to try to have a pre-build step tool that they run themselves to assemble a MODULE.bazel file from multiple segments (just simple textual inclusion). Presumably updates to the MODULE.bazel file are infrequent enough that this shouldn't be too annoying?
If a project such as Tensorflow, which should at some point become a BCR module, already reaches the scale where segmentation would help, then we should either have a merge tool from day 1 or at least have a somewhat concrete plan in mind for how to write one if it becomes necessary. Textual inclusion is very brittle and letting each user figure this out by themselves sounds wasteful to me. It would probably also result in cases where issues with custom tooling are attributed to Bazel itself.
[-] it doesn't allow separating the use_extension calls into a different file
I see that as both a pro and a con: you know all the types of deps that your project relies on just by looking at the root MODULE.bazel file. With import, we would need to extend tooling such as Gazelle and buildozer to traverse the full import tree when searching for or modifying a bazel_dep or use_repo. But this is definitely doable, so it's not a blocker, just something to keep in mind.
Adding to my points above, I think that the biggest drawback of the load approach is that it can't deal with use_repos well:

- If we don't allow use_repo in loaded files and also don't allow positional argument expansion, then adding a new module extension-managed dependency still requires a change to MODULE.bazel, which can be a challenge in monorepos with strong code owner protection.

- If we do allow use_repo in loaded files or allow positional argument expansion, then, while the use_repo call itself or the repo list can be extracted into a Starlark constant and loaded from a separate file, we lose the automatic use_repo fixup capability.
I'm thus also leaning towards import now.
If we wanted to, we could allow def only in module files that begin with a magic comment such as # Automatically generated. Do not edit.
Even if we were to allow this (I'm not convinced), this is only the beginning of our troubles. Loaded .bzl files can load further .bzls, and theoretically you could be "including" a .bzl file multiple times with your proposed solution. (Consider a MODULE.bazel that loads from a.bzl and b.bzl, both of which load from c.bzl.) Not to mention the resulting merged MODULE.bazel file is pretty hard to read, with magical re-exports and what not.
For a non-mono-repo situation, I would advise looking at Tensorflow: they have a workspace sharded into 4 sub-workspace bzl files.
If a project such as Tensorflow, which should at some point become a BCR module, already reaches the scale where segmentation would help, [...]
IIUC TF needs to segment their WORKSPACE file because, well, it's a WORKSPACE file -- you need to call a macro that instantiates group A of repos --> load from group A --> call macros in group A that instantiate group B --> load from group B --> rinse and repeat. IOW the 4-part sharding is done due to technical limitations, not (just) for ease of management.
I'm not (yet) convinced that TF's prospective MODULE.bazel file would be unmanageable unless sharded.
For TF, it's the top-level part that caught my attention; it appears to be nothing more than sharding 1000+ lines into 4 separate files for the A->B->C->D dependency graph height/depth.
https://github.com/tensorflow/tensorflow/blob/master/WORKSPACE#L74
IOW the 4-part sharding is done due to technical limitations, not (just) for ease of management.
As I understand workspace, the sharding shouldn't affect the semantic behavior? i.e., you should be able to remove the sharding and inline it into WORKSPACE directly, so long as the particular order of loads/statements is the same. The result would be about +1000 lines to the main WORKSPACE file, but I don't see why it wouldn't work.
I'm not (yet) convinced that TF's prospective MODULE.bazel file would be unmanageable unless sharded.
Maybe sketch out what a translation to MODULE would look like? (maybe you already have?) workspace2.bzl seems to be the big one. I don't necessarily disagree with you; I could see it going either way. Superficially, most of the TF workspace config looks like it'd be simple bazel_dep() lines. But, there's also a lot of little things in there. Like almost everything has a custom mirror url, there's about 40+ patches, and they don't use plain http_file, but a custom wrapper with some extra args that do IDK what. And we know that some "design patterns" for WORKSPACE macros didn't have a clean, direct, translation to bzlmod.
As I understand workspace, the sharding shouldn't affect the semantic behavior? i.e., you should be able to remove the sharding and inline it into WORKSPACE directly, so long as the particular order of loads/statements is the same. The result would be about +1000 lines to the main WORKSPACE file, but I don't see why it wouldn't work.
Well yes. My assumption was that those 4 macros are what you need to put in your own WORKSPACE if you're a project with a direct dep on TensorFlow, though that assumption may not be accurate since a brief search of their docs didn't yield any instructions on "how to use TF as a Bazel project dependency". Maybe they never intended TF to be used like that? In which case our whole debate is moot, since it'd be a monorepo use case. But anyway, that assumption is why I said the 4-part sharding is due to technical limitations (for example, it cannot be just 1 shard).
Following up on this: we briefly discussed this during the BoF session at BazelCon, and the general sentiment was that this was still desired functionality, but mostly for top-level projects (that is, projects not intended to be depended upon by others; also big monorepos at companies). So we'll likely be doing what this comment says (import statements for MODULE.bazel files that are illegal for non-root modules). Tentatively targeting 7.1 or 7.2.
This sounds great! I want to emphasize a couple of things that we discussed at BazelCon:
Thanks again! cc @linzhp
@bazel-io fork 7.1.0
A fix for this issue has been included in Bazel 7.2.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.2.0rc1. Thanks!
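For readers arriving here later, a minimal sketch of the shipped feature, assuming my reading of the release is right: the functionality landed as an include() directive that is only allowed in the root module, with segment files conventionally named *.MODULE.bazel.

# File: MODULE.bazel
include("//java:java.MODULE.bazel")

# File: java/java.MODULE.bazel
bazel_dep(name = "rules_jvm_external", version = "6.0")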
Description of the feature request:

Currently, the MODULE.bazel file must contain everything needed by the project within it, and cannot call load statements to help make it easier to navigate.

One situation in which it would be helpful to use load is when configuring external package managers, e.g. like with rules_jvm_external. On a Java project it is not unusual to depend on several external jars. Putting that configuration in MODULE.bazel can be considered as polluting the file and makes it more difficult to maintain.

This was originally opened in https://github.com/bazelbuild/rules_jvm_external/issues/854 and that issue provides further details on a specific example with rules_jvm_external.

Having the ability to use load to load constants or macros from other files would help us maintain a more organized MODULE.bazel.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?

6.0.0