bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.67k stars 3.97k forks source link

Support internal dependencies that are resolved on-disk via another tool #16380

Open vdye opened 1 year ago

vdye commented 1 year ago

Description of the feature request:

Add "filesystem adapter" API(s) so that when a source file isn't found on-disk during the loading phase, Bazel can fall back on a user-specified function to vivify the file (if it exists) rather than throwing an error.

This request comes from a desire to have Bazel work in a Git sparse-checkout environment without pre-specifying patterns (see below), but the feature probably shouldn't be scoped to just Git; for example, Mercurial has a similar sparse checkout feature. I'd be happy to see a feature like this be even more general purpose if there are other cases it could support!

What underlying problem are you trying to solve with this feature?

Internal dependencies in Bazel are implicitly required to be on-disk when performing Bazel operations (query, build, etc.) that involve loading the associated source files. However, in a Git sparse-checkout environment, source files are not present in the worktree (i.e., where Bazel is looking for them) unless vivified with git sparse-checkout add. Even when not on-disk, though, these source files are more consistent with internal dependencies (fixed to the same "version" as the rest of the repo, not a separate pre-built package, are checked out to source directories) than external ones.

Right now, the only way to work in a sparse-checkout with Bazel is to specify the files that a particular build will need before running bazel build. A user has to maintain two independent sources of truth for which files are needed in a build: the Bazel dependency graph, and the sparse-checkout patterns. If the not-on-disk sparse-checkout files can reasonably be interpreted as "internal dependencies" without running contrary to Bazel's underlying principles, it would be extremely helpful to users if they could resolve the files "just-in-time" while Bazel builds its internal dependency graph.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

Within Bazel itself, it looks like the only other reference to sparse-checkout is a feature request to include a static pattern list in the specification of a git_repository(). That request doesn't quite match this one, since 1) that deals with an external repository, rather than the local one, and 2) it still requires pre-determined patterns, rather than resolving files just-in-time.

Outside of Bazel, third-party Bazel/sparse-checkout interoperability tools are pretty common, especially among large monorepo projects (1, 2). However, these tools still rely on a fully on-disk copy of a Git repository to generate the Bazel dependency graph, severely inhibiting the potential performance gains of git sparse-checkout. And, because these integrations use a tool separate from either git or bazel, they are potentially more difficult to adopt than an integration contained in a Bazel rule library.

Any other information, logs, or outputs that you want to share?

I know this is a pretty large feature request (with lots of details to consider: UX, performance, thread safety, etc.), so if this request doesn't seem completely infeasible, I'm happy to write & submit a more detailed design doc!

lberki commented 1 year ago

To manage expectations here, it would take very good arguments for this bug to be fixed (and the associated design doc to be approved)

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together and this would be both a pretty big change from that approach and a pretty large chunk of functionality for us to support indefinitely that is, in light of that approach, redundant.

@meteorcloudy WDYT? I could imagine considering merging this if there was a widespread need, but for one use case, I'd be reluctant to take on the increased mental burden and support load.

meteorcloudy commented 1 year ago

I agree with your judgment, but I think this is more of a decision to make for the core team.

lberki commented 1 year ago

Then @haxorz WDYT? I thought @meteorcloudy is appropriate because he has all the state about external dependencies in hi head, which is (partially) relevant.

vdye commented 1 year ago

Hello! I'd like to add a bit of context around why I submitted this, what I'm looking for, etc. w.r.t your comments.

For some background: while I'm new to Bazel, I contribute pretty actively on Git (especially git sparse-checkout and the associated sparse index feature) and am generally interested in Git's scalability. Sparse-checkout is great for large repo performance, but since Git isn't a dependency manager (relevant Git mailing list discussion), file inclusion patterns need to be specified manually. Using Bazel with sparse-checkout is pretty common in large monorepos - they both lend themselves to the performance needs of such repos - so using the Bazel dependency graph to construct sparse-checkout patterns is somewhat common.

I spent some time over the past year interviewing monorepo maintainers to learn what their biggest pain points were in Git, and Bazel came up in nearly every conversation. I learned that 1) everyone seems to build their own custom, internal tool to integrate Bazel and sparse-checkout with no standardized approach, and 2) even with those tools, sparse-checkout performance in those repos was substantially worse[^1] than it should be because of the need to have files on disk before running bazel build. So, my goal with this proposal is to find a way to address both of those issues.

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together

I did see some projects related to this approach (namely sandboxfs) but ultimately leaned away from that sort of integration. Historically, FUSE hasn't been a feasible option for interacting with Git on the scale of repos that need the performance gains of sparse-checkout due to speed, scalability, and/or ease-of-use. Microsoft rolled their own Git-aware VFS with VFS for Git with a greater focus on perf, but that had platform compatibility issues and is now in maintenance mode as a result.

this would be both a pretty big change from that approach and a pretty large chunk of functionality for us to support indefinitely that is, in light of that approach, redundant.

That's an understandable concern, and it's frankly the main reason I spent months evaluating alternative approaches before submitting this issue. But the overhead (both computational and mental) from forcing the use of a VFS layer makes it impractical as a solution in my eyes.

I could imagine considering merging this if there was a widespread need, but for one use case, I'd be reluctant to take on the increased mental burden and support load.

Multiple large projects that use Bazel have built entire engineering systems around coordinating it with Git (there were three talks at Git Merge this year about them! 1, 2, 3). While it might seem like a limited use-case, it's an increasingly common - and critical - one.

Ideally, solving this problem by extending Bazel would make it easier for projects to standardize on a common open-source approach to integrating with Git. A well-designed implementation should minimize the support load added Bazel, while also removing the distributed mental/support burden for all of those other projects' custom tools. Plus, it would lower the barrier to entry for using a performance-optimized Git + Bazel setup for the projects that don't have a dedicated engineering tooling team.

Anyway, thanks for the feedback and I hope this context helps!

[^1]: The two main causes for the performance degradation are the tracking of files on-disk and the use of a full, rather than sparse, index. For example: testing on my local machine with a repo of ~250k files, the timing of git reset --hard was: 3.45s with all 250k files on disk, 0.77s with a full index but only ~17k files on disk, and 0.34s for ~17k files + sparse index enabled. As repositories get larger, the performance impact increases; this blog post does a good job of explaining what's going on under the hood.

lberki commented 1 year ago

Wow, that's an impressive amount of research!

Does the sentence "Bazel came up in nearly every conversation" imply that a majority of the folks who have huge git repositories use Bazel?

My heart is saying that we should have some sort of functionality like this in Bazel, but my head is very reluctant for the reasons already elaborated above. I was about to ask who exactly this would benefit, but you YouTube links more than adequately answer this.

What I will do now is to ask around at Google:

  1. If there are any folks who would benefit from this functionality (then it's a bit easier sell, although the same questions about possible alternatives apply)
  2. If anyone knows of any alternatives

The two alternatives that I know if is a FUSE-based approach we discussed above and a recursive bazel query like bgsc does and what the Twitter folks have.

For the latter, I am somewhat skeptical of the claim that bazel query is slow; at Google, we are using it on pretty large code bases and as long as one can keep the Bazel server alive, it's quite snappy. Do you have a way of proving or disproving these claims?

haxorz commented 1 year ago

[@lberki ] Wow, that's an impressive amount of research!

+1. @vdye, thank you for the incredibly thorough and thoughful FR!

@michaeledgar has a lot of prior experience with sparse checkout in both Git and Mercurial and he's coincidentally on team-Core, so I'd like him to take a look and give his technical opinion.

[@vdye ] ... A well-designed implementation should minimize the support load added Bazel...

+1 to this. Ideally we can come up with something that doesn't involve intrusive changes to Skyframe (Bazel's core execution and incrementality engine). I fear this might be challenging though. But my fear is partially based on ignorance of the details of spare checkouts, so maybe Mike can think of a concrete ~API that is both tasteful and feasible to support.

I'd also like to emphasize the inverse statement: If this FR is to be reified as something very tightly coupled to Git, I don't think team-Core is the right owner (as the lead of team-Core, I wouldn't feel comfortable maintaining it). Also [, pretending I were Bazel's Product Manager, ] I don't think we should add such tight coupling to specific VCS.

lberki commented 1 year ago

Without taking sides, @vdye 's design doc is not coupled to git, so on that side, we're good.

I thought a bit about how this could look like in Skyframe and I think I could come up with a way to make this as unintrusive as possible to Skyframe (essentially: make PackageFunction and BzlLoadFunction fetch the files they care about, or maybe PackageLookupFunction), if git's sparse checkout API is such that it can be done efficiently.

The issues I immediately see are:

lberki commented 1 year ago

cc @maxious @arxanas @dbernadett @linzhp

maxious commented 1 year ago

👋 One of those monorepo maintainers that wrote their own sparse checkout on bazel chiming in.

I am vaguely aware of the google virtual file system (srcfs?) and the way you can implement extended file attributes for hashes. Unfortunately the place where we would want to use this most (developer laptops on macos) has been very shaky in terms of FUSE support/UX in recent years (eg. https://github.com/bazelbuild/sandboxfs/issues/46 ) and even cloud storage providers struggle with the lack of lazy loading with native MacOS remote file system support Recently there has emerged the approach of a local NFS server in place of FUSE which I note Buildbarn has adopted so maybe there's a path forward soon?

All that being said, you still need some service (Content Addressable Storage?) network accessible somewhere and this is easier said than done. Securing and scaling these services takes resources and we already have to consider remote caching and remote execution with bazel in our ecosystem - adding a FUSE file system backend to maintain would be another challenge at less-than-google scale. Hence why I would be supportive of this proposal - in fact we have contributed support for a similar concept to bazel-diff "Allow user to provide their own content hash to minimise IO operation" In our case git would probably make sense as the source of truth - bonus that the hashes are stored locally! Git doesn't have to be the only option by far - the interface between bazel and the filesystem adapter API can be much smaller than having to implement a whole FUSE/NFS implementation that has to be a general purpose filesystem. This would make it easier for the community and vendors to provide practical solutions

sluongng commented 1 year ago

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together

I did see some projects related to this approach (namely sandboxfs) but ultimately leaned away from that sort of integration. Historically, FUSE hasn't been a feasible option for interacting with Git on the scale of repos that need the performance gains of sparse-checkout due to speed, scalability, and/or ease-of-use. Microsoft rolled their own Git-aware VFS with VFS for Git with a greater focus on perf, but that had platform compatibility issues and is now in maintenance mode as a result.

Hi @vdye,

Big fan of your work over the Git Dev mailing list!

I think https://github.com/bazelbuild/proposals/pull/277 is quite solid and I am in favor of it personally.

However, I would like to note that FUSE usage with Bazel does provide a bit more than just "lazy file loading", which I believe to be the core thesis of git-sparse-checkout and partial-clone(filtered clone) today Perhaps this video from 2sigma folks should help provide some additional insight.

https://www.youtube.com/watch?v=rQd9Zd1ONOw

In short, Bazel relies heavily on the hash of input files in the source tree. Using these hash, Bazel builds up a Merkle tree of dependency graph and relies on it to schedule different build actions. Traditionally, to obtain the hash of files in the source tree, you would need to download the blob to ... hash the content. However, using FUSE, the server operator could provide the pre-compute hash of the file blob at check-in time (or push-received time). This is then provided to Bazel client side via getxattr() call over the FUSE mount. This allows Bazel to obtain the hash of input files without having to download the entire file, saving a lot of network traffic. See #11662 and build-in-cloud-distributing-build-steps (2011) for more information on this.

There are 2 problems trying to re-construct this using git-sparse-checkout today:

  1. In --cone mode, which is now the default mode, all files in an included directory are downloaded. Ideally, we should be able to specify git to only download the BUILD.bazel files and create some customization in Bazel to look into the current commit's tree objects to find the remaining files and their hashes. This could be done with a lot of customization, but definitely is not easy and require Bazel to know a lot more about git and sparse-checkout.
  2. Current default hash algorithm in Bazel is SHA256 with configs to move up to SHA512 in the future. Today, git is still using SHA1 as the default hash algorithm. The initial support for SHA256 does exist in git, but lack of migration/transition support for existing repositories and support from popular hosting software/vendors such as Github, Gitlab, Bitbucket etc... Having the version control system and the build system using 2 different hash algorithms make the integration between the 2 systems a lot harder than it should.

Edit: I wrote this last night and forgot to hit send. I see that @maxious has raised similar concerns in the post above mine 🤝

newren commented 1 year ago

I'm not (currently) a bazel user, and am coming more from the Git side (I'm one of the other Git developers working on sparse-checkout capabilities), but I'd like to throw out another idea for folks to consider:

Add "filesystem adapter" API(s) so that when a source file isn't found on-disk during the loading phase, Bazel can fall back on a user-specified function to vivify the file (if it exists) rather than throwing an error.

Idea:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

Reasons:

One thing I was disappointed in with the Git Merge monorepo talks this year was that folks were vivifying these files (or using an additional checkout) rather than allowing the build system to be provided the file contents directly. Granted, vivifying files is what I did for our first cut at using sparse checkouts as well, but we found it rather suboptimal. Providing contents directly to the build system has multiple advantages:

(1) The working directory is kept nice and tidy both for the user and for IDEs -- either of which can get confused by the extra files and directories appearing (perhaps transiently). (2) Providing dependency content directly to the build systems works much more nicely with sparse checkouts: cone mode or the sparse index often mean that vivifying one file means vivifying all files within or underneath a directory, recursively. That's wasted effort when we just want a small control file with dependency information. (3) Providing dependency info directly to the build system reduces I/O (no need to round-trip the dependency information through the filesystem, in addition to no need to write all the extra files in the last item).

I will note that you can work around at least the first issue by e.g. keeping a parallel copy of the checkout maintained; this requires a little cleverness to avoid the performance issues of a full second copy (which is doable), but much more importantly you're introducing the risk of the parallel copy being wrong due to not having local changes made by the user in their working copy (or users going and messing with that copy). Providing the appropriate file contents directly to the build system avoids this problem as well as the above.

I've implemented the contents-directly-to-build-system idea I suggested above. (To get the contents of files just-in-time, for SKIP_WORKTREE files I make use of git cat-file --batch and feed the results to the build system). It has worked great for us, but we are not using bazel so there's not much I can share beyond the high-level experience. Luckily for us, the code changes needed were fairly localized, but it's not clear whether the same would be true for bazel. But, if the API to allow folks to vivify build control files in a just-in-time manner is not much more work than allowing folks to just provide bazel with build control file contents in a just-in-time-manner, then it'd be nice to support the latter too; it makes things much nicer.

vdye commented 1 year ago

Thanks everyone for the commentary/discussion!

@newren: I noted in the full proposal that most Bazel rules (like cc_binary()) rely on having source files on disk, and it would be a massive breaking change to force all of them to accept in-memory source files (some likely couldn't do that even if they wanted to). But what you're suggesting makes a distinction between the "build files" (BUILD, WORKSPACE, *.bzl) needed to construct the dependency graph and "source files" (*.cpp, etc.) used to build the outputs of the project. Separating the needs of those two groups presents some really interesting possibilities!

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk. With those two assumptions in mind, only core Bazel (not individual custom rules) would need to be able to to access build file content from somewhere other than the filesystem (e.g. using git cat-file). With that capability, the complete dependency graph could be built, cached, and queried without either 1) vivifying more files than are needed, or 2) encountering a "package not found" error. The results of bazel query should be all a user needs to create the sparse patterns; a sparse-checkout could then be set up with either an external script or custom Bazel rules/macros (using genquery() to get the dependency graph, then a custom rule to apply the sparse-checkout patterns(?)).

It's probably still more appropriate to add this functionality via an extension API vs. putting git cat-file invocations into core Bazel, but overall I think I prefer it over my original approach. Outside of the infrastructure for a new API & contexts, it's much narrower in scope; these changes should be (mostly) isolated to PackageLookupFunction/PackageFunction/PathPackageLocator(???), and dealing with only build files (rather than the entire source tree) alleviates a lot of my concerns with locking in git sparse-checkout & thread safety. That said, there'd be some trickiness to how the build files are retrieved from Git (to avoid expanding a sparse index more than necessary), and the potential issues @lberki raised earlier also still apply. I need to ponder those a bit more, but my gut feeling is that they can be handled without any invasive architectural changes.

@lberki @michaeledgar (or anyone, really) - what do you think? I'd like to start working on an implementation, but don't want to get too far with it in case there's a much easier approach that hasn't come up yet.

arxanas commented 1 year ago

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk.

I think there's a problem when globbing source files, since then the rules depend on the presence of source files on disk (even if not their contents)?

newren commented 1 year ago

But what you're suggesting makes a distinction between the "build files" (BUILD, WORKSPACE, *.bzl) needed to construct the dependency graph and "source files" (*.cpp, etc.) used to build the outputs of the project. Separating the needs of those two groups presents some really interesting possibilities!

Yes, absolutely. It only made sense to handle the "build files" that special way; the "source files" were absolutely required to be on disk for the other build system too.

One thing to note, though, is that our build system was okay with just providing {module/package}-level dependencies, which was what we needed to get the directories to specify for the sparse-checkout configuration. If we weren't using cone mode and were attempting to exclude all unnecessary files, or if the system had been hardcoded to only determine file-level dependencies (which is often determined by globbing all source files under a directory), then our changes to the build system would have needed to be more invasive. I think it's a bad idea to try to force sparse-checkouts into the minimum list of needed files and much prefer the idea of the minimum set of packages, but the question of package-level vs file-level dependencies might come up so I thought I'd note it.

It's probably still more appropriate to add this functionality via an extension API vs. putting git cat-file invocations into core Bazel

:+1:

newren commented 1 year ago

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk.

I think there's a problem when globbing source files, since then the rules depend on the presence of source files on disk (even if not their contents)?

I think this goes back to the question I also noted above: is it possible to determine package-level dependencies without determining all file-level dependencies? We only need the former for sparse checkouts.

lberki commented 1 year ago

Update: we talked with @linzhp in person at BazelCon and I offered him the option of building a bazel query operator which could be used to do the following:

  1. One identifies the top-level targets one wants to build
  2. One calls bazel query magicdeps(:target, 1) to identify the packages that need to be on disk to be able to load the direct dependencies of :target
  3. One checks out these packages using git sparse checkout
  4. One calls bazel querymagicdeps(:target, 2)to identify the packages that need to be on disk to be able to load the second-level dependencies of:target`
  5. And so on, until a fixed point is reached.

I don't think this is doable with bazel query now, but it's certainly possible to build (maybe not with bazel query)

This system could be used to do sparse checkouts with Bazel without deep integration like what @vdye proposed. The user experience would not be as nice, but, it would widen the interface of Bazel much less than that proposal and it would be a much smaller ongoing maintenance burden.

Would this be an option?

vdye commented 1 year ago

Would this be an option?

This approach is almost identical to one of the ones I considered before opening this issue ("create a mode/option allowing bazel query to print out dependencies that are declared by a resolved target, but cannot be resolved themselves"). I decided against continuing down that path because it loses one of the core tenets of this proposal, which is that a user shouldn't need external tooling (in this case, a script to call bazel query magicdeps(...) & sparse-checkout iteratively) to manage the coordination between Bazel and Git. What you're suggesting may be a valuable feature, but it doesn't solve the developer workflow/ease-of-use problem.

In the meantime, I've been working on a draft implementation based on the alternative approach suggested by Elijah earlier. With respect to maintenance burden, I haven't had to do much in the way of refactoring or changing core infrastructure; the virtual file resolution is limited to a new SkyFunction + some conditionals in PackageLookupFunction and PackageFunction, plus some boilerplate-y Starlark builtin setup for dependency_adapter. I've been gradually whittling down the scope in the process (not using an adapter when loading packages in external repositories, the implementation is expected to resolve virtual symlinks if there are any, no setup/teardown functions, etc.). I'd like to at least get to a basic end-to-end working demo/draft PR so that it can be evaluated concretely, probably in the next couple weeks.

lberki commented 1 year ago

@vdye : yeah, it's not a great idea I had, it's a preference of mine between two of your ideas :) I do realize that it would not be as nice a user experience, but it would be much easier to support indefinitely. (Thus spake the Master Programmer: "Though a program be but three lines long, someday it will have to be maintained.")

I'll resist the temptation to pass judgement on the approach you have taken because it's much easier to have an informed opinion when I see the code. From what I can glean from your above comment, it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files
  2. Globs need to do readdir() and readlink()
  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)
lberki commented 1 year ago

To add to my previous comment: I can see how piggybacking on PackageLookupFunction and shelling out to an external binary (in some form) would be a solution. It has a number of issues, though, that are non-trivial to solve:

  1. How do you cache the non-existence of packages? When Bazel sees (for example) //a:b/c, it has to verify that a/b/BUILD does not exist or if one says bazel build //nonexistent:package, Bazel must learn somehow that nonexistent/BUILD and nonexistent/BUILD.bazel do not exist and a file system where that does not exist (but Bazel tried) looks the exact same as a file system where Bazel hasn't even tried to access that directory.
  2. What happens on a Bazel server restart? The issues in (1) are technically not a problem as long as the Bazel server is alive because Bazel caches the existence (or non-existence) of a package in RAM, but if you to bazel shutdown (or the server dies for any other reason), that information is lost.
  3. Symbolic links also need to be thought about, but it appears that you have a plan for them?
  4. What happens if you have two Bazel instances on the same output tree (if you pass --output_base to Bazel, you can do that and people do)
  5. How do you control parallelism? If you make PackageLookupFunction slow, it's more likely that there will be many of those in flight at any given moment, so this needs to be thought about even if the solution is "document that the invoked script must be able to run in multiple instances at the same time"
  6. What happens if one says filegroup(srcs=[<directory>]) and that directory happens to contain a BUILD file somewhere under it? Bazel then currently just puts everything under the directory (crossing BUILD files) on the inputs of actions in this case. Granted, this use case is one that Bazel does not support very well at the moment and in fact leads to incorrect behavior but we want to support it eventually in some form and it would be nice if this feature didn't add to the cost of that one. Like in the parallelism case, I'd be fine with any quick solution that doesn't make the existing problem any worse. Hell, I could be even convinced to ignore this one because we've been ignoring it for almost a decade.
vdye commented 1 year ago

@lberki

yeah, it's not a great idea I had, it's a preference of mine between two of your ideas

I wasn't looking to take credit, just clarify that I'd deeply considered the pros/cons of what you suggested and note the conclusion I had reached.

it would be much easier to support indefinitely.

I detailed earlier why third-party tools are insufficient from a usability perspective. Tools that do what you’re suggesting already exist (most similarly, bgsc), but this proposal is explicitly meant to move beyond their limitations. While Google's workflow may not have a use for it, there are plenty of other Bazel users (including some of the people in this thread!) with completely different infrastructure & workflows that would.

You’ve mentioned maintainability a number of times now, but it would be extremely helpful for me to know what you would quantify as “too complex” or “too invasive.” Are there specific things I need to avoid doing in my implementation that have burned the Bazel team in the past? I’d be much more comfortable working within clear guidelines than leaving the “go/no go” call up to (from my perspective) chance.

it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Per Elijah's comment:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

So it's not vivifying the files, it's reading the files directly from the Git index into memory for Bazel to compile (via git cat-file). As I have it now (which, keep in mind, is a work-in-progress), PackageFunction sends a request to PackageLookupFunction, but denotes that the BUILD file may be virtual (via a new, default-false field in the SkyKey). If it is, and PackageLookupFunction does not find a particular BUILD file on-disk, it sends a request to the new VirtualBuildFileFunction, which first loads & checks for the existence of the dependency_adapter(), then invokes the implementation function, then compiles the returned BUILD file contents and stores the CompiledBuildFile in a SkyValue. PackageLookupFunction uses those results to say "hey this package does exist (but it's virtual)!", then PackageFunction uses the CompiledBuildFile from the result in place of reading from disk & compiling.

Note that this approach has two limitations: it can only run successfully in commands that don't touch the non-BUILD source files (so, bazel query), and it doesn't perform the on-disk source vivification. The user workflow would either be to pipe the results of of the appropriate bazel query to git sparse-checkout set, or write a parallel chain of rules (genquery() to get the dependencies, then genrule() to invoke sparse-checkout). Importantly, both avoid the need for an external script.

Like I said, though, all of this is still WIP, so nothing is set in stone. For example, one thing I'd like to try is to skip the request to VirtualBuildFileFunction altogether by default, and only enable dependency_adapter usage at all with a feature flag.

I currently have a functioning end-to-end MVP, but it almost certainly breaks outside of the “happy path” example I’m working with. I’m planning on spending a week (not next, but the one after) cleaning that up, then will open a draft pull request so people can look it over.

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

This approach would likely require hardcoding Git logic into Bazel (which I don't think anyone here wants) and the FileSystem layer is responsible for all filesystem accesses, so any integration would need to distinguish between in-repo file reads & everything else. I think that'd be far more invasive and complicated than anything I'm proposing here.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it. Using your example, a cone mode repository like that would either intend for a, a/b, and a/b/c to always be needed together (and therefore checked out together), or would rework the directory structure to avoid what’s not needed.

Note that the dependency adapter proposal doesn’t mandate one of cone mode or non-cone mode (although it is designed with cone mode use cases in mind). A Git adapter would pull BUILD file information directly from the index without touching sparse-checkout patterns. There are some minor details about how to manage expanding a sparse index, but they’re very easily overcomeable.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files

  2. Globs need to do readdir() and readlink()

  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)

Some of these (EDIT: and a few of the additional ones you mentioned in your most recent comment) aren't supported by my prototype in its current state, but 1) I intend to at least evaluate how much work they’d be and adjust my design accordingly, and 2) if they’re non-trivial to support, document why. Ultimately, I don't think it's unreasonable to say "the dependency_adapter() won't (yet) work if you're also using xyz feature," as long as that's balanced against supporting enough features to be generally useful.

I personally find it most beneficial to introduce and incrementally expand a capability over time, rather than to wait just to make sure it's compatible with every possible workflow. That's especially true here, as the feature is opt-in and, if a user tries something and it doesn’t work, that can inform where incremental improvements could be made. In the meantime, users could either adjust their build infrastructure to fit the restrictions, or continue working without the feature at all.

maxious commented 1 year ago

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it.

I'd echo this sentiment. We already enforce conventions against patterns in Bazel (and other tools) that are incompatible with sparse checkouts.

lberki commented 1 year ago

@lberki

yeah, it's not a great idea I had, it's a preference of mine between two of your ideas

I wasn't looking to take credit, just clarify that I'd deeply considered the pros/cons of what you suggested and note the conclusion I had reached.

Neither did I so I thought it's better to clarify. It's all good!

it would be much easier to support indefinitely.

I detailed earlier why third-party tools are insufficient from a usability perspective. Tools that do what you’re suggesting already exist (most similarly, bgsc), but this proposal is explicitly meant to move beyond their limitations. While Google's workflow may not have a use for it, there are plenty of other Bazel users (including some of the people in this thread!) with completely different infrastructure & workflows that would.

You’ve mentioned maintainability a number of times now, but it would be extremely helpful for me to know what you would quantify as “too complex” or “too invasive.” Are there specific things I need to avoid doing in my implementation that have burned the Bazel team in the past? I’d be much more comfortable working within clear guidelines than leaving the “go/no go” call up to (from my perspective) chance.

I can't think of an objective measure, unfortunately. If I had, I would have already informed you about it; what I'm looking for is some combination of:

Does this help?

it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Per Elijah's comment:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

So it's not vivifying the files, it's reading the files directly from the Git index into memory for Bazel to compile (via git cat-file). As I have it now (which, keep in mind, is a work-in-progress), PackageFunction sends a request to PackageLookupFunction, but denotes that the BUILD file may be virtual (via a new, default-false field in the SkyKey). If it is, and PackageLookupFunction does not find a particular BUILD file on-disk, it sends a request to the new VirtualBuildFileFunction, which first loads & checks for the existence of the dependency_adapter(), then invokes the implementation function, then compiles the returned BUILD file contents and stores the CompiledBuildFile in a SkyValue. PackageLookupFunction uses those results to say "hey this package does exist (but it's virtual)!", then PackageFunction uses the CompiledBuildFile from the result in place of reading from disk & compiling.

Note that this approach has two limitations: it can only run successfully in commands that don't touch the non-BUILD source files (so, bazel query), and it doesn't perform the on-disk source vivification. The user workflow would either be to pipe the results of of the appropriate bazel query to git sparse-checkout set, or write a parallel chain of rules (genquery() to get the dependencies, then genrule() to invoke sparse-checkout). Importantly, both avoid the need for an external script.

Like I said, though, all of this is still WIP, so nothing is set in stone. For example, one thing I'd like to try is to skip the request to VirtualBuildFileFunction altogether by default, and only enable dependency_adapter usage at all with a feature flag.

I currently have a functioning end-to-end MVP, but it almost certainly breaks outside of the “happy path” example I’m working with. I’m planning on spending a week (not next, but the one after) cleaning that up, then will open a draft pull request so people can look it over.

Ah, got it; I misunderstood that, but now I am confused: you do have a point that the magicdeps() approach I offered is not a very great user experience because then running bazel build is not enough to actually build software because then the wrapper using magicdeps() also need to be called, but AFAIU this approach is only usable for bazel query, so it makes the arguably most important user workflow (bazel build) more complicated, too, by requiring a wrapper to check out the actual source files anyway.

Even limiting ourselves to bazel query, can this approach ever support globs? Globs need to call readdir() during package loading.

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

This approach would likely require hardcoding Git logic into Bazel (which I don't think anyone here wants) and the FileSystem layer is responsible for all filesystem accesses, so any integration would need to distinguish between in-repo file reads & everything else. I think that'd be far more invasive and complicated than anything I'm proposing here.

I agree, but if we don't want the vivifying approach, I'm not sure if the "plumb the bits directly from git to Bazel" approach is workable at all (globs + the source files are eventually necessary for the build) without dropping down to the FileSystem layer.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it. Using your example, a cone mode repository like that would either intend for a, a/b, and a/b/c to always be needed together (and therefore checked out together), or would rework the directory structure to avoid what’s not needed.

Note that the dependency adapter proposal doesn’t mandate one of cone mode or non-cone mode (although it is designed with cone mode use cases in mind). A Git adapter would pull BUILD file information directly from the index without touching sparse-checkout patterns. There are some minor details about how to manage expanding a sparse index, but they’re very easily overcomeable.

I understand that; what I wanted to know is whether I can expect non-cone mode to be supported in git in the long term because if not, we'd have to just accept that a/b will always be checked out if a is. From the documentation, it seemed that sparse checkouts are still experimental and non-cone mode is doubly experimental and it looked like non-cone mode is necessary to support the above a but not a/b use case. At least @newren 's comment implies that not everyone arranges their source trees in such a way that if a is needed, a/b also is.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files
  2. Globs need to do readdir() and readlink()
  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)

Some of these (EDIT: and a few of the additional ones you mentioned in your most recent comment) aren't supported by my prototype in its current state, but 1) I intend to at least evaluate how much work they’d be and adjust my design accordingly, and 2) if they’re non-trivial to support, document why. Ultimately, I don't think it's unreasonable to say "the dependency_adapter() won't (yet) work if you're also using xyz feature," as long as that's balanced against supporting enough features to be generally useful.

I personally find it most beneficial to introduce and incrementally expand a capability over time, rather than to wait just to make sure it's compatible with every possible workflow. That's especially true here, as the feature is opt-in and, if a user tries something and it doesn’t work, that can inform where incremental improvements could be made. In the meantime, users could either adjust their build infrastructure to fit the restrictions, or continue working without the feature at all.

I promise I won't hold you to impossible standards :) What I'm looking for here is not a perfect implementation, only an implementation that has a reasonable chance of eventually supporting every use case. As long as that is true, it can always be put behind a --experimental_* flag.

Comparing various alternatives against the above simple interface / simple implementation / no surprising interactions standard:

lberki commented 1 year ago

@vdye this thread is now long enough that I think a video call would be more productive. Which time zone do you live in?

I'm at BazelCon this week and my calendar is completely full, but we could arrange a call next week, which hopefully let us come to an agreement earlier. Until then, I guess a centithread it is?

manuelnaranjo commented 1 year ago

With all the talks in BazelCon around wrappers built on top of bazelisk and the plugin model they provide, couldn't we just use the magic method approach and do all the git dancing in there? That way we can all benefit from it while keeping the internals clean. Last year when we looked into partial cloning at Booking we found similar issues, having something like this available could really benefit us tbh

vdye commented 1 year ago

With all the talks in BazelCon around wrappers built on top of bazelisk and the plugin model they provide, couldn't we just use the magic method approach and do all the git dancing in there? That way we can all benefit from it while keeping the internals clean. Last year when we looked into partial cloning at Booking we found similar issues, having something like this available could really benefit us tbh

Given that bazelisk is an "official" wrapper for Bazel, this might address some of my concerns about third party tooling (namely, ease of adoption & standardization). I haven't looked into the plugin model there, but it doesn't sound too complicated? I can try to quickly put something together to prove it out. Thanks @manuelnaranjo!

EDIT: actually, I have no idea where to start with bazelisk (I'm not seeing any official documentation on a plugin model). Does someone have a link to either an example or docs?

@lberki I'm not around next week, but I am the week after (time zone UTC−08:00). That said, I'd like to try this bazelisk idea out first, so we might be able to wait on setting up a call?

manuelnaranjo commented 1 year ago

@vdye they're built on top, they mentioned 2 in the conference https://github.com/aspect-build/aspect-cli and https://github.com/buddy-works/buddy-cli

vdye commented 1 year ago

@vdye they're built on top, they mentioned 2 in the conference https://github.com/aspect-build/aspect-cli and https://github.com/buddy-works/buddy-cli

Hmmm, I must have misinterpreted then. I assumed the "plugin model they support" was referring to bazelisk supporting plugins, rather than the wrappers on top of it. That brings us back to "you need to manually install some third-party tool (in this case, the wrapper) from outside the Bazel or Git ecosystems," and all the drawbacks associated with that.

One thing I wanted to mention, since the bazelisk idea reminded me of it: if there was a way to use Bazel's existing extensibility APIs (custom rules or macros or some other kind of plugin) to define a magicdeps() iteration, I'd be all for doing this that way. But AFAICT that isn't possible (someone please correct me if I'm wrong), which is how I ended up with a proposal to add a new API.

linzhp commented 1 year ago

@manuelnaranjo From user experience's perspective, the magicdeps() approach is not better than the current approach we have at Uber (see my Git Merge talk). After each rebase/merge from the main branch, users still have to run a third tool to refresh their sparse work tree so all deps are checked out. The differences are:

  1. magicdeps() avoids maintaining a full checkout, which would save some disk space
  2. Because we don't need a full checkout, we don't need to replicate state from the sparse work tree, which can save some time
  3. Every time users want to refresh, the tool would need to run bazel query and git sparse-checkout set many times, depending on the number of layers in the deps.

3 could quickly eat up the time saved from 2. It's still nice to have, though.

Currently, there is another issue with magicdeps() approach: if any bzl file loaded by WORKSPACE is missing, bazel query would fail, even the bzl file is not needed by the target pattern given to bazel query. Not sure how hard it is to fix this.

lberki commented 1 year ago

Yeah, I am aware of the limitations here and that it currently cannot be done with bazel query. It also has the problem that currently bazel query can't tell without erroring out which .bzl files are required by a BUILD file. But a magicdeps() version that works sounds like a reasonable and well-delineated addition to Bazel.

re: @vdye 's comment that this would then require a third-party tool outside of the Bazel and git ecosystem, I thought that there was no way around that: I don't want to embed knowledge about git into Bazel, you presumably don't want to embed knowledge about Bazel into git, that knowledge must live somewhere so a third place is the only option, isn't it?

@newren : what is suboptimal about the vivifying approach? If magicdeps() doesn't work, that one looks like a reasonably small feature which could be made to work (modulo symlinks), unlike the FileSystem approach (which as @vdye said, would be quite an undertaking to implement) and the git cat-file approach which AFAICT is very difficult to implement correctly, if possible at all.

michaeledgar commented 1 year ago

I've caught up with the entire thread! I want to try to get to everything --

WORKSPACE can't load because not all transitive .bzl files are present

I don't think we can make do with an invalid workspace - I would treat the WORKSPACE contents and everything it references as "toolchain" source that simply must be present for Bazel to behave in a predictable manner.

I can followup and check if we produce BEP that could be used to determine which .bzl file was missing. Then you could maybe write a wrapper script that fetches the required repo based on the BEP output (the magicdeps() loop idea).

Build/test target can't load because not all transitive .bzl files are present

We currently need all the transitive .bzl files to be available, even if some of them are not relevant for the build. (For example, a BUILD file containing a foo_library(), foo_binary(), and foo_test(). Even if you only ask to build //foo:foo_bin, Bazel will load the foo_test symbol from a foo_test.bzl.) We should fail with an error in BEP - I'll followup on that too.

It's not true today, but it's conceivable that we could be tolerant to the failure to load rule definitions not in the transitive deps of the requested build targets. Typically it's easier to just partition unrelated rule definitions into parallel directory trees.

How to download fewer bytes

Google's monorepo tooling is famously based on virtual filesystems where tools may assume all files are available, and the filesystem lazily-loads the source content of unmodified files as needed. This is the way we download the fewest .bzl files during loading at scale, so that's the path of least resistance.

As has been discussed, to run bazel build //myteam:foo_bin the build tool may need to do filesystem operations outside of the myteam directory. It may read .bzl files, check for BUILD existence.

There's two broad categories of solutions here: FUSE and non-FUSE.

FUSE Solution

I'm not familiar with any non-Googley FUSE filesystems. But. If there is a read-only FUSE filesystem for your VCS, I think the following approach is essentially what Git users at Google used for many years. You use the VCS to manage the files you want checked out, and wrapper scripts maintain a symlink to a read-only FUSE filesystem that provides a consistent baseline snapshot of all other files. This solution looks like this:

  1. At boot, mount a read-only FUSE filesystem for the repo at /company/src, such that some path (eg. /company/src/$VERSION) yields the root tree of the repo at $VERSION.
  2. Change monorepo tooling to maintain a symlink at your workspace's root called REPO that points to /company/src/$VERSION.
  3. Add the REPO symlink to --package_path in your site blazerc:
    $ readlink REPO
    /company/src/a2604cdcce22f1337b582dfc093851d2015c0c15
    $ cat .bazelrc
    build --package_path=%workspace%:%workspace%/REPO
    clean --package_path=%workspace%:%workspace%/REPO
    fetch --package_path=%workspace%:%workspace%/REPO
    info --package_path=%workspace%:%workspace%/REPO
    query --package_path=%workspace%:%workspace%/REPO

Any BUILD/bzl files not found in the workspace will be retried under /company/src/a2604cdcce22f1337b582dfc093851d2015c0c15.

If you're patching together a bunch of repos, you can do this with multiple symlinks to different mount points, though they will be checked sequentially, so loading performance may degrade if the additional failed FUSE filesystem reads are slow.

I'm not 100% sure if we're incrementally correct with changes to the REPO symlink - it would likely depend also on details of the FUSE filesystem, eg. mtimes.

Custom FileSystem paths carved out for non-FUSE, on-demand reads

In this scenario, a new Bazel FileSystem is introduced that can resolve files on-demand (somehow) without implementing an OS-level filesystem (ala FUSE). The version being queried might be in-band in the path or out-of-band in a flag. This would primarily be useful for host environments where FUSE is not available.

For example, the same symlink pattern would be used as before, only with fake absolute paths carved out by new configuration settings:

$ ln -s /virtual/git/github.com/bazelbuild/bazel/<sha> REPO
$ grep git_virtual_repo WORKSPACE
git_virtual_repo(prefix = "/virtual/git/")  # errors if /virtual/ exists to encourage unique prefixes
$ grep REPO .bazelrc | head -n 1
build --package_path=%workspace%:%workspace%/REPO

This would require some Git logic in Bazel, but the basic Git HTTP protocol isn't so bad to be honest. I am a Mercurial user myself and would want to support Mercurial as well. Really it should be able to read files 1-1 from any HTTP object source, eg. S3-compatible storage, as long as the source snapshot is consistent. It sounds like a WORKSPACE feature, and it would be vulnerable to scope creep to capture basically all the ideas discussed that I've otherwise not mentioned. I'm not sure I totally buy into it just yet, it would be a heavy lift, but if there's a lot of hosts that can't use FUSE, maybe we should consider it.

newren commented 1 year ago

@lberki : I'll try to answer your questions as best I can, and provide my experience and perspective on why I find certain things important or unimportant...

I understand that; what I wanted to know is whether I can expect non-cone mode to be supported in git in the long term because if not, we'd have to just accept that a/b will always be checked out if a is. From the documentation, it seemed that sparse checkouts are still experimental and non-cone mode is doubly experimental and it looked like non-cone mode is necessary to support the above a but not a/b use case. At least @newren 's comment implies that not everyone arranges their source trees in such a way that if a is needed, a/b also is.

non-cone mode came first, but it is deprecated: https://git.kernel.org/pub/scm/git/git.git/commit/?id=a8defed07c

It is true that some people still use non-cone mode. However, it has lots of problems (which is why we deprecated it), we think it cannot pragmatically be supported beyond its original basic feature set, and we've pointed out that it has a bunch of gotchas even on the original basic feature set that cannot be fixed. We've already added new features (such as the sparse-index) which are incompatible with non-cone mode.

If we as Git developers say that non-cone mode is broken and isn't worth supporting (or maybe isn't possible to support) for new features, why should a project like bazel try in cases where it's non-trivial for them to do so?

But, even more importantly:

I think it's actively a bad idea to try worry about file-level partial checkouts even if you could easily implement it; doing so is likely so computationally expensive that, given how often it must be done, it would defeat the whole point of the exercise: speed. The point of letting users work on a subset of the repository is so that common operations are faster. Computing fine-grained full tree file-level dependencies is likely not fast.

  • I expect that the git cat-file approach would to one of these ways: it is either a bad user experience because it breaks some Bazel features (globs, bazel build, etc.) or its implementation would be quite unwieldy

You seem to be presuming that full file-level dependencies must be implemented for this feature, or this feature should not implemented at all. I'm a bit confused by that. We don't want full file-level dependencies; we only want package-level dependencies.

@newren : what is suboptimal about the vivifying approach? If magicdeps() doesn't work, that one looks like a reasonably small feature which could be made to work (modulo symlinks), unlike the FileSystem approach (which as @vdye said, would be quite an undertaking to implement) and the git cat-file approach which AFAICT is very difficult to implement correctly, if possible at all.

The point of the feature being asked is to allow vivifying (or unsparsifying) the necessary packages, but only those. If you vivify other things (even if only a few files from other packages), then you run the risk that users, IDEs, and/or scripts can get confused. (As a couple examples, users may ask why they have unnecessary directories, or why "git grep" is turning up hits in places they weren't working on and "didn't want checked out"; IDEs may attempt to auto-build stuff that isn't part of the user's focus.) However, if it's only a few files from otherwise unwanted packages, you may still achieve the overall goal of allowing users to work with a subset of their repo quickly, so you might have a workable even if suboptimal solution.

In contrast, if determining package dependencies requires first checking out everything (i.e. to unsparsify completely), then sparse-checkouts end up costing more than they save. unsparsifying and resparsifying is a very expensive operation. And we need to potentially re-determine dependencies with every merge/rebase/reset/switch-branch/etc. (If any of the build control files -- .bzl files I think in bazel's case, .metaconf cases in the case I was dealing with -- are newer than $GIT_DIR/info/sparse-checkout, then the dependencies could have changed and we need to determine the new package-level dependencies and update the sparse-checkout patterns appropriately). In fact, we did have some scripts that only a few users had to use, which required the unsparsify-then-run-script-logic-then-resparsify dance. Those users all ended up abandoning sparse-checkouts unless we found a way to remove the need to unsparsify -- usually by either rewriting the logic in it, or moving it to CI systems so they didn't need to run it locally.

But it's not just unsparsifying and resparsifying that can be expensive. I'm also a bit worried about the dependency calculation step used to update the sparse checkout patterns. Since we have to perform that check often (with potentially every merge/rebase/reset/switch-branch/etc), it was important in our case to make sure we only computed package dependencies and not full file dependencies. The latter would have been so excruciatingly slow that I think it would have been a deal breaker. And from what I saw at the Git Merge conference, those using bazel are working on even bigger monorepos than we were, so I would suspect the problem would be even bigger for them.

newren commented 1 year ago
  • How to download fewer bytes

While downloading fewer bytes is important for many folks, there are monorepo folks who have a performance profile where a dense clone is a reasonable cost but they really want sparse-checkouts. Not everyone is Google/Microsoft/Facebook-sized.

Build/test target can't load because not all transitive .bzl files are present

We currently need all the transitive .bzl files to be available, even if some of them are not relevant for the build. [...] It's not true today, but it's conceivable that we could be tolerant to the failure to load rule definitions not in the transitive deps of the requested build targets. Typically it's easier to just partition unrelated rule definitions into parallel directory trees. [...]

Custom FileSystem paths carved out for non-FUSE, on-demand reads

In this scenario, a new Bazel FileSystem is introduced that can resolve files on-demand (somehow) without implementing an OS-level filesystem (ala FUSE).

I get the feeling that the following two ideas are foreign to bazel:

Is that correct? Would introducing those two ideas be too high an impedance mis-match with how bazel works?

All that said, I'm not working on this feature, Victoria is. I'm only giving context from my role as a Git developer, and as someone on a team who implemented this idea for a different (internal) build system.

haxorz commented 1 year ago

I've been asked to assign a Priority label for this issue. Given that the design discussion is still ongoing, I think P4 is the most appropriate right now. But don't let that label be commentary on this very interesting discussion :)

vdye commented 1 year ago

After a lot of consideration, I’m going to pause my efforts towards getting upstream acceptance of this functionality. I’d love it if the Bazel team could this issue open for people to find and lend their perspectives but, at the moment, we seem to have reached an impasse on the necessity and approach to this feature.

For anyone still interested, I plan/hope to keep working on an adapter API in my fork, and I’ll keep an eye on this thread in case there’s something valuable I can add to the discussion. Thanks!

lberki commented 1 year ago

@newren : you are wrong on both counts :) It's quite natural for Bazel to build a subset of the code base; it's just that currently, determining the particular subset one needs to build anything particular in the code base is more complicated than it should be. That's something that would be much easier to implement than this issue either with @vdye 's design or something else.

As for determining what part of the code base is needed to build "X and Y", it's called bazel query 'deps(X + Y)', it's just that it doesn't work very well if the full code base is not there (it could be made work and people did, but it's more complicated than it should be)

@vdye : understood and acknowledged, with some sadness. It feels like there is a nice design wanting to come into existence, but between the simplicity of implementing easier dependency discovery, FUSE (or FUSE-by-NFS, as seems to be in fashion on Mac OS these days) which limit the potential benefits and the difficulty of coming up with an interface that lets you do what you want without excessive collateral damage to Bazel, I think this is the right decision, at least for the time being.

I'll then close this issue and move the design doc to "dropped" once I'm back to my usual terminal.

lberki commented 1 year ago

https://github.com/bazelbuild/proposals/pull/281 sent out.

lberki commented 1 year ago

It looks like Facebook is also going the FUSE route:

https://github.com/facebook/sapling/blob/main/eden/fs/docs/Overview.md

lberki commented 1 year ago

Reopening due to @vdye's ask above (sorry, I skimmed over that sentence)

As I said above, I do think there is a way forward and I do think that there is an unmet need, it's just that the upsides are limited because the alternatives work to some degree.

sluongng commented 1 year ago

Here are my takes on this:

Recap

I think there is a spectrum of repositories sizes, and thus there is a spectrum of solutions for Bazel to consider enabling those use cases.

image

In my mind, the red line above is the distribution of repositories that are and going to use Bazel today, or in the near future. Among these, there are 3 size categories to consider:

  1. The small repos: these don't need much help as everything simply just work

  2. The enterprise monorepos: these start reaching the size of Linux Kernel (biggest public git repo) and there are a lot of pain in hosting them and integrating them with Bazel.

  3. The big tech monorepo: these are so large they run on customize infrastructure, one may consider AOSP to be among this category (with repo tool stitching multiple git sub-modules) but there is also Meta's and google3.

We know that for (3), FUSE is much needed. But behind FUSE is a colossal infrastructure setup: client, server, cloud workspace, custom vcs integration. The cost of ownership for a FUSE solution is huge. It will come down in the future as more companies reach that scale and if there are known open source solutions. But for non-big-tech company, the cost of running such a solution is way to high today.

For (2), what we can observed from Canva, Twitter and Uber in Git Merge 2022 conference is that git-sparse-checkout is a good convergent point. To get here, git-sparse-checkout has received a ton of love in recent years from Microsoft and Github folks. However, to make this work seamlessly with Bazel is still a long way to go.

What to do from here

I think in recent Git contributor submit, @jrn from Google's git team (who supporting AOSP) has expressed interest in building a "git-aware filesystem" VFS of sort: http://public-inbox.org/git/YzXwUEre3PP22FDU@nand.local/. Given that AOSP is also adopting Bazel, I wonder if there will be an open-source FUSE file system which could integrate with Bazel anytime soon? https://github.com/google/slothfs is quite stale and seems unused currently. cc: @nasamuffin may be you could help comment on this.

If there is a mature FUSE solution equipped with Bazel + Git integration to come out soon, I would suggest we go ahead and close this issue in hope that we could drive the FUSE adoption among all users.

If there isn't, I think git-aware Bazel improvements might be much needed, specifically toward supporting git-sparse-checkout use cases. Victoria proposal (and Newren's suggestions) would be a great first step toward this direction.

michaeledgar commented 1 year ago

I want to be clear, I absolutely plan to prototype something here in the coming weeks! I agree that the user experience of "also install a read-only FUSE filesystem" just isn't adequate. I'm sorry for the long period of silence on this issue: I neglected to mention that I'd be honeymooning for most of December, a twice-postponed trip originally booked before COVID.

The solutions I offered in my previous post were simply meant as stopgaps that work today, if there is any filesystem (or combined set of filesystems) that provides read-only access to the monolithic source. I'll spend some time trying out existing open-source git filesystems with Bazel and observing what happens in practice, and doing some performance testing. I don't expect the eventual write-up to satisfy many: installing a whole filesystem as step 1.5 to use Bazel is simply asking a lot, even if it's a recommended filesystem.

For everybody else who does not have a complete read-only filesystem for their monorepo, there are a few improvements that can be made.

If Bazel were aware that the workspace were incomplete and knows the git repo and sparse-checkout configuration, it can read content on-demand from git ("read-vcs-on-demand"). This is true, but there is a limiting principle: it's implementing another userspace filesystem. Any code written for Bazel's that looks at all like a gitfs will become a gitfs, and will be worse than all the other gitfs out there because it will be the newest. So it should punch above its weight in benefit when time can also be spent integrating with existing filesystems.

eric-skydio commented 1 year ago

Based on the conversation so far, it sounds like the easiest integration point in Bazel is using --package_path referencing both the sparse checkout of the repository and an underlying source with a read-only view of the entire worktree (even the parts that aren't checked out). As observed earlier, this even works today if the second package path references a FUSE filesystem, but the available FUSE filesystems for git repos are slow especially on non-Linux machines and require lots of system configuration (root, lifecycle management, etc.).

Since the FileSystem class in Bazel is extensible and Google already has other implementations, how hard would it be to create a small FileSystem implementation that speaks a simple protocol over stdin/stdout with a subprocess, then extend the --package_path flag to be able to intermix this new SubprocessFileSystem with disk-backed paths?

--package_path=%workspace%:subprocess//git bazel-vfs --work-tree="%workspace%"

In this case, the git bazel-vfs command could either be distributed with git or installed as a separate program, but wouldn't need root and could be easily made cross-platform. Because Bazel is responsible for starting and stopping it (like a persistent worker) managing the installation and lifecycle of it should be vastly simpler than a FUSE filesystem. Of course, users who do want a complicated persistent service speaking gRPC can write a script that converts to that.

Unlike a disk-backed package_path, the vfs-backed package_path obviously can't be used for executing actions locally, but it should be possible to support all analysis time features including evaluation of globs and maybe even uploading files for remote execution. That means the user doesn't need to add all their transitive dependencies to their sparse checkout, just the files they actually intend to change.

maxious commented 1 year ago

Quick note: git in recent versions has the ability to spawn daemons with interprocess communication for file system watching functionality https://git-scm.com/docs/api-simple-ipc https://git-scm.com/docs/git-fsmonitor--daemon so presumably git bazel-vfs could instead be generalised to something like git cat-file --stdin-streaming-of-file-names

eric-skydio commented 1 year ago

While the proposed command is conceptually similar to git cat-file --batch, my guess is that the protocol will need to support a few different operations, not just map file paths to file content.

Rather than put all that git-specific logic in Bazel, I'd prefer if the protocol were something close to what Bazel wants. I don't have strong opinions about whether the git side is implemented as another flag on cat-file, a new subcommand, or a wrapper that itself invokes cat-file --batch.

lberki commented 1 year ago

I don't know why the existing git / virtual file system bridges are slow, in particular, if they are slow due to the inherent limitations of git, any particular operating system or the implementation of the bridges themselves.

I would much like not to rely on --package_path if we can help because it causes a lot of complexity of Bazel and I'd much rather it go away and thus I'd like not to build critical functionality on top of it.

I think the most feasible approach is some variant of @vdye 's "Bazel pokes at some external tool that materializes packages as it needs them" idea, modulo the issues outlined in https://github.com/bazelbuild/bazel/issues/16380#issuecomment-1287328339 .

eric-skydio commented 1 year ago

I would much like not to rely on --package_path if we can help because it causes a lot of complexity of Bazel and I'd much rather it go away and thus I'd like not to build critical functionality on top of it.

If multi-element --package_path is going to go away, it would also work to either

Those both have some performance impact since every read needs to hit both places, instead of just locating a package, but it would probably still be less than the win from being able to get hashes directly from the VFS implementation.

More fundamentally, I challenge the idea that users actually want to include all the transitive dependencies of their code in the sparse checkout. There are some dependency patterns where your code might technically depend on a large number of other packages at analysis time that in practice you're not actually interested in modifying, and may not want to pay the cost of having on-disk. Using a FileSystem backed by git directly means Bazel can work with whatever sparse checkout the user wants to use (as long as they aren't running local execution actions), potentially including the null checkout for CI workers.

lberki commented 1 year ago

I unfortunately can't state that --package_path is going away, but I'd prefer not to build more functionality on top of it if I can help; it causes quite a lot of complications within Bazel without a lot of benefit so I'd like it to go away, but there are some legacy use cases at Google that are difficult to replace and thus I'm afraid I'll have to live with it for a good while.

I would like to avoid entangling Bazel with source control; sure, we could do a FileSystem implementation that handles git, but then we'd have to get engaged with all the intricacies of git and would eventually have to support other source control systems and since source control is not a core business of Bazel, I'd much rather delegate it to somewhere else behind a conveniently narrow interface.

The current answer to "conveniently narrow interface" for reading sources is FUSE despite all its problems. I do realize that that is getting less and less tenable, but there isn't an easy answer :( The best I can think of is in my previous comment and that's not very satisfying, either.

sluongng commented 1 year ago

https://lore.kernel.org/git/CAJoAoZ=Cig_kLocxKGax31sU7Xe4==BGzC__Bg2_pr7krNq6MA@mail.gmail.com/ should be interesting to the folks here.

Google's Git team wrote about their experience trying to make a VFS on top of git.

lberki commented 1 year ago

Thanks for the link, @sluongng :)