Multiplatform output paths are safe, correct, and efficient

gregestren commented 5 years ago

Tracking issue on Bazel Configurability Roadmap

By "multiplatform" I mean any scenario where two different rules in the same build build with different settings. This also includes non-platform settings like app version, but "multiplatform" is a concise term to capture the essence.

Long story short, bazel-out/$(cpu)-$(compilation_mode)/... doesn't work well for multiplatform builds:

Unrelated actions can inadvertently write to the same output path (correctness issue)
cpu is redundant for cpu-agnostic actions (efficiency issue: switching up the CPU shouldn't require re-executing these actions: see https://github.com/bazelbuild/bazel/issues/6527)
Actions that depend on flags that aren't CPU or compilation mode write to the same path when those flags change (correctness issue)
All the above can destroy remote execution efficiency
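
To make the first and third points concrete, here is a minimal Python sketch of the collision. It is only an illustration: `legacy_output_dir` and the `app_version` setting are hypothetical names, and the real path logic lives inside Bazel.

```python
def legacy_output_dir(config):
    """Builds a bazel-out/<cpu>-<compilation_mode> prefix, ignoring every other setting."""
    return f"bazel-out/{config['cpu']}-{config['compilation_mode']}"


# Two configurations that differ only in a non-platform setting (hypothetical).
config_a = {"cpu": "k8", "compilation_mode": "fastbuild", "app_version": "1"}
config_b = {"cpu": "k8", "compilation_mode": "fastbuild", "app_version": "2"}

path_a = legacy_output_dir(config_a) + "/mypkg/myoutput"
path_b = legacy_output_dir(config_b) + "/mypkg/myoutput"

# Different settings, same path: the correctness issue described above.
collides = config_a != config_b and path_a == path_b
```

Any setting other than the two that feed the path is invisible to it, so both builds write to the same location.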

This issue tracks the long and complicated effort of making a better output path syntax. Expect the next deliverable on this to be a design doc.

gregestren commented 5 years ago

2018 EOY update:

See https://github.com/bazelbuild/bazel/issues/6527#issuecomment-458357722.

More updates coming Q1'2019.

gregestren commented 5 years ago

April '19 update:

Detailed plans at Experimental Content-Based Output Paths (please comment!).

Goal is to get an --experimental prototype available this summer that automatically caches multiplatform Java compilation.

gregestren commented 4 years ago

P1 issue review: still relevant, still very much intend to explore this but I simply am unable to put time into it at the moment. Hoping I can pick this up next quarter.

gregestren commented 4 years ago

This is being up-prioritized with about 1 dev's full-time commitment over the next 3 months.

michaelmartak commented 3 years ago

Is there an escalation path at Google (i.e., someone we could reach out to) that could help align on business priorities?

gregestren commented 3 years ago

Write to me as a technical contact (gregce@bazel.build) explaining your needs as best you can. I'm happy to chat technical concerns and CC in folks who can help with business priorities.

gregestren commented 3 years ago

To elaborate on https://github.com/bazelbuild/bazel/issues/6526#issuecomment-658927627,

I believe the generic solution described in this issue and https://github.com/bazelbuild/bazel/issues/6526#issuecomment-488103473 is nuanced enough that it'd have to go through a long experimental phase before we could consider productionizing parts.

I'd still like to get to that phase because it would still let interested folks opt in, explore, and help evolve its path.

But we're also trying to explore if there are more limited variations we could hack out more quickly while avoiding the deeper design issues. That's going to be the focus of the current up-prioritization. We have an idea of something (hopefully) quick and dirty that could approximate a lot of this, probably with a small code injection into the remote executor client. I'll continue to follow up here.

Speaking of, is anyone interested in this and not using remote execution?

keith commented 3 years ago

Speaking of, is anyone interested in this and not using remote execution?

We are. We're in a setup where we have macOS dev machines and Linux CI machines for Android builds. We're hoping to use remote exec at some point but atm we're only using a remote cache, which we're thinking this might help with since right now the 2 platforms don't share cache hits

gregestren commented 3 years ago

Acknowledged, thanks.

plaird commented 3 years ago

Exact same with @keith for my team's use case except plain old Java->jars, not Android.

gregestren commented 3 years ago

I'm sorry I haven't updated this for a while. Quick update is I recently experimented with a limited form of this as suggested at https://github.com/bazelbuild/bazel/issues/6526#issuecomment-665836057. Initial results look promising.

I want to do another test over a sample project (maybe Bazel itself?) to verify the results. Then I need to look at injection points, since Bazel has different APIs for delegating to local and remote executors and this change is likely to live in the implementation layer.

I'm spending a good chunk of this week doing the above. As always, please ping (or reach out to me directly) if you're wondering what's up in between updates.

ulrfa commented 3 years ago

Thanks for update @gregestren! Your work in this area is very much appreciated!

Would you like to elaborate about the scope of the "limited form" of #6526? Which use cases do you expect it to support, and which not?

My interpretation is that a complete, generic, production-quality solution to #6526 is still the final goal, but realistically more than a year away. Is that correct? Would you dare to make a very rough time estimate?

Again, thank you for all your effort in this area. It is very important for us not to explode the executor workload when using transitions for our C/C++ applications, in examples like: https://groups.google.com/g/bazel-discuss/c/zVEc7gzbyu0

gregestren commented 3 years ago

@ulrfa sure!

The generic approach I outlined in https://github.com/bazelbuild/bazel/issues/6526#issuecomment-488103473 tries to balance a variety of needs, including the need that the paths the executor sees are identical to what appears in Bazel's final output tree. That makes actions that write manifests or debug symbol paths safe.

If we drop that requirement, that opens up a much simpler algorithm: strip the config-specific info completely from the paths before shipping them to the executor, then add it back when writing them to Bazel's output tree. So bazel-out/x86-fastbuild-someconfighash/mypkg/myoutput gets staged as bazel-out/mypkg/myoutput, cache-checked on the executor accordingly, executed, and rewritten back to its original path when done.
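
A rough Python sketch of that strip-and-restore round trip, assuming the config segment is always the first directory under bazel-out. The helper names are made up for illustration and are not Bazel APIs.

```python
def strip_config(path):
    """Splits 'bazel-out/<config>/<rest>' into ('bazel-out/<rest>', '<config>')."""
    root, config, rest = path.split("/", 2)
    if root != "bazel-out":
        raise ValueError(f"not an output path: {path}")
    return f"{root}/{rest}", config


def restore_config(stripped, config):
    """Puts the config segment back, giving the path used in Bazel's output tree."""
    root, rest = stripped.split("/", 1)
    return f"{root}/{config}/{rest}"


original = "bazel-out/x86-fastbuild-someconfighash/mypkg/myoutput"
staged, config = strip_config(original)   # what the executor would see
restored = restore_config(staged, config)  # what lands in the output tree
```

The executor only ever sees `staged`, so two configurations that produce the same action look identical to its cache.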

That exposes the risks from my first paragraph. But not every action has that risk. Lots of actions truly don't care what their input or output paths look like. So this new approach would introduce criteria for which actions are "safe" in this regard and rewrite paths for safe actions. We could presumably start with a small and conservative safety set, then expand as we vet more actions.
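
One way to picture the "safe" criteria is an allowlist keyed on action mnemonic. This is an assumption for illustration only: the contents of `SAFE_MNEMONICS` and the `executor_path` helper are hypothetical, not how Bazel actually decides.

```python
# Hypothetical, deliberately small starting set; expand as more actions are vetted.
SAFE_MNEMONICS = {"Javac"}


def executor_path(mnemonic, path):
    """Strips the config segment only for actions considered path-insensitive."""
    if mnemonic not in SAFE_MNEMONICS:
        return path  # unvetted actions keep their original paths
    root, _config, rest = path.split("/", 2)
    return f"{root}/{rest}"


java_path = executor_path("Javac", "bazel-out/x86-fastbuild/pkg/libfoo.jar")
cpp_path = executor_path("CppCompile", "bazel-out/x86-fastbuild/pkg/foo.o")
```

Starting conservative means an action that embeds paths in its outputs is never rewritten by accident.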

Java actions I think are particularly good candidates for this. C++ has the extra challenge of debug mode symbol paths. But that's only a certain subset of C++ actions. Not all of them.

For https://groups.google.com/g/bazel-discuss/c/zVEc7gzbyu0, another complementary idea is "trimming" - if it's really only the binary that consumes the flag, we could simply remove that flag from configurations in its dependencies. I already have a tool we could conceptually use to make this happen. But it'd require preprocessing: every time a BUILD file changes you'd have to rerun that tool to annotate the BUILD rules. A 100% automatic approach would be ideal.
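
The trimming idea can be sketched in Python as dropping every setting a target does not consume. The `trim_config` helper and the `//app:feature` flag are hypothetical, and this says nothing about how the consumed set would be computed automatically.

```python
def trim_config(config, consumed_flags):
    """Keeps only the settings that the target (and its deps) actually read."""
    return {name: value for name, value in config.items() if name in consumed_flags}


# Two top-level builds that differ only in a flag read by the binary itself.
build_a = {"cpu": "k8", "//app:feature": "on"}
build_b = {"cpu": "k8", "//app:feature": "off"}

# A dependency that only cares about the CPU.
lib_consumes = {"cpu"}

lib_config_a = trim_config(build_a, lib_consumes)
lib_config_b = trim_config(build_b, lib_consumes)
shareable = lib_config_a == lib_config_b
```

Once trimmed, the dependency has one configuration across both builds, so it is analyzed and built once.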

Time-wise, I'd like to share some clearer experimental results on some Java actions over the next month or two. If that all looks good I don't see why we can't enable this limited approach by, say, January. It might take more tweaks to figure out the C++ nuances.

ulrfa commented 3 years ago

Thanks @gregestren!

C++ has the extra challenge of debug mode symbol paths. But that's only a certain subset of C++ actions. Not all of them.

What subset of C++ actions do you mean? Does the subset include all actions compiling source code with debug symbol paths? Unfortunately we need to compile our C/C++ code with debug symbol paths.

For https://groups.google.com/g/bazel-discuss/c/zVEc7gzbyu0, another complementary idea is "trimming" - if it's really only the binary that consumes the flag, we could simply remove that flag from configurations in its dependencies. I already have a tool we could conceptually use to make this happen. But it'd require preprocessing: every time a BUILD file changes you'd have to rerun that tool to annotate the BUILD rules. A 100% automatic approach would be ideal.

Trimming is interesting! I guess that would also reduce build graph size and RAM requirement. I will have a look at your tool! But unfortunately, we have a deep build graph, with many configuration options consumed by lots of cc_library. It would be hard for us without an automatic approach.

Do you as final goal, aim for an automatic trimming solution and/or an output path solution handling C/C++ code with debug symbol paths? If yes, would you like to give a rough time estimate?

I'm sorry to bother you about the time estimates. We are considering if going all-in with transitions, and your input about what to expect, and roughly when, is essential for us in that decision.

gregestren commented 3 years ago

What subset of C++ actions do you mean? Does the subset include all actions compiling source code with debug symbol paths? Unfortunately we need to compile our C/C++ code with debug symbol paths.

Yes, I mean actions that rely on paths for resolving debug symbols vs. those that don't. Although it's not just that, it's also whatever consumes those paths (like gdb). If you're not actually debugging maybe this doesn't matter. But if you need debug symbol paths I guess that's not the case?

This isn't to say there aren't options. We could conceivably rewrite the symbol paths after the fact. But that'd be a specialized effort.

Trimming is interesting! I guess that would also reduce build graph size and RAM requirement. I will have a look at your tool! But unfortunately, we have a deep build graph, with many configuration options consumed by lots of cc_library. It would be hard for us without an automatic approach.

The key point in my mind is if your top-level binary is the only one that actually consumes the flag in question, then we'd have some real options, no matter what cc_librarys in the subgraph do. If those cc_librarys really need to behave differently based on these options then by definition they wouldn't be shareable anyway. We'd need more details on exactly how the flag is used to clarify assumptions.

Do you as final goal, aim for an automatic trimming solution and/or an output path solution handling C/C++ code with debug symbol paths? If yes, would you like to give a rough time estimate?

That would be wonderful, but it's an ambitious goal that I can't credibly put a timeline on. I'm trying to focus effort on incremental steps forward, so we can see credible practical progress vs. a reallllly long wait with unclear outcome.

So in my view the status quo is for us to identify optimizable use cases and try to optimize them. Not try to automatically make everything work at peak efficiency.

I'm sorry to bother you about the time estimates. We are considering if going all-in with transitions, and your input about what to expect, and roughly when, is essential for us in that decision.

No worries. I'm not sure my input is helping you with this decision. I guess I'm ultimately saying we need to understand the precise requirements of specific builds and aim optimizations at improving those builds (and whatever other builds have the same patterns). So the real answer, as usual, is in the details.

burkpojken commented 3 years ago

Hi! I work on the same project as ulrfa. I have also posted this question in the forum: https://groups.google.com/g/bazel-discuss/c/zVEc7gzbyu0/m/5UcZ8aXOBQAJ

I try here to describe our use case:

We build C/C++ applications for an embedded system with quite large build graphs with very many configurable options using "User-defined build settings" https://docs.bazel.build/versions/master/skylark/config.html

Examples of configurable options are:

Select bazel targets based on HW configuration
Select bazel targets based on the environment in which the application will be used, such as a test environment or customer environment
Select bazel targets or set C defines (-D flag) in cc_* targets for stubbed testing, where some parts of the system are stubbed for testing purposes

The targets that are affected by the options can be at any level in the dependency tree. Many options have private visibility and only affect a sub-part of the system, but we depend on the correct command-line options being set when the application is built.

The typical use case is that you build one application with some specified configurable options.

If you do this on the command line everything will be built in the default output tree. If you change one option or build another application with one option that differs, only the targets that are affected by the option will be rebuilt, everything else can be reused.

If you do this in a transition, nothing can be reused between the builds.

This will cause a lot of rebuilds if something in, e.g., some common code is changed. It will also require much larger remote cache storage.

We need to be able to debug the application targets. We use gdb and depend on the debug symbols being correct to be able to show the source files.

ulrfa commented 3 years ago

Thank you for your answer @gregestren!

I just wrote a separate ticket about an idea. Maybe it could be one incremental step forward: #12568

zachgrayio commented 3 years ago

Any additional updates here yet @gregestren ?

gregestren commented 3 years ago

Hi @zachgrayio

@ulrfa and I are having some interesting discussion on https://github.com/bazelbuild/bazel/issues/12568 that I believe yields one plausible possibility.

Otherwise I've successfully done some experiments with Java compilation that show promising results. It's less clear to me at the moment how that scales. Not that I think it doesn't scale, just that many builds have lots of non-Java actions. So if Java compilation is 5% of a build's actions how much total benefit is possible?

I still want to share results soon but it remains slow going. It's basically a part-time effort from just me now while the core dev team focuses on migrating Android rules to the platforms API.

Also see the pending update for configurability's roadmap for more context: https://github.com/bazelbuild/bazel-website/pull/289

gregestren commented 3 years ago

Update:

Seeing promising results on various sample builds (including real builds, not just toys) with Java compilation. Complications to take into account:

.jdeps files are manifests of a Java target's deps' jars. Since I'm messing with their paths, they need special processing.
Have to take care when the same input appears in different configurations. This is most likely for 'exec' and 'host' tools.
Just as good results for Android actions as Java actions
The JDK binary has to be consistent for good caching. i.e. a Mac JDK and Linux JDK probably can't share caches since it's literally different binaries running the action.
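
The second complication above (the same input appearing in different configurations) can be sketched in Python as a collision check over config-stripped paths. The config directory names and `find_collisions` are hypothetical; this only illustrates why such actions need care before their paths are rewritten.

```python
from collections import defaultdict


def find_collisions(input_paths):
    """Groups inputs by config-stripped path and reports any path claimed twice."""
    by_stripped = defaultdict(set)
    for path in input_paths:
        root, _config, rest = path.split("/", 2)
        by_stripped[f"{root}/{rest}"].add(path)
    return {
        stripped: sorted(originals)
        for stripped, originals in by_stripped.items()
        if len(originals) > 1
    }


# One action consuming the same tool jar built for the target and for exec.
inputs = [
    "bazel-out/k8-fastbuild/tools/gen.jar",
    "bazel-out/k8-opt-exec/tools/gen.jar",
    "bazel-out/k8-fastbuild/lib/libfoo.jar",
]
collisions = find_collisions(inputs)
```

An action with a non-empty result cannot simply have its config segments dropped, since two distinct files would land on one path.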

I'm moving into checking in my functionality behind an --experimental flag. It'll also need hooks for the executor (remote? sandbox?). I'm still trying to plant a flag in the ground with Java actions, but this should still conceivably be expandable more generally.

keith commented 3 years ago

The JDK binary has to be consistent for good caching. i.e. a Mac JDK and Linux JDK probably can't share caches since it's literally different binaries running the action.

This was the use case we're the most interested in here, could version or something be used instead as the inputs?

guw commented 3 years ago

The JDK binary has to be consistent for good caching. i.e. a Mac JDK and Linux JDK probably can't share caches since it's literally different binaries running the action.

This was the use case we're the most interested in here, could version or something be used instead as the inputs?

To echo that - our need for caching is being able to produce the Java-related caches on Linux and the ability to consume them on Macs. Both Linux and Mac JDKs produce the same cross-platform output files and should be treated as one configuration by Bazel.

gregestren commented 3 years ago

I get what you're saying from a user perspective. From an implementation perspective that's a different focus. There are three implementation themes at play here:

Improve how output paths (bazel-out/x86-fastbuild/...) are formed, so changing x86 or fastbuild doesn't by itself invalidate caches. That's what I'm focusing on now.
Verify the output consistency of cross-platform JDKs. I know that's the whole point of Java. But given leaky abstractions, etc., I'd want to sanity check this with some JDK experts. I can CC appropriate folks on this one.
Remote execution protocol. It's a pretty core assumption that an action is cacheable if a) its command line is the same, b) its input and output paths are the same, and c) its input digests are the same. Cross-platform caching suggests designing a more nuanced algorithm. I'd need to loop in some executor experts to make more sense of this.
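
To illustrate the caching assumption in the third item, here is a Python stand-in for a cache key built from the same three ingredients. It is not the remote execution protocol's actual digest computation, and the `compiler` command line and all helper names are hypothetical.

```python
import hashlib
import json


def action_key(command_line, input_digests, output_paths):
    """Hashes command line, input paths and digests, and output paths into one key."""
    payload = json.dumps(
        {
            "argv": command_line,
            "inputs": sorted(input_digests.items()),
            "outputs": sorted(output_paths),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_action(config):
    """Builds the same logical action under a given config directory."""
    src = f"bazel-out/{config}/pkg/Foo.java"
    out = f"bazel-out/{config}/pkg/Foo.jar"
    return ["compiler", "--in", src, "--out", out], {src: "digest-of-Foo.java"}, [out]


def strip_config(path):
    root, _config, rest = path.split("/", 2)
    return f"{root}/{rest}"


def mapped(action):
    """Rewrites every bazel-out path in the action to its config-stripped form."""
    argv, inputs, outputs = action
    return (
        [strip_config(a) if a.startswith("bazel-out/") else a for a in argv],
        {strip_config(p): d for p, d in inputs.items()},
        [strip_config(p) for p in outputs],
    )


key_x86 = action_key(*make_action("x86-fastbuild"))
key_arm = action_key(*make_action("arm-fastbuild"))
key_x86_mapped = action_key(*mapped(make_action("x86-fastbuild")))
key_arm_mapped = action_key(*mapped(make_action("arm-fastbuild")))
```

The input content is identical in both cases; only the config segment in the paths separates `key_x86` from `key_arm`, and removing it makes the two keys coincide.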

If it's a large task to address that last point generally, I wonder if an "execution proxy" could work? If we have no general way today to declare "these actions look different but they're really the same", and devs are cautious to model that since it adds risk of correctness failures, then what if an org runs their own executor service that just passes requests to the real service? Then it can add custom rules that, say, auto-map a Mac-based JDK path/digest to the Linux one before passing it on.

It's then on the org to define these rules in ways they're comfortable with. And it doesn't require modifications to core execution protocols, and all the caution that necessitates.
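
A toy Python sketch of what such a proxy rule could look like, assuming the org maintains its own table of (path, digest) pairs it is willing to treat as equivalent. Every path, digest, and name here is a made-up placeholder; nothing like this exists in Bazel or the execution protocol.

```python
# Hypothetical org-defined aliases: Mac JDK tool -> Linux JDK tool.
ALIASES = {
    ("external/jdk_macos/bin/javac", "digest-mac-javac"): (
        "external/jdk_linux/bin/javac",
        "digest-linux-javac",
    ),
}


def canonicalize(inputs):
    """Rewrites aliased (path, digest) pairs before the request is forwarded."""
    return dict(ALIASES.get(item, item) for item in inputs.items())


mac_request = {
    "external/jdk_macos/bin/javac": "digest-mac-javac",
    "pkg/Foo.java": "digest-foo",
}
linux_request = {
    "external/jdk_linux/bin/javac": "digest-linux-javac",
    "pkg/Foo.java": "digest-foo",
}

same_after_proxy = canonicalize(mac_request) == canonicalize(linux_request)
```

The correctness risk sits entirely in the alias table, which is why it would be on the org to decide what goes in it.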

keith commented 3 years ago

For reference the Xcode selection logic is an interesting example of how the actual inputs are abstracted away from some identifier, in this case the version number, which allows different versions potentially to be treated the same, and at the very least abstracts the direct inputs away from the inputs as bazel sees it.

zachgrayio commented 3 years ago

Then it can add custom rules that, say, auto-map a Mac-based JDK path/digest to the Linux one before passing it on.

I'm all for getting creative here to make this work--we do a lot of this stuff in our backend systems, but this actually sounds pretty similar to some of the hacks people are already doing today to fix the output paths issue to work towards sharing artifacts that "should be" shared. It's not really very ergonomic for everyone to have to solve it over again for their org when adopting the tool.

schultetwin commented 3 years ago

Following along here. We have the exact same use case as @ulrfa describes above (building embedded systems using Bazel, and having lots of configuration options). Similar issues with gdb as well :). Happy to help test/review changes if that would be helpful. I have high hopes for Bazel in the embedded space :).

gregestren commented 3 years ago

@schultetwin which kinds of actions would you like to see caching on? All C++?

I'll have to review this thread, since there are a variety of different ideas here. It's on my TODO list to refresh soon, as I've started checking in some experimental code (https://github.com/bazelbuild/bazel/commit/bb2941bccb0d010223fcbd07139e437b219e837a and https://github.com/bazelbuild/bazel/commit/526ea392ac50a0f11eb65fd29a6cde5962a08c97, which is currently rolled back due to memory pressure). I'm currently trying to get https://github.com/bazelbuild/bazel/commit/526ea392ac50a0f11eb65fd29a6cde5962a08c97 committed again, at which point I intend to rapidly explore better caching specifically for Java.

Support for other actions should naturally grow out of this.

At the moment the code requires buy-in logic from the executor. It's not complicated per se, but we'll have to review Bazel's client-side executor APIs to get the right code injected to let this really start taking off.

That's all I've got at the moment.

schultetwin commented 3 years ago

Thank you @gregestren.

All C/C++ actions. Some background that might be helpful:

We're building firmware binaries for 10+ hardware systems (and probably 20-30 different bazel platforms, as each hardware platform might have different configurations per application running on it) all out of the same codebase. All hardware systems have a similar CPU, cortex-m4, but some files will require slightly different -Ddefine options when actually executing gcc. However, we have multiple hardware devices that have almost identical setups, but do different things (e.g., two electronic locks that are used slightly differently, but could share almost all compiled object files except for maybe one or two). Today, we have to rebuild every .c or .cc file for each platform, even though most of the actions are redundant.

As a second problem, most of our engineers develop on OSX, but our CI system compiles in Linux. Our CI is what populates the cache, so our OSX users cannot use the cached values. However, we're using ARM's gcc on both platforms, so should be able to share the output. This is similar to the above Java discussion I believe.

gregestren commented 3 years ago

Got it. If you're saying that most of the files won't require a different -Ddefine (which by itself invalidates caching, since it's legitimately a different action), I think there's good opportunity to extend what I'm working on now to address that. Caveat being we have to be careful about debug builds, where the compiler embeds paths to the source files in the object files. If your builds in question don't need that, that makes things much easier.

Understood about the second problem. That's a common request and I really want to get traction on it. The biggest concern, I think expressed somewhere on this issue, is when the tool binaries or paths are different on different architectures. In some discussion we talked about aliasing to possibly work around that.

schultetwin commented 3 years ago

If you're saying that most of the files won't require a different -Ddefine (which by itself invalidates caching, since it's legitimately a different action),

yep! that’s exactly it.

Caveat being we have to be careful about debug builds, where the compiler embeds paths to the source files in the object files. If your builds in question don't need that, that makes things much easier.

Our builds would need this data unfortunately :(. We need to be able to debug the images that are created. One addition here, some of our source files are auto generated as part of our build. And so we need access to those sources as well.

In some discussion we talked about aliasing to possibly work around that.

ah, interesting. Is this aliasing of tool paths? Or aliasing at the tool level? (I.e.: Tell Bazel, trust me these two tools are identical)

fmeum commented 2 years ago

Just started thinking about this a bit more and I am having a hard time coming up with realistic examples of actions with both of the following properties:

The action depends on two artifacts that, with the config prefix stripped, would live at the same output path (e.g. libnative.so built for Linux and macOS).
The action is not a pure packaging action (e.g. one that bundles binaries or libraries built for different targets into an archive or APK).

I would be very interested to learn about examples both inside and outside Google. If none come to mind, maybe this could open up other trade-offs for the path mapping scheme.

gregestren commented 2 years ago

My main immediate motivating use case is Java and Android compilation, where target CPU changes shouldn't affect most of their inputs or outputs.

It gets weirder when you have different OS's where the compilers themselves are different.

gregestren commented 2 years ago

Thankfully moving the train again: https://bazel-review.googlesource.com/c/bazel/+/174452

github-actions[bot] commented 1 year ago

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

guw commented 1 year ago

@bazelbuild/triage the issue is still relevant and needs to remain open. Especially because it seems to be connected with #8339.

plonter123 commented 5 months ago

We are also affected by this issue. We have a multi-platform, multi-configuration C++ project which requires compiling for multiple platforms in the same build command, therefore we have to use transitions (which we define using with_cfg.bzl), and we find that these platforms cannot share any cache even though they're mostly similar. In addition, we use the transitions to pass down user settings, which means we have to change the transitions a lot, and every change results in a full rebuild. As C++ isn't supported for path mapping, we experience full rebuilds a lot, and mostly don't benefit from bazel's strong caching.

BalestraPatrick commented 4 months ago

@gregestren Do you know if there are plans to extend this support to Kotlin actions as well (KotlinCompile, KotlinApt, etc.)?

fmeum commented 4 months ago

@gregestren Do you know if there are plans to extend this support to Kotlin actions as well (KotlinCompile, KotlinApt, etc.)?

I will be working on that after https://github.com/bazelbuild/bazel/pull/19723 lands. It may also require lazy depset transformations to implement efficiently, but an experimental implementation that accepts increased memory usage should be doable without.

lior10r commented 3 months ago

@gregestren Are there any plans to add support for C++ actions? It would be a huge help for us.

fmeum commented 3 months ago

Can't promise anything, but I am planning to look into C++ some time in April.

fmeum commented 1 month ago

I just opened a discussion that documents the current state of path mapping in Bazel: https://github.com/bazelbuild/bazel/discussions/22658 I am planning to keep this up-to-date going forward.
