bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

No way to use toolchains correctly in executable rule output #19645

Open anguslees opened 11 months ago

anguslees commented 11 months ago

Description of the feature request:

A common pattern[^1] I see in rules is toolchain declarations that effectively capture a tool path, and templated executable/test rules that resolve the toolchain and substitute the tool path into the executable file output.

Unfortunately this is incorrect! When the exec and target platforms differ, the toolchain will be resolved for the exec platform, but the executable/test output needs to be for the target platform.

As far as I can see, there is no way to make toolchain resolution / platform transitions do the correct thing here. Since executable rules are how bazel tests work, this is a pretty major gap in the toolchains/platforms story. I can't file bugs against rules, because there is no reasonable solution to suggest. This is a feature request for a solution that is (a) correct, and also ideally (b) easy/obvious.

One onerous current workaround is to declare two toolchains for all such cases. One that uses exec_compatible_with/target_compatible_with as usual, and is used when the toolchain is invoked by an action directly. And a second toolchain that uses target_compatible_with only, which can be used in executable rule output. This is moderately terrible, since this duplicates work for everybody, confuses the meaning of exec_compatible_with vs target_compatible_with, and toolchain authors can't anticipate when their toolchain needs to be used in this way - undoing the loose coupling between author and consumer that is kind of the point of toolchains.

An incorrect (afaics) variation of the two-toolchains approach is to declare only the target_compatible_with toolchain, then create a "current_foo_toolchain" rule that resolves the toolchain for a particular target and returns the toolchain provider through a rule attribute - see eg the python toolchain. This is incorrect in general because another toolchain that depends on this toolchain cannot be resolved at the same time for the same exec+target combination (ie: toolchains=[foo] with a dependency on another rule with toolchains=[bar] is not the same as toolchains=[foo, bar]).
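The "current_foo_toolchain" pattern referenced above looks roughly like this (a sketch with illustrative names; compare rules_python's current_py_toolchain):

```starlark
# Resolves the toolchain in *this* target's configuration and re-exports
# its providers, so consumers can depend on it via an ordinary attribute
# (e.g. with cfg = "target"). As noted above, this is subtly incorrect:
# toolchain resolution for this forwarder happens independently of any
# other toolchains the consumer requests.
def _current_foo_toolchain_impl(ctx):
    toolchain_info = ctx.toolchains["//foo:toolchain_type"]
    return [toolchain_info]

current_foo_toolchain = rule(
    implementation = _current_foo_toolchain_impl,
    toolchains = ["//foo:toolchain_type"],
)
```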

As a possible implementation, I'd love to be able to do exec_group(cfg='target'), since I think ctx.exec_groups["target"].toolchains is an intuitive and 'obvious' way to address the above.
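A sketch of how that hypothetical API might read (the cfg parameter to exec_group does not exist today; everything here illustrates the request, not working code):

```starlark
# HYPOTHETICAL: exec_group(cfg = "target") is the feature being requested.
def _impl(ctx):
    # Toolchains in this group would be resolved for the *target*
    # platform, so their files are safe to embed in executable output.
    tool = ctx.exec_groups["target"].toolchains["//foo:toolchain_type"]
    # ... substitute tool paths into the executable output ...

foo_push = rule(
    implementation = _impl,
    exec_groups = {
        "target": exec_group(
            cfg = "target",  # <-- does not exist; the requested feature
            toolchains = ["//foo:toolchain_type"],
        ),
    },
    executable = True,
)
```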

Another option is to not use toolchains for what are effectively artifacts. Rule authors obviously like the loose coupling between toolchain declaration and use however, so this potential path would require widespread advocacy/documentation, and perhaps a canonical meta-library for an "artifact toolchain".

See also slack thread: https://bazelbuild.slack.com/archives/CA31HN1T3/p1690176577746239

[^1]: Examples of (incorrect) toolchains used for executable rules in rules_js, rules_docker, rules_oci, gazelle

Which category does this issue belong to?

Core

What underlying problem are you trying to solve with this feature?

It should be possible to resolve and refer to the correct toolchain for an executable rule output.

Which operating system are you running Bazel on?

macos and Linux

What is the output of bazel info release?

6.3.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

moroten commented 7 months ago

Any updates on this?

katre commented 5 months ago

I'm very confused about your use case, hopefully we can figure out what makes sense.

Typically, the intent of toolchains is that they are used during build actions: a c++ compiler is called, an analyzer is run, whatever.

It sounds like you want some files from a toolchain, but you want your build rule to bundle those into the outputs? We've used toolchains for that in a few cases (the Python runtime is probably the best example), but it's always been a tricky fit, because it's using the tools as an output to be passed along.

Maybe you don't want a toolchain, you want a direct (target configured) dependency on the tools, so that they can be bundled into your rule's output?

anguslees commented 5 months ago

Maybe you don't want a toolchain, you want a direct (target configured) dependency on the tools, so that they can be bundled into your rule's output?

I agree, but see my numerous linked examples in the footnote of the opening post. This mis-pattern is quite widespread, because folks (apparently) want to use toolchains in tests or bazel-run scripts (ie: the output of executable rules).

This is a feature request for an improvement to bazel toolchains/exec_groups to make it possible to resolve toolchain for a rule's target/parent. Or alternatively we loudly declare this will 'never' be possible, and (as a community), we go and rewrite these examples to avoid toolchains.

I'm ok with either path to address this issue - right now I'm kind of stuck waiting for a decision. Folks "expect" this to work, but it doesn't - and if I'm going to delete their toolchain then I'd at least like to be able to link to this PR (or some other doc) as evidence for why that was the correct change to their rules.

katre commented 5 months ago

Toolchains (uniquely) have two options for their dependencies: attributes with cfg = "exec" are built for the execution platform (tools that run during the build), while attributes with cfg = "target" are built for the target platform (files that end up in the output).

So part of the solution is probably to be more careful of the target/exec distinction here. It's unfortunately easy to get this wrong in the default case, since the base case is that the host platform is used as both the only exec platform and as the default target platform.
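That exec/target split inside a single toolchain can be sketched like this (illustrative names, not any real toolchain):

```starlark
def _my_toolchain_impl(ctx):
    return [platform_common.ToolchainInfo(
        compiler = ctx.attr.compiler,  # runs during the build: exec platform
        runtime = ctx.attr.runtime,    # ships in the output: target platform
    )]

my_toolchain = rule(
    implementation = _my_toolchain_impl,
    attrs = {
        # Built for the execution platform; safe to invoke from actions.
        "compiler": attr.label(cfg = "exec", executable = True),
        # Built for the target platform; safe to bundle into outputs.
        "runtime": attr.label(cfg = "target"),
    },
)
```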

I agree that there is probably more we can be doing to make this cleaner. I'm going to add this item to our backlog for tracking purposes. If you have any specific features you'd like to request, we'd be happy to see a proposal or further discussion to move this along.

anguslees commented 5 months ago

So part of the solution is probably to be more careful of the target/exec distinction, here.

I think we're talking past each other. I (believe I am) familiar with cfg=target/cfg=exec and when to use them. This does not help with the problem described in this issue.

I'll explain one example here in detail, to see if that helps:

rules_oci has a target that can be used similar to bazel run //my/image:push to upload an artifact (it's an OCI image, but the specific file format is not relevant) to a repository. As with all other bazel-run targets, the rule for this //my/image:push target is an "executable" rule, which returns the file (commonly a script) that is going to be executed, but doesn't actually run the file itself.

This executable rule in this case is the following:

def _impl(ctx):
    crane = ctx.toolchains["@rules_oci//oci:crane_toolchain_type"]
    yq = ctx.toolchains["@aspect_bazel_lib//lib:yq_toolchain_type"]

    if not ctx.file.image.is_directory:
        fail("image attribute must be a oci_image or oci_image_index")

    _, _, _, maybe_digest, maybe_tag = util.parse_image(ctx.attr.repository)
    if maybe_digest or maybe_tag:
        fail("`repository` attribute should not contain digest or tag. got: {}".format(ctx.attr.repository))

    executable = ctx.actions.declare_file("push_%s.sh" % ctx.label.name)
    files = [ctx.file.image]
    substitutions = {
        "{{crane_path}}": crane.crane_info.binary.short_path,
        "{{yq_path}}": yq.yqinfo.bin.short_path,
        "{{image_dir}}": ctx.file.image.short_path,
        "{{fixed_args}}": " ".join(_quote_args(["--repository", ctx.attr.repository])),
    }
    if ctx.attr.remote_tags:
        files.append(ctx.file.remote_tags)
        substitutions["{{tags}}"] = ctx.file.remote_tags.short_path

    ctx.actions.expand_template(
        template = ctx.file._push_sh_tpl,
        output = executable,
        is_executable = True,
        substitutions = substitutions,
    )
    runfiles = ctx.runfiles(files = files)
    runfiles = runfiles.merge(yq.default.default_runfiles)
    runfiles = runfiles.merge(crane.default.default_runfiles)

    return DefaultInfo(executable = executable, runfiles = runfiles)

# Refactored slightly from the original for clarity
oci_push = rule(
    implementation = _impl,
    attrs = _attrs,
    toolchains = [
        "@rules_oci//oci:crane_toolchain_type",
        "@aspect_bazel_lib//lib:yq_toolchain_type",
    ],
    executable = True,
)

To summarise the relevant parts: this rule gets the paths for two tools (crane and yq) using bazel toolchains, substitutes those paths into a shell script template, and then returns that shell script as the executable output file that will eventually be invoked by bazel-run.

But this use of toolchains is incorrect when cross-compiling!

Specifically: When bazel host platform != exec platform (eg: run bazel on macos with a linux RBE cluster), the rule implementation above will run on the exec (linux) platform, but the output executable needs to work on the host (macos) platform. The toolchains will incorrectly resolve to the exec (linux) platform, and the executable output will be a script that points to linux tools that will fail when run on macos. The correct crane and yq tools to use here should be the tools for the target platform of this rule, and not the exec platform.

But there is NO WAY to resolve the toolchain for the target of a rule!

^^This is the bazel limitation described in this github issue. If there is anything I can do to expand or clarify the above explanation, please suggest it because it seems to be subtle/difficult for folks to recognise the issue here and I'd like to point folks to this comment for enlightenment.

Afaik cfg=target is not relevant here, since there is nowhere to put that that will affect toolchain resolution - the ctx.toolchains["@rules_oci//oci:crane_toolchain_type"] result will always be resolved for the exec platform.

Afaics this incorrect use (or misunderstanding) of toolchains is moderately common (see my examples in widely used rules), and can impact both bazel-run and bazel tests (both executable rules). Ideally, we improve bazel so that toolchains can be used in these important use cases.

moroten commented 5 months ago

This looks very related to #21805, where I suggest the --run_under target should be built for the host platform.

anguslees commented 4 months ago

This looks very related to https://github.com/bazelbuild/bazel/discussions/21805

Related, but not the same - just to prevent an overzealous close-as-duplicate.

I agree --run_under (and run generally) should be built for host platform. Providing a better default --platform won't fix this no-way-to-resolve-toolchain-for-other-platform issue, however.

EdSchouten commented 3 months ago

Hey @anguslees,

I did some experimenting, and it looks like what you want is possible by doing the following:

  1. You move the toolchains into a helper rule, which returns the toolchain information through a provider.
  2. You instantiate a target of that helper rule in some BUILD file.
  3. You let the original rule depend on the target of the helper rule, and extract the toolchains from the provider.
  4. You apply an outgoing edge transition on that dependency.

The outgoing edge dependency can make use of the following transition:

def _transition_exec_to_target_impl(settings, attr):
    return {
        "//command_line_option:extra_execution_platforms": settings["//command_line_option:platforms"],
    }

transition_exec_to_target = transition(
    implementation = _transition_exec_to_target_impl,
    inputs = ["//command_line_option:platforms"],
    outputs = ["//command_line_option:extra_execution_platforms"],
)
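Wiring the four steps together might look roughly like this (labels and attribute names are illustrative; note that user-defined attribute transitions also need the function-transition allowlist):

```starlark
# Step 1: a helper rule that resolves the toolchain and forwards its providers.
def _resolved_crane_impl(ctx):
    return [ctx.toolchains["@rules_oci//oci:crane_toolchain_type"]]

resolved_crane = rule(
    implementation = _resolved_crane_impl,
    toolchains = ["@rules_oci//oci:crane_toolchain_type"],
)

# Steps 3 and 4: the executable rule depends on the helper (instantiated
# once in a BUILD file - step 2) through an attribute carrying the
# exec-to-target transition defined above. The implementation then reads
# the toolchain provider from ctx.attr._crane instead of ctx.toolchains.
oci_push = rule(
    implementation = _impl,
    attrs = {
        "_crane": attr.label(
            default = "//oci:resolved_crane",
            cfg = transition_exec_to_target,
        ),
        "_allowlist_function_transition": attr.label(
            default = "@bazel_tools//tools/allowlists/function_transition_allowlist",
        ),
    },
    executable = True,
)
```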

I will send out a PR against rules_oci soon to get oci_push() fixed.

That said, I do think that there is a bit of voodoo magic going on. Transitioning on command line flags is a bit funky. Especially because we're doing it on --extra_execution_platforms (which is fortunately documented as prepending to the list of registered platforms, which is exactly what we want). Maybe it makes sense to add something like this, either to @bazel_tools, @bazel_skylib, or @aspect_bazel_lib. Any preference for a place?

anguslees commented 3 months ago

Oh wow - I remember considering this, but I think I must have just incorrectly concluded that transitioning --extra_execution_platforms wouldn't work, without trying.

Fwiw on bazel 6.4.0, I needed to force the //command_line_option:platforms labels to strings first:

def _transition_exec_to_target_impl(settings, attr):
    return {
        "//command_line_option:extra_execution_platforms": [str(p) for p in settings["//command_line_option:platforms"]],
    }

Combined with the resolved_toolchain() idiom I've seen around a few places, this is actually not terrible. In fact I suspect many rulesets that have copied resolved_toolchain from rules_java - but are not actually portable across platforms - need to add this transition in order to be correct.

katre commented 2 months ago

Instead of using a transition to add new execution platforms, consider using a flag to activate them (once I sit down and implement the proposal, hopefully this week).

EdSchouten commented 2 months ago

Hey @katre,

Even though that proposal you linked looks interesting, I don’t really understand how it’s relevant in this context. In this specific case we want to force toolchain resolution so that the resulting toolchain can be executed on whatever the current target platform is. I don’t think that can be solved by conditionally enabling registration of platforms.

katre commented 2 months ago

You're right, I misunderstood your workaround, sorry about that.

BoleynSu commented 2 months ago

This can be solved by defining two toolchain_type for the toolchain. toolchain_type(name="tool") and toolchain_type(name="runtime"). We already do this for Python and Java.

EdSchouten commented 2 months ago

This can be solved by defining two toolchain_type for the toolchain. toolchain_type(name="tool") and toolchain_type(name="runtime"). We already do this for Python and Java.

Are those the toolchains that intentionally have exec/target_compatible_with set the wrong way around? I’ve never been a fan of that approach.

fmeum commented 2 months ago

Are those the toolchains that intentionally have exec/target_compatible_with set the wrong way around? I’ve never been a fan of that approach.

I don't know how Python does this, but I'm pretty sure that the Java setup is correct (although a bit complicated): https://github.com/bazelbuild/bazel/blob/2fb8564f7b0abfb669685df4873ffb8a5bb14741/tools/jdk/BUILD.tools#L11

BoleynSu commented 2 months ago

@EdSchouten there is no hack. Using transition for this is more hacky to me.

Basically,


toolchain_type(name = "tool")
toolchain_type(name = "runtime")

runtime_rule = rule(binary = label(cfg = "target"))
tool_rule = rule(binary = label(cfg = "exec"))

runtime_rule(name = "jq_amd64", binary = "@jq_linux_amd64")
toolchain(name = "jq_amd64", toolchain = "jq_amd64", target_compatible_with = linux_amd64, toolchain_type = "runtime")

tool_rule(name = "tool_jq_amd64", binary = "@jq_linux_amd64")
toolchain(name = "tool_jq_amd64", toolchain = "tool_jq_amd64", exec_compatible_with = linux_amd64, toolchain_type = "tool")

You can then use the runtime toolchain for outputs and the tool toolchain for build actions.
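Under that scheme, an executable rule would read the runtime toolchain rather than the tool one. A minimal sketch (provider field names and labels are assumed, not from any real ruleset):

```starlark
def _jq_script_impl(ctx):
    # Resolved via target_compatible_with, so this binary runs on the
    # platform the generated script itself targets.
    jq = ctx.toolchains["//:runtime"].binary
    out = ctx.actions.declare_file(ctx.label.name + ".sh")
    ctx.actions.write(
        output = out,
        content = "#!/bin/sh\nexec %s \"$@\"\n" % jq.short_path,
        is_executable = True,
    )
    return [DefaultInfo(
        executable = out,
        runfiles = ctx.runfiles(files = [jq]),
    )]

jq_script = rule(
    implementation = _jq_script_impl,
    toolchains = ["//:runtime"],
    executable = True,
)
```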

BoleynSu commented 2 months ago

The above is also what I proposed in https://github.com/aspect-build/bazel-lib/issues/747 which is to solve the exactly the same issue you want to solve in https://github.com/bazel-contrib/rules_oci/pull/590

EdSchouten commented 2 months ago

@BoleynSu

runtime_rule(name="jq_amd64", binary = "@jq_linux_amd64")
toolchain(name="jq_amd64", toolchain = "jq_amd64", target_compatible_with = linux_amd64, toolchain_type=runtime)

The way toolchain() is called, this is basically saying: "Behold! Here is a copy of jq that can be executed on any platform, and is capable of targeting Linux." This is, in my opinion, incorrect for two reasons:

  • That copy of jq can't be executed on any platform. For example, my Mac is unable to run it.
  • jq doesn't support the notion of a target platform. It is a tool that is intended to perform queries against JSON data. It is not capable of emitting any machine code for Linux amd64.

Furthermore, how would you deal with the case where you actually care about a true target platform? What if I'm on a Linux machine, and I would like to bazel build a shell script to be run on OpenBSD that ships with a C/C++ compiler that targets Sun Solaris? There is no target_of_target_compatible_with that we can set. In that case we have three platforms we need to deal with, and platform() only allows specifying two.

I also don't think it's reasonable to request that authors of toolchains declare all of their toolchains twice over, just to distinguish the case where a tool is called as part of a build action from the case where it is called as part of bazel run. That's why I'm calling it a hack.

Note that I'm not saying that the transition that I wrote is ideal. All I'm saying is that it's at least more correct and more flexible than having so-called 'runtime toolchains'.

BoleynSu commented 2 months ago

For the runtime toolchain, the binary is only supposed to be used in the output and should not be used to run actions - unless the output is itself used in the exec configuration by another rule, in which case the exec cfg of that other rule becomes the target cfg of this runtime toolchain.

BoleynSu commented 2 months ago

@EdSchouten For example,

jq_binary = rule(...)  # uses the jq runtime toolchain's binary as its output

jq_binary(name = "jq")  # building :jq gives you a binary for your target platform

genrule(  # building :run_jq uses a binary for your exec platform
    name = "run_jq",
    tools = [":jq"],
)

# For bazel rule implementations:
read_jq_file = rule(...)  # uses the jq tool toolchain's binary to read ".data" from src and write it to out

read_jq_file(name = "read", src = "input", out = "output")  # this will correctly use the right binary as well

BoleynSu commented 2 months ago

The way toolchain() is called, this is basically saying: "Behold! Here is a copy of jq that can be executed on any platform, and is capable of targeting Linux." This is, in my opinion, incorrect for two reasons:

Thus, this is correct, because the toolchain (the runtime one) is supposed to be (exec-)usable on any platform. The toolchain does not execute any action at all; it only produces a binary that is compatible with linux_amd64. You can think of it as a magic binary that works on any platform and always outputs a jq binary that is compatible with linux_amd64.

anguslees commented 2 months ago

This can be solved by defining two toolchain_type for the toolchain

Yes. I tried to respond to this solution in my original issue description:

One onerous current workaround is to declare two toolchains for all such cases. One that uses exec_compatible_with/target_compatible_with as usual, and is used when the toolchain is invoked by an action directly. And a second toolchain that uses target_compatible_with only, which can be used in executable rule output. This is moderately terrible, since this duplicates work for everybody, confuses the meaning of exec_compatible_with vs target_compatible_with, and toolchain authors can't anticipate when their toolchain needs to be used in this way - undoing the loose coupling between author and consumer that is kind of the point of toolchains.

I think those objections still stand. You're basically suggesting that every "tool" toolchain should declare/register two toolchains, since the toolchain providers can't anticipate whether a consumer is going to want it as an "exec" or a "target" toolchain. I agree that works; in composing this feature request, I felt this was unacceptably onerous on the entire ecosystem - or at least requires a "decision" that this is how we want to use toolchains, that we can point to and rally behind.

Note: An aside for others casually skimming this discussion:

BoleynSu commented 1 month ago

See https://github.com/aspect-build/bazel-lib/pull/875 for how it can be simplified.

You're basically suggesting that every "tool" toolchain should declare/register two toolchains,

Not every toolchain, only those that want to be used in cfg=target. Note that a toolchain can have binaries for both cfg=exec and cfg=target.

For example, a python toolchain can have a python-bytecode-compiler with cfg=exec and a python-interpreter with cfg=target. A py_binary rule uses the python-bytecode-compiler to generate bytecode, and packages a binary that runs the python-interpreter with the generated bytecode. How do you handle this case without having separate toolchains for python-bytecode-compiler and python-interpreter?

@EdSchouten's transition approach seems to work, but it does not in all cases. See https://github.com/bazel-contrib/rules_oci/pull/590#issuecomment-2200670829

anguslees commented 1 month ago

Not every toolchain, only those that want to be used in cfg=target

As described earlier in this github comment thread, the toolchain author can't know if their toolchain is going to be used in a test/run (cfg=target), or used directly in an action (cfg=exec). So yes, we would have to pre-emptively declare and configure every 'tool' toolchain twice, just in case.

For a py_binary rule, it use python-bytecode-compiler to generate bytecode and package a binary that runs python-interpreter with the generated bytecode. How do you handle this case without having seperate toolchain for python-bytecode-compiler and python-interpreter?

This is ~exactly what transition and toolchains were designed to support aiui! The classic case is a C compiler, that simultaneously needs a compiler for 'exec' platform, and a libc for 'target'. This is why a single toolchain resolution matches exec and target simultaneously, and the idea is that the toolchain has some attributes with cfg=exec (for files that are executed as part of build), and cfg=target (for files that are copied opaquely into the output).

I agree that this is not how rules_python uses toolchains. In an alternate reality, we would have a single python toolchain-type that matched on exec and target, and this toolchain would have a python interpreter for both bytecode generation at build-time (matching the 'exec' constraints) and another python interpreter for copying into runfiles (matching 'target' constraints).

I think we're talking in circles, and a github issue is perhaps not the best forum for a discussion. Happy to discuss further on bazel slack, or elsewhere.

BoleynSu commented 1 month ago

As described https://github.com/bazelbuild/bazel/issues/19645#issuecomment-2144311028 in this github comment thread, the toolchain author can't know if their toolchain is going to be used in a test/run (cfg=target), or used directly in an action (cfg=exec). So yes, we would have to pre-emptively declare and configure every 'tool' toolchain twice, just in case.

A toolchain should have a very clear API. For example, if it provides a binary, the API should clearly document whether that binary is to be used as part of a build action or as part of the output artifact.

This is ~exactly what transition and toolchains were designed to support aiui!

The transition solution is completely wrong. See https://github.com/bazel-contrib/rules_oci/pull/590#issuecomment-2200670829 and https://github.com/bazel-contrib/rules_oci/pull/590#issuecomment-2200796524

I think we are conflating two different concepts: 1) a toolchain used to run tools as part of a build action; 2) a (possibly prebuilt) binary that is registered through a toolchain.

For 1), suppose a toolchain provides

CcCompileInfo(compile = a function)

where

CcCompileInfo.compile(srcs = "a.cc") returns a DefaultInfo that is a binary.

When defining a cc_toolchain, you can provide an exec binary, e.g., gcc, and implement compile as using gcc to compile the source code to an executable. Or you can provide an exec binary, e.g., clang, plus a target binary, lli (https://llvm.org/docs/CommandGuide/lli.html), and implement compile as using clang to compile the source code to LLVM IR and packaging lli with the generated IR as the binary. Note that either way, the binaries are never exposed as part of the toolchain's API. The only API is the compile function.

For 2), we basically want a magic binary that works on every supported platform. One can achieve this by using

alias(name="cc-compiler", actual = select(...))

However, this is not flexible enough, as one may want to register their own compiler. Instead we can use a toolchain that accepts a cfg=target binary, plus a special rule that uses this toolchain to produce the binary for the target platform.

toolchain_type(name = "binary")
toolchain_def(name = "def-{platform}", binary = "{platform}_gcc", target_compatible_with = {platform})
toolchain(name = "cc-{platform}", toolchain = "def-{platform}", toolchain_type = "binary")

resolved_cc_binary(name = "resolved_cc")  # resolved_cc_binary uses the "binary" toolchain_type to retrieve the binary.

Then the user should just use resolved_cc directly however they want. If they want to use it in a build action:

_tool = attr.label(cfg="exec", executable=True, default = ":resolved_cc")

If they want to use it in the output artifact

_runner = attr.label(cfg="target", executable=True, default = ":resolved_cc")
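For completeness, resolved_cc_binary might be sketched like this (all names are assumed; ctx.actions.symlink requires Bazel 6+):

```starlark
def _resolved_cc_binary_impl(ctx):
    # The "binary" toolchain type matches on target_compatible_with, so
    # this file is the compiler built for the current *target* platform.
    binary = ctx.toolchains["//:binary"].binary
    # Re-export it as this target's own executable output.
    out = ctx.actions.declare_file(ctx.label.name)
    ctx.actions.symlink(output = out, target_file = binary, is_executable = True)
    return [DefaultInfo(executable = out)]

resolved_cc_binary = rule(
    implementation = _resolved_cc_binary_impl,
    toolchains = ["//:binary"],
    executable = True,
)
```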