bazelbuild / rules_python

Bazel Python Rules
https://rules-python.readthedocs.io
Apache License 2.0

Toolchain uses target_compatible_with for platform compatibility #1739

Open matzipan opened 9 months ago

matzipan commented 9 months ago

🐞 bug report

Affected Rule

python_toolchain_build_file_content

Is this a regression?

Unsure

Description

When python_toolchain_build_file_content configures py_toolchain_suite, it uses target_compatible_with to configure the platform compatibilities.

It should instead use exec_compatible_with.

Yes, on most systems where rules_python runs these are one and the same. But if I want to use rules_python in my cross-compilation toolchain, I can't: target_compatible_with would say it's incompatible. See here.

The following change allows me to do that:

diff --git a/python/private/toolchains_repo.bzl b/python/private/toolchains_repo.bzl
index c586208..38648ee 100644
--- a/python/private/toolchains_repo.bzl
+++ b/python/private/toolchains_repo.bzl
@@ -63,11 +63,11 @@ def python_toolchain_build_file_content(
 py_toolchain_suite(
     user_repository_name = "{user_repository_name}_{platform}",
     prefix = "{prefix}{platform}",
-    target_compatible_with = {compatible_with},
+    exec_compatible_with = {exec_compatible_with},
     python_version = "{python_version}",
     set_python_version_constraint = "{set_python_version_constraint}",
 )""".format(
-            compatible_with = meta.compatible_with,
+            exec_compatible_with = meta.exec_compatible_with,
             platform = platform,
             set_python_version_constraint = set_python_version_constraint,
             user_repository_name = user_repository_name,
diff --git a/python/versions.bzl b/python/versions.bzl
index 9a28f15..11aa267 100644
--- a/python/versions.bzl
+++ b/python/versions.bzl
@@ -408,7 +408,7 @@ MINOR_MAPPING = {

 PLATFORMS = {
     "aarch64-apple-darwin": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:macos",
             "@platforms//cpu:aarch64",
         ],
@@ -418,7 +418,7 @@ PLATFORMS = {
         arch = "arm64",
     ),
     "aarch64-unknown-linux-gnu": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:linux",
             "@platforms//cpu:aarch64",
         ],
@@ -429,7 +429,7 @@ PLATFORMS = {
         arch = "aarch64",
     ),
     "ppc64le-unknown-linux-gnu": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:linux",
             "@platforms//cpu:ppc",
         ],
@@ -440,7 +440,7 @@ PLATFORMS = {
         arch = "ppc64le",
     ),
     "s390x-unknown-linux-gnu": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:linux",
             "@platforms//cpu:s390x",
         ],
@@ -451,7 +451,7 @@ PLATFORMS = {
         arch = "s390x",
     ),
     "x86_64-apple-darwin": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:macos",
             "@platforms//cpu:x86_64",
         ],
@@ -459,7 +459,7 @@ PLATFORMS = {
         arch = "x86_64",
     ),
     "x86_64-pc-windows-msvc": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:windows",
             "@platforms//cpu:x86_64",
         ],
@@ -467,7 +467,7 @@ PLATFORMS = {
         arch = "x86_64",
     ),
     "x86_64-unknown-linux-gnu": struct(
-        compatible_with = [
+        exec_compatible_with = [
             "@platforms//os:linux",
             "@platforms//cpu:x86_64",
         ],

🔬 Minimal Reproduction

Define a platform which doesn't match your host platform and try to use it, as if for cross-compilation.

🔥 Exception or Error

No matching toolchain.

🌍 Your Environment

Operating System:

Linux, x86_64

Output of bazel version:

6.3.2

Rules_python version:

0.29.0

Anything else relevant?

aignas commented 9 months ago

Right now rules_python does not strictly support cross-compilation; it seems that rules_pycross can use our toolchain for that.

I wonder how your cross-compilation toolchain achieves that. What other modifications did you have to make to get it to work?

matzipan commented 9 months ago

I think there is maybe some confusion in terminology. Just to make sure we are using the same terms: in my environment, Python only ever runs on one host/execution platform, which is my workstation. There is no cross aspect to it.

But in the build chain, I have compilation targets which are configured for other target platforms. Let's say a binary that is compiled for a different OS or architecture that I then manipulate with a Python tool, or a binary that I then run inside a QEMU target launched from Python.

The only modification I needed to make this work was this one above.

aignas commented 9 months ago

Thanks for the explanation. I think your patch makes sense, would you like to submit a PR for it?

rickeylev commented 9 months ago

Hol up. TCW is the correct setting. I'll post again shortly to explain why.

jvolkman commented 9 months ago

See #704 and #747. We already switched from exec to target once.

aignas commented 9 months ago

Ah, thank you for the context. I forgot about that PR, but now I remember. And sorry for the confusion here, everybody.

I guess the solution to the problem would then have to be different. I think the fact that we have the toolchain as one monolithic thing could be an issue here. We may want to look at how the toolchain and the runtime are split in rules_jvm, so that the right runtime is included in py_binary targets but we can still run an interpreter that works on the host platform when targeting other platforms.

EDIT: I'll let Richard explain it better than I can.

rickeylev commented 9 months ago

TCW is correct because the files in the toolchain are configured with cfg=target, not cfg=exec. This means that, when a rule looks up the toolchain, things like PyRuntimeInfo.files aren't guaranteed to be compatible with whatever execution platform was selected during toolchain resolution. In order to get an exec-compatible artifact, the files in the toolchain need to come from a cfg=exec source.

If TCW was changed to ECW, things would break quite badly, as targeting the mac platform may suddenly try to use a linux interpreter.

Also, remember that "target platform" means "the current configuration's target platform", which changes depending on whether a target is in the target or exec configuration; i.e. a build tool's target platform is the platform selected by the exec config. Stated another way: if you want to run a Python program as a build action, creating a py_binary and putting it in an attribute with cfg=exec will cause the py_binary's toolchain lookups to use whatever the exec platform is.

I think the fact that we have the toolchain as one monolithic thing could be an issue here.

Yes.

The short version is: if you want an exec-mode interpreter, then you have two options:

  1. Rather than directly using the toolchain, use a binary with cfg=exec. This is usually what you want to do for a build tool anyway. Binaries are Bazel's abstraction for creating executables and can be passed directly to actions. It's also more flexible in terms of scheduling actions (see below and the sketch that follows).
  2. A new toolchain type needs to be created with TCW empty and ECW set appropriately. The platform restrictions are obvious; I'm not sure what, if any, python-version constraints it would have.

It seems likely that people are fond of directly using the toolchain in actions? I don't really understand why, instead of doing (1), but in any case, I'd be +1 on adding (2).
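
A rough sketch of what (1) could look like (the rule, target, and attribute names here are made up for illustration; //tools:gen is assumed to be a py_binary, and this is not rules_python API):

# gen.bzl -- hypothetical example of consuming a py_binary as a build tool.
def _gen_file_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".out")
    ctx.actions.run(
        # Because the attribute below uses cfg = "exec", this executable (and
        # the interpreter bundled with it) was built for the execution platform.
        executable = ctx.executable._gen,
        arguments = [out.path],
        outputs = [out],
    )
    return [DefaultInfo(files = depset([out]))]

gen_file = rule(
    implementation = _gen_file_impl,
    attrs = {
        "_gen": attr.label(
            default = "//tools:gen",
            executable = True,
            cfg = "exec",
        ),
    },
)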

To understand the "why" of some of this:

With toolchains, the ECW part basically means "this tool must run on $ECW and can create output for $TCW" (if TCW is empty, it means "any platform"). When both TCW and ECW are set, the statement tightens to "this tool must run on $ECW and can only generate output for $TCW".

For the interpreter itself, the latter statement doesn't make much sense -- the interpreter doesn't do anything on its own; it could generate output for any platform.

You might think: what if we just set both to the same value? Then running the exec-config interpreter produces the same values. This sort of works, but is still strictly incorrect (mixing config types); it also makes using multiple other toolchains more problematic, and cross builds need a combinatorial explosion of toolchain definitions. Part of the way toolchain resolution works is that a single (target, exec) tuple compatible with all the toolchains a rule needs is sought. By setting e.g. (tcw=linux, ecw=linux), only a linux host can use that toolchain, and all other toolchains must also have a linux-host-capable variant. This then creates a combinatorial explosion of toolchains for cross builds: to cross build target=linux, host=mac, there needs to be a (tcw=linux, ecw=mac) combination, etc.
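
To make that concrete, here is a hypothetical toolchain registration (not how rules_python actually defines its runtimes) that pins both constraint sets; every additional host platform then multiplies the number of such definitions:

# Hypothetical: a runtime that can only be used when building *for* linux
# *from* a linux host. A cross build from mac would additionally need a
# (tcw=linux, ecw=mac) definition, and so on for every host/target pair.
toolchain(
    name = "py_linux_on_linux",
    toolchain = ":my_py_runtime_pair",  # made-up target
    toolchain_type = "@rules_python//python:toolchain_type",
    target_compatible_with = ["@platforms//os:linux"],
    exec_compatible_with = ["@platforms//os:linux"],
)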

matzipan commented 9 months ago

Thank you all for the in-depth response. I now understand the philosophy of it.

It sort of makes sense. I was not aware of the cfg param. Will give it a shot and report back.

Thanks for all the nice work.

matzipan commented 9 months ago

Hm, okay, so cfg="exec" is only accessible at rule definition time, which means one would have to reimplement all of this just to change that flag.

The second option of having two toolchain definitions, perhaps configurable in some way, then becomes more interesting.

rickeylev commented 9 months ago

reimplement all of this just to change that flag.

I don't follow. I don't see any actions in that file? Is something doing one elsewhere?

If you want to run a rules_py py_binary as an action, then you just need to put cfg=exec on the attribute that is consuming the target. This isn't a special behavior of rules_python or rules_py; it's just how attributes for a rule are defined for any Bazel rule.

I know rules_py tries to set up a venv, but I don't know much about venv setup or how closely coupled venvs are to the target configuration (i.e. the interpreter); I'm assuming fairly closely coupled, as that is rather common with Python tooling. What I can say is that you can't assume the exec and target configuration are the same, and there isn't really a way to force them to be.

Ideally, you want actions to be independent of the target config and take arguments that tell the action about the target config to generate the correct output. e.g. pass --max-int as an argument to a tool instead of using sys.maxint at runtime.

About the closest you can get to forcing them to be the same is to have a toolchain with the same TCW and ECW settings. That will force the two configs to be compatible, insofar as those constraints are concerned (but that doesn't mean the two configs are the same; they could vary in other ways). There are also exec_groups, which can be used to force actions into predefined constraints, but that isn't the same as using the target config. Ah, I almost forgot: also auto exec groups (which I haven't used, but docs indicate they may be of use for such a case).
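
For reference, a rough sketch of the exec_groups mechanism mentioned above (the rule and group names are made up); it lets individual actions be pinned to predefined constraints without touching the target config:

# exec_group_example.bzl -- hypothetical; only illustrates the mechanism.
def _report_arch_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".txt")
    ctx.actions.run_shell(
        command = "uname -m > " + out.path,
        outputs = [out],
        # Schedule this action under the "pinned" exec group's constraints
        # rather than the rule's default execution platform.
        exec_group = "pinned",
    )
    return [DefaultInfo(files = depset([out]))]

report_arch = rule(
    implementation = _report_arch_impl,
    exec_groups = {
        "pinned": exec_group(
            exec_compatible_with = ["@platforms//cpu:x86_64"],
        ),
    },
)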

matzipan commented 9 months ago

If you want to run a rules_py py_binary as an action, then you just need to put cfg=exec on the attribute that is consuming the target

Above, I mentioned two examples:

What I can say is that you can't assume the exec and target configuration are the same, and there isn't really a way to force them to be.

We're all in agreement here. The current platform implementation is a simple concept, but this simplicity brings a lack of expressivity which we seem to be hitting in this case.

novas0x2a commented 3 months ago

Yeah, chiming in here: a side-effect of this is demonstrated in this repo (distilled down from a larger internal setup): https://github.com/novas0x2a/rules-python-host (the examples below are from commit https://github.com/novas0x2a/rules-python-host/commit/9fbc1fdd2f9447bfe8cf53a2acac43bd2be4e307 in case I modify the repo).

Context: I'm on my linux-amd64 laptop. I'm trying to target arm64:

$ bazel test ...
INFO: Analyzed 4 targets (0 packages loaded, 4041 targets configured).
INFO: Found 2 targets and 2 test targets...
INFO: Elapsed time: 2.897s, Critical Path: 2.61s
INFO: 15 processes: 11 internal, 2 linux-sandbox, 2 local.
INFO: Build completed successfully, 15 total actions
//:pip_test                                                              PASSED in 2.3s
//:test                                                                  PASSED in 0.7s

Executed 2 out of 2 tests: 2 tests pass.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

$ bazel test --platforms :arm64 ...
WARNING: Build option --platforms has changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed 4 targets (0 packages loaded, 4033 targets configured).
FAIL: //:pip_test (see /home/mlundy/.cache/bazel/_bazel_mlundy/49f7ad76792f58b5e5139cb5fb9829d8/execroot/__main__/bazel-out/k8-fastbuild/testlogs/pip_test/test.log)
INFO: From Testing //:pip_test:
==================== Test output for //:pip_test:
aarch64-binfmt-P: Could not open '/lib/ld-linux-aarch64.so.1': No such file or directory
================================================================================
FAIL: //:test (see /home/mlundy/.cache/bazel/_bazel_mlundy/49f7ad76792f58b5e5139cb5fb9829d8/execroot/__main__/bazel-out/k8-fastbuild/testlogs/test/test.log)
INFO: From Testing //:test:
==================== Test output for //:test:
aarch64-binfmt-P: Could not open '/lib/ld-linux-aarch64.so.1': No such file or directory
================================================================================
INFO: Found 2 targets and 2 test targets...
INFO: Elapsed time: 0.646s, Critical Path: 0.45s
INFO: 15 processes: 11 internal, 2 linux-sandbox, 2 local.
INFO: Build completed, 2 tests FAILED, 15 total actions
//:pip_test                                                              FAILED in 0.1s
  /home/mlundy/.cache/bazel/_bazel_mlundy/49f7ad76792f58b5e5139cb5fb9829d8/execroot/__main__/bazel-out/k8-fastbuild/testlogs/pip_test/test.log
//:test                                                                  FAILED in 0.0s
  /home/mlundy/.cache/bazel/_bazel_mlundy/49f7ad76792f58b5e5139cb5fb9829d8/execroot/__main__/bazel-out/k8-fastbuild/testlogs/test/test.log

Executed 2 out of 2 tests: 2 fail locally.

Due to the lack of an exec_compatible_with, it selects a python toolchain that's incompatible with the current exec platform.

aignas commented 3 months ago

This is working as intended: you need to have a test executor that supports arm64 when executing the bazel test command, which may require an RBE setup.

If you still want to build arm64 binaries but test them on an amd64 machine, you would need to use transitions to move your binaries to the arm64 configuration and run the test on amd64.

rickeylev commented 3 months ago

This is WAI

Yeah. It looks like it's building an arm64 binary, as requested by the --platforms=arm64 flag, but then running on the local host, which is x86.

If you still want to build arm64 binaries but test them on an amd64 machine, you would need to use transitions to move your binaries to the arm64 configuration and run the test on amd64.

That probably would be the easier way, yes. RBE would be the other way.

You could try to invert the relationship (e.g. a cfg=exec (x86) action runs an emulator tool that takes a cfg=target binary (the actual test, in arm64, set by --platforms) to run under emulation, then figure out how to handle the output of running the emulator).

But note that this doesn't have much to do with the toolchain TCW/ECW settings. I think it might matter if a custom rule is using multiple toolchains whose TCW/ECW settings don't get along (or deceitful things like the autodetecting toolchain), but the issue there is in how the custom rule is consuming toolchains, not in a toolchain's TCW/ECW settings.

novas0x2a commented 3 months ago

Note that this is being triggered by py_test on arch-independent code; if it selected the correct toolchain it should work.

aignas commented 3 months ago

I think it could be possible to tell Bazel that your py_test should always transition to the host platform, and @platforms//host might help here. In general py_test is not necessarily arch-independent and it sometimes depends on deps that are arch-dependent, so although in your case things would/could work, in the generic case you would be in trouble: for example, if you wanted to build a docker container of a py_test target so that you can run the test inside docker as part of your bazel test (i.e. an sh_binary calling the dockerized py_test).

novas0x2a commented 3 months ago

I think it could be possible to tell Bazel that your py_test should always transition to the host platform, and @platforms//host might help here. In general py_test is not necessarily arch-independent and it sometimes depends on deps that are arch-dependent

Yeah, in my above example I oversimplified, but in the original thing it's derived from, we do have both types of test. Currently I have to mark all of the py_tests with target_compatible_with = HOST_CONSTRAINTS, but that basically makes cross-compiling useless. Ideally:

rickeylev commented 3 months ago

"arch independent" is a bit of a hard concept for Bazel to model. About the closest you can do is @platforms//os:any (or is it "all", or "cpu:any"? I can't remember exactly). Note that, when using the hermetic runtimes, py_test isn't arch independent because it includes a prebuilt runtime for a particular platform.

For an example of writing a rule that performs a transition, look at python/config_settings/transition.bzl. It's fairly simple: perform a transition, then symlink some of the outputs of the wrapped target.
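
The general shape of such a rule is roughly the sketch below (this is not the contents of python/config_settings/transition.bzl, and the names are invented; platform_data from rules_platform and the platform_transition_* rules from aspect_bazel_lib are maintained implementations of the same idea):

# platform_files.bzl -- hypothetical sketch of a --platforms transition rule.
def _to_platform_impl(settings, attr):
    # Rebuild the wrapped target for an explicitly chosen platform.
    return {"//command_line_option:platforms": str(attr.platform)}

_to_platform = transition(
    implementation = _to_platform_impl,
    inputs = [],
    outputs = ["//command_line_option:platforms"],
)

def _platform_files_impl(ctx):
    # An attribute with an attached transition comes back as a single-element list.
    dep = ctx.attr.target[0]
    return [DefaultInfo(files = dep[DefaultInfo].files)]

platform_files = rule(
    implementation = _platform_files_impl,
    attrs = {
        "target": attr.label(mandatory = True, cfg = _to_platform),
        "platform": attr.label(mandatory = True),
        "_allowlist_function_transition": attr.label(
            default = "@bazel_tools//tools/allowlists/function_transition_allowlist",
        ),
    },
)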

novas0x2a commented 3 months ago

Note that, when using the hermetic runtimes, py_test isn't arch independent because it includes a prebuilt runtime for a particular platform

Sorry, I was unclear: I meant that all of the explicit inputs are arch-independent; the only arch-dependent bits are the toolchain / runtime. (I understand that this isn't true all the time, but it's true about half the time for us.)

For an example of writing a rule that performs a transition, look at python/config_settings/transition.bzl. It's fairly simple: perform a transition, then symlink some of the outputs of the wrapped target.

Sorry to be a pest, but I don't understand how to do this in practice. The transition isn't the hard part (I've written transitions before; also, aspect's rules for binaries and filegroups work for those providers; it's just painful/fragile that one needs to write a new transition rule for every provider combination one might want to pass on...).

The issue is the rest of it. What do I transition? What might that look like when applied to my toy repo, fixing both the py_test and the compile_pip_requirements? That's the part I don't understand.

groodt commented 3 months ago

Does this post resonate with you, @novas0x2a? https://github.com/bazelbuild/bazel/issues/19645

aignas commented 3 months ago

One idea for how to achieve a py_host_test would be:

load("@bazel_skylib//rules:native_binary.bzl", "native_test")
load("@aspect_bazel_lib//lib:transitions.bzl", "platform_transition_filegroup")

def py_host_test(name, **kwargs):
    py_binary(name = name + ".input", testonly = True, **kwargs)
    platform_transition_filegroup(
        name = name + ".transitioned",
        srcs = [name + ".input"],
        target_platform = "@platforms//host",
    )
    native_test(name = name, src = name + ".transitioned", out = name)

However, maybe it would be better to lift the example from rules_platform and not use --platforms in bazelrc:


cc_binary(name = "foo")

platform_data(
    name = "foo_embedded",
    target = ":foo",
    platform = "//my/new:platform",
)

py_test(
    name = "foo_test",
    srcs = ...,
    data = [
        ":foo_embedded",
    ],
)

novas0x2a commented 3 months ago

(sorry, was away from this for a few days) @groodt yeah, it does. I can see why rules_python chose to do it this way, even if it's personally annoying for me that it causes bazel to choose a toolchain that doesn't work on my exec platform :) It doesn't seem like there's a Correct Answer right now.

@aignas thanks a ton, I get it now. For those who stumble upon this later, here's the diff I made to my sample project that made it work. Be careful about just copying it; I didn't ensure that I was handling all possible args correctly (just enough to prove it worked), so there are probably some that would get passed to the wrong rule.

Re: platform_data vs platform_transition_binary, I think they both use the same //command_line_option:platforms-based transition (platform_data / aspect-lib)

brentleyjones commented 2 months ago

Doesn't it just need to register both target and exec constraints to the same thing (unlike before this change which only had exec constraints)? I'm running into an issue where an exec tool is incorrectly bundling a target interpreter in a cross-platform build.

Edit: Doing that didn't fix my issue; something else is preventing the exec platform from being correct for me. Can ignore for now.