bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.2k stars 4.06k forks source link

Preserve symlink structure of remote execution inputs #23620

Open tpudlik opened 1 month ago

tpudlik commented 1 month ago

Description of the feature request:

If source artifact inputs of build actions include symlinks, these symlinks are represented as regular files when the build action is executed remotely. This can break certain inputs, in particular LLVM built in the "busybox" configuration. The FR is to preserve the symlink structure instead.

Let me unpack this a little bit.

Current Bazel behavior

Let me quote @tjgq from an internal conversation we've had about this:

  1. Every artifact tracked by Bazel has a type (file, directory or symlink) associated with it, which defines how its identity is determined - for artifacts of file or directory type, the content matters; for artifacts of symlink type, only the textual target path of the symlink does

  2. However (and this is where it gets complicated) the "Bazel type" of an artifact is somewhat independent from its filesystem representation. It's possible to have an artifact of file or directory type which is materialized in the filesystem as a symlink, in which case Bazel cares about the bytes at the other end of the symlink, not the path the symlink points to

  3. Source artifacts are, at the moment, always of file type; only output artifacts can be of directory or symlink type (that would be a ctx.actions.declare_directory or a ctx.actions.declare_symlink, respectively)

  4. For artifacts of file type, when we construct the input representation of a remote action, we always represent them as regular files, never symlinks, even if their filesystem representation is a symlink; so their "symlinkness" is effectively lost when the action is executed remotely

How this breaks LLVM toolchains

We use a hermetic LLVM toolchain, and that toolchain is part of the build inputs. The toolchain includes a bunch of "binaries" like bin/clang, bin/clang++, bin/lld, etc. But in fact, the LLVM version we use employs a "busybox" architecture, where these binaries are all symlinks to bin/llvm. However! Invoking bin/clang is not actually equivalent to invoking bin/llvm: the binary examines its argv[0], and behaves differently when invoked via symlink.

What is more, in some situations llvm will re-invoke itself. On the first invocation, we need the argv[0] to be clang. On the re-invocation, llvm will use the path from /proc/self/exe, which needs to end in llvm. If we merely have a copy, argv[0] is clang both times, producing errors like https://pwbug.dev/issues/364781685. I am not a toolchain expert, but I discussed this with some, and they assure me this behavior (i.e., reading /proc/self/exe an assuming it points to llvm and not e.g. clang, rather than just setting it to llvm) is unfortunately necessary due to the treatment of Clang reproducers and -canonical-prefixes (although I confess I could not follow their explanation).

Workarounds

There are workarounds for this issue:

  1. Replace bin/clang (etc) with symlinks created by ctx.actions.declare_symlink. Such Bazel-created symlinks will be faithfully sent to RBE.
  2. Wrap bin/clang (etc) in bash scripts like,

    #!/bin/bash
    exec -a clang $(location //:bin/llvm)  "$@"

    This has the advantage that no custom rules are required, you just genrule the wrapper scripts into existence. These wrapper scripts (thanks to the exec -a clang) have the same magic property as the symlinks, i.e. that argv[0] is different from the actual executed binary basename. However, this requires bash (/bin/sh doesn't support the -a flag).

However, this is definitely a sharp edge and it would be nice to remove it.

Further reading for Googlers

See internal discussions of this problem for more details:

Which category does this issue belong to?

Remote Execution

What underlying problem are you trying to solve with this feature?

No response

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

development version

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

I'm on d62e0a0f32188e1875bb8e62ef4377ea4dc1aab2, fetched by Bazelisk (so, Bazel 8 pre-release).

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

Remarkably, not really, this seems to be a pretty edge-case issue!

Any other information, logs, or outputs that you want to share?

For folks who run into similar issues in the future to find this: the cryptic errors produced by clang are,

clang: error: unknown argument: '-cc1'
clang: error: unknown argument '-triple'; did you mean '-Xclang -triple'?
clang: error: unknown argument '-clear-ast-before-backend'; did you mean '-Xclang -clear-ast-before-backend'?
clang: error: unknown argument '-main-file-name'; did you mean '-Xclang -main-file-name'?
clang: error: unknown argument '-mrelocation-model'; did you mean '-Xclang -mrelocation-model'?
clang: error: unknown argument '-pic-level'; did you mean '-Xclang -pic-level'?
clang: error: unknown argument '-fhalf-no-semantic-interposition'; did you mean '-Xclang -fhalf-no-semantic-interposition'?
clang: error: unknown argument '-mframe-pointer=all'; did you mean '-Xclang -mframe-pointer=all'?
clang: error: unknown argument '-nostdsysteminc'; did you mean '-Xclang -nostdsysteminc'?
clang: error: unknown argument '-target-cpu'; did you mean '-Xclang -target-cpu'?
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument '-crc'; did you mean '-mcrc'?
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument: '-ras'
clang: error: unknown argument '-target-feature'; did you mean '-Xclang -target-feature'?
clang: error: unknown argument: '-sb'

To make progress debugging this issue, you need to run clang under strace.

tjgq commented 1 month ago

Related Slack discussion: https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1720796367796929 (interestingly, it's also an issue with clang, but in a different setting)

To recap the discussion on that thread: my opinion is that, if we're going to fix this, we should do it by providing some sort of API to mark certain source artifacts as "potentially symlinks". For these artifacts, a symlink would be textually sent to the remote execution environment; if it's meant to be resolved remotely, it would be the user's responsibility to ensure that the file at the other end is also present in the action inputs (as well as any other intervening symlinks, if there are multiple layers of indirection). In particular, this probably means that only relative symlinks would be expected to work.

I'm not convinced that we can redefine the behavior for all source artifacts, as it's possible that someone might be using symlinks that can only be resolved locally (and relying on the implicit conversion to a regular file). I'd also prefer to avoid solutions that require Bazel to interpret or transform the text of the symlink in any way.

tjgq commented 1 month ago

https://github.com/bazelbuild/bazel/issues/16712 is the missing feature I think we'd need before we can tackle this.

tpudlik commented 1 month ago

Why is #16712 required? Since "the user's responsibility to ensure that the file at the other end is also present in the action inputs", we don't expect these symlinks to be dangling.

tjgq commented 1 month ago

I think it's just difficult to phrase this accurately sometimes. The important distinction is between "Bazel cares about the contents" vs. "Bazel cares solely about the result of readlink()". In the second case, whether the symlink dangles or not doesn't matter (it looks the same from Bazel's perspective). If the user does want to dereference it, they have that responsibility, but it could equally be used in situations where it's expected to dangle.