Open tpudlik opened 1 month ago
Related Slack discussion: https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1720796367796929 (interestingly, it's also an issue with clang, but in a different setting)
To recap the discussion on that thread: my opinion is that, if we're going to fix this, we should do it by providing some sort of API to mark certain source artifacts as "potentially symlinks". For these artifacts, a symlink would be textually sent to the remote execution environment; if it's meant to be resolved remotely, it would be the user's responsibility to ensure that the file at the other end is also present in the action inputs (as well as any other intervening symlinks, if there are multiple layers of indirection). In particular, this probably means that only relative symlinks would be expected to work.
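The "user's responsibility" part of this proposal can be sketched as a local pre-flight check. This is only an illustration, not anything Bazel provides: `symlink_resolves_within` is a hypothetical helper, it assumes a `readlink` that supports `-f`, and it only rejects an absolute target on the first hop (intervening links are followed but not individually inspected).

```shell
# Hypothetical helper: succeed only if "$2" is a symlink with a relative
# target whose fully resolved path exists underneath the directory "$1".
symlink_resolves_within() {
  root=$(cd "$1" && pwd -P) || return 1
  link=$2
  [ -L "$link" ] || return 1            # not a symlink at all
  case $(readlink "$link") in
    /*) return 1 ;;                     # absolute target: not expected to work remotely
  esac
  resolved=$(readlink -f "$link") || return 1
  [ -e "$resolved" ] || return 1        # dangling: the target is missing
  case $resolved in
    "$root"/*) return 0 ;;              # target lives inside the input root
    *) return 1 ;;                      # target escapes the input root
  esac
}

# Example (paths are placeholders):
#   symlink_resolves_within path/to/inputs path/to/inputs/bin/clang
```

A symlink that fails this check could still be shipped textually under the proposal; it just would not be expected to resolve on the remote side.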
I'm not convinced that we can redefine the behavior for all source artifacts, as it's possible that someone might be using symlinks that can only be resolved locally (and relying on the implicit conversion to a regular file). I'd also prefer to avoid solutions that require Bazel to interpret or transform the text of the symlink in any way.
https://github.com/bazelbuild/bazel/issues/16712 is the missing feature I think we'd need before we can tackle this.
Why is #16712 required? Since it is "the user's responsibility to ensure that the file at the other end is also present in the action inputs", we don't expect these symlinks to dangle.
I think it's just difficult to phrase this accurately sometimes. The important distinction is between "Bazel cares about the contents" vs. "Bazel cares solely about the result of `readlink()`". In the second case, whether the symlink dangles or not doesn't matter (it looks the same from Bazel's perspective). If the user does want to dereference it, they have that responsibility, but it could equally be used in situations where it's expected to dangle.
Description of the feature request:
If source artifact inputs of build actions include symlinks, these symlinks are represented as regular files when the build action is executed remotely. This can break certain inputs, in particular LLVM built in the "busybox" configuration. The FR is to preserve the symlink structure instead.
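The difference between a preserved symlink and a dereferenced copy can be reproduced locally with a toy layout. This is a sketch under assumptions: `cp -RL` merely stands in for the "represented as regular files" behavior (it is not how Bazel stages inputs), the file contents are placeholders, and the helper names are made up for illustration.

```shell
# Build a toy "busybox" layout under "$1": bin/llvm is a regular file and
# bin/clang is a relative symlink pointing at it.
make_toolchain() {
  mkdir -p "$1/bin"
  printf 'placeholder for the real llvm binary\n' > "$1/bin/llvm"
  ln -s llvm "$1/bin/clang"
}

# Stand-in for the behavior described above: copy "$1" to the not yet
# existing path "$2", turning every symlink into a regular file.
stage_dereferenced() {
  cp -RL "$1" "$2"
}

# Basename of the fully resolved path, i.e. the name a lookup that follows
# symlinks (as the kernel does for /proc/self/exe) would end in.
resolved_name() {
  basename "$(readlink -f "$1")"
}

# Example:
#   make_toolchain /tmp/local
#   stage_dereferenced /tmp/local /tmp/remote
#   resolved_name /tmp/local/bin/clang
#   resolved_name /tmp/remote/bin/clang
```

In the symlinked tree, `bin/clang` resolves to a path ending in `llvm`. In the dereferenced copy it is a regular file, so the resolved name stays `clang`.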
Let me unpack this a little bit.
Current Bazel behavior
Let me quote @tjgq from an internal conversation we've had about this:
How this breaks LLVM toolchains
We use a hermetic LLVM toolchain, and that toolchain is part of the build inputs. The toolchain includes a bunch of "binaries" like `bin/clang`, `bin/clang++`, `bin/lld`, etc. But in fact, the LLVM version we use employs a "busybox" architecture, where these binaries are all symlinks to `bin/llvm`. However! Invoking `bin/clang` is not actually equivalent to invoking `bin/llvm`: the binary examines its `argv[0]`, and behaves differently when invoked via symlink.
What is more, in some situations llvm will re-invoke itself. On the first invocation, we need the `argv[0]` to be `clang`. On the re-invocation, llvm will use the path from `/proc/self/exe`, which needs to end in `llvm`. If we merely have a copy, `argv[0]` is `clang` both times, producing errors like https://pwbug.dev/issues/364781685. I am not a toolchain expert, but I discussed this with some, and they assure me this behavior (i.e., reading `/proc/self/exe` and assuming it points to `llvm` and not e.g. `clang`, rather than just setting it to `llvm`) is unfortunately necessary due to the treatment of Clang reproducers and `-canonical-prefixes` (although I confess I could not follow their explanation).
Workarounds
There are workarounds for this issue:
- Replace `bin/clang` (etc) with symlinks created by `ctx.actions.declare_symlink`. Such Bazel-created symlinks will be faithfully sent to RBE.
- Wrap `bin/clang` (etc) in bash scripts like,
  This has the advantage that no custom rules are required: you just genrule the wrapper scripts into existence. These wrapper scripts (thanks to the `exec -a clang`) have the same magic property as the symlinks, i.e. that `argv[0]` is different from the actual executed binary basename. However, this requires bash (`/bin/sh` doesn't support the `-a` flag).
Even so, this is definitely a sharp edge and it would be nice to remove it.
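For the wrapper-script workaround, a minimal sketch of generating such a script might look like the following. It is not the exact script from the original report: `write_argv0_wrapper` is a hypothetical helper, the target path is a placeholder, and paths containing double quotes are not handled. How the wrapper locates `bin/llvm` in a real build (execroot-relative, runfiles, etc.) is left open.

```shell
# Hypothetical helper: write a bash wrapper to "$3" that runs the binary
# "$2" with argv[0] set to "$1", forwarding all arguments unchanged.
write_argv0_wrapper() {
  {
    printf '#!/usr/bin/env bash\n'
    # exec -a is a bash builtin feature, hence the bash shebang above.
    printf 'exec -a "%s" "%s" "$@"\n' "$1" "$2"
  } > "$3" && chmod +x "$3"
}

# Example (the toolchain path is a placeholder):
#   write_argv0_wrapper clang /path/to/toolchain/bin/llvm ./clang
```

Because the wrapper execs the real `llvm` file, the executed binary keeps its `llvm` basename while the program still sees `clang` as `argv[0]`.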
Further reading for Googlers
See internal discussions of this problem for more details:
Which category does this issue belong to?
Remote Execution
What underlying problem are you trying to solve with this feature?
No response
Which operating system are you running Bazel on?
No response
What is the output of `bazel info release`?
development version
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
I'm on d62e0a0f32188e1875bb8e62ef4377ea4dc1aab2, fetched by Bazelisk (so, Bazel 8 pre-release).
What's the output of `git remote get-url origin; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
Remarkably, not really; this seems to be a pretty edge-case issue!
Any other information, logs, or outputs that you want to share?
For folks who run into similar issues in the future to find this: the cryptic errors produced by clang are,
To make progress debugging this issue, you need to run clang under `strace`.
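One possible way to do that is sketched below. It assumes a reasonably recent `strace` (one that understands the `%file` syscall class) and that the interesting calls are the `execve` of the re-invocation and the lookup of `/proc/self/exe`. The helper name and the clang arguments are placeholders.

```shell
# Run a command under strace, following child processes (-f) and logging
# syscalls that take a file name (execve, readlink, openat, ...) to "$1".
trace_file_syscalls() {
  log=$1
  shift
  strace -f -e trace=%file -o "$log" "$@"
}

# Example (the clang arguments are placeholders):
#   trace_file_syscalls /tmp/clang.strace bin/clang -c hello.c
#   grep -E 'execve|/proc/self/exe' /tmp/clang.strace
```

Comparing the logged `execve` paths between a symlinked and a copied toolchain should show where the two diverge.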