py_binary located via manifest-based runfiles lib fails when its own runfiles tree isn't built

pcjanzen commented 4 years ago

Description of the problem / feature request:

Consider a Starlark-generated executable target, which depends at runtime on a py_binary by placing it in its data attr, and which locates the path to that py_binary using a runfiles library that uses runfiles_manifest files.

In this scenario, the target will execute the py_binary using the "real" path of the binary inside bazel-bin, not a path relative to the original target's runfiles.

Bazel doesn't necessarily build the py_binary's own runfiles tree, so the python stub, which only looks at its own sys.argv[0], can't find its module space.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Expand Python-binary-failure-example.zip then run bazel run //:exe

What operating system are you running Bazel on?

macosx

What's the output of `bazel info release`?

release 3.4.1

pauldraper commented 2 years ago

so the python stub, which only looks at its own sys.argv[0], can't find its module space.

The usual fix is to require calling executables to set RUNFILES_DIR environment variable, and then use that if it is set.

rickeylev commented 2 years ago

A few observations.

$(rlocation) is returning a path that is outside of runfiles. $(rlocation workspace/inner_bin) is returning

/home/<user>/.cache/bazel/_bazel_<user>/<hash>/execroot/<workspace>/bazel-out/<config>/bin/inner_bin (I'm calling this the "canonical" path, since its the real file everything else symlinks to)

not e.g. /<snip>/outer_bin.runfiles/<workspace>/inner_bin. (when invoked like this, it works)

If outer_bin has a data dependency on inner_bin, then the official location (wrt outer bin) of inner_bin is $outer_bin_runfiles_dir/<workspace>/inner_bin. Sure, the one under outer bin's runfiles is a symlink, but this still seems problematic (also, this sort of thing could be viewed as a potential path-escape vulnerability, as you're forced to accept arbitrary paths that may not actually be in runfiles and shouldn't be accessible). FWIW, internally at Google, I think I've only ever seen the runfiles-relative paths used in e.g. shell programs, e.g. `$RUNFILES/inner_bin --whatever", and, similarly, internally, we always use a runfiles-relative path for accessing runfiles data.

Otherwise, to make this work, the runfiles directory of inner_bin needs to be materialized in the blaze-out directory along side the actual binary. When the canonical path is invoked directly, it's lost any context that it's within another binary's runfiles. This would probably be more correct anyways, as it would prevent the outer bin's runfiles from being accessible to the inner bin. That said, that might be a hard change because for the inner binary to find its correct runfiles, it'll have to find its canonical executable (e.g. realpath argv[0]).

(Note I'm talking about a single extra runfiles at the top level where the canonical bin executable is, not a runfiles tree within another runfiles tree (which could then have more runfiles trees within it))

Such a change seems like it is at odds with the existing guidance in this area, though? e.g. that "{outer}.runfiles/inner" is the official location, and that using the outer bin's runfiles for the inner's is the suggested fix/workaround.

The logic in the stub script has logic to match any *.runfiles src; I'm willing to bet that logic is there to make binaries-nested-in-binaries Just Work without additional envvar setup.

The stub script doesn't appear to care about RUNFILES_DIR or RUNFILES_MANIFEST_FILE, at least not during FindModuleSpace(), the function that looks for the runfiles directory. Some later logic does, but they don't appear to influence locating the runfiles for the stub script startup. That sounds like a bug; isn't the point of those envvars to explicitly tell the runfiles location? Making FindModuleSpace() also check those looks like a simple change. This seems like an obvious oversight, though, so it makes me wonder if there's something in the change history that might explain why it doesn't.

pcjanzen commented 2 years ago

Right. I tried to avoid assigning blame here, because this bug requires all of the following to be true:

Bazel emits the "canonical path" for inner_bin into the runfiles manifest for outer_bin.
Bazel does not materialize the canonical runfiles for an inner_bin that is a py_binary when building outer_bin. (Apparently, it does materialize it when outer_bin merges the py_binary's data_runfiles, instead of its default_runfiles, but the docs explicitly discourage this).
The runfiles resolution prefers the manifest to the runfiles_dir when both are present.
The python stub script does not honor $RUNFILES_DIR in FindModuleSpace().

and all of these probably make sense in some context.

The stub script doesn't appear to care about RUNFILES_DIR or RUNFILES_MANIFEST_FILE

If the stub script preferred $RUNFILES_MANIFEST_FILE over $RUNFILES_DIR, like current runfiles libraries do, then it would wind up in the same situation. Same with the RunfilesEnvvar function already in the stub.

rickeylev commented 2 years ago

Apparently, it does materialize it when outer_bin merges the py_binary's data_runfiles

Oh huh, yes, I didn't notice that. That is weird. They have the same contents (as printed from starlark); I don't see any difference. Additionally, if the contents of the data runfiles are copied into a new runfiles object[1], then it breaks again.

So it's almost like the original, Java created, data_runfiles Runfiles object is carrying some extra bit that later causes the runfiles dir to be created?

[1] like so, though I'll note only .files has any values in my repro (based upon the repro code in issue 14608)

    o_data = ctx.attr._tool[DefaultInfo].data_runfiles
    new_data = ctx.runfiles(
        transitive_files = o_data.files,
        symlinks = {v.path: v.target_file for v in o_data.symlinks.to_list()},
        root_symlinks = {v.path: v.target_file for v in o_data.root_symlinks.to_list()},
    )
    <merge new_data into final runfiles>

pcjanzen commented 2 years ago

If i'm reading this correctly, only the data_runfiles depend on this middleman: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/rules/python/PyExecutable.java;l=87

rickeylev commented 2 years ago

Maybe this has something to do with the runfiles "middlemen". I was never able to really figure out what these middlemen things are, but the comments suggest something about tying together the owning executable and necessary files:

This appears related to the --experimental_build_transitive_python_runfiles flag. That flag defaults to true. if set to false, then using the original data_runfiles also results in a failure.

So, yeah, definitely seems like that bit of logic is the key piece.

pauldraper commented 2 years ago

The stub script doesn't appear to care about RUNFILES_DIR or RUNFILES_MANIFEST_FILE, at least not during FindModuleSpace()...That sounds like a bug; isn't the point of those envvars to explicitly tell the runfiles location? Making FindModuleSpace() also check those looks like a simple change. This seems like an obvious oversight, though, so it makes me wonder if there's something in the change history that might explain why it doesn't.

Agreed.

rickeylev commented 2 years ago

Maybe this has something to do with the runfiles "middlemen".

Well, I found some more info about this and it seems in the same vein as this bug.

Basically, there is something special about that middleman object that causes an inner binary's runfiles to be updated (on disk) when an outer binary is built. Back in March 2015, lberki attempted to remove the middleman from the data_runfiles. It ended up causing a breakage, so it was added back a couple months later. At some later point, the --experimental_build_transitive_python_runfiles flag was introduced (in a seemingly unrelated CL), and then later it became a guard for adding the middleman to data_runfiles (the change history for this part is a bit murky). On the plus side, I think I now understand what that flag name is trying to communicate: whether the runfiles directory of a transitive binary itself should also be updated.

The breakage in question was pretty obscure: people were building an inner binary, modifying it (e.g. changing deps) (but not rebuilding it), then building an outer binary, and expecting that the outer binary's invocation of the inner binary would work (as if you had re-built inner as well). e.g.

# outer.py basically just does subprocess("$runfiles_dir/inner")
py_binary(outer, srcs=[outer.py], data=[inner])
# inner.py basically just does "import old.py"
py_binary(inner, srcs=[inner.py, old.py])

# build :inner -> after this, inner.runfiles has "old.py" in it

# replace "old.py" with "new.py"; change inner.py to "import new" instead
# build :outer -> after this, outer.runfiles has new.py in it due to runfiles merging
blaze-bin/outer

This would work because, when the outer binary built the inner binary, the middleman object causes the inner's runfiles directory from the first build (where only inner was built) to also be updated (i.e., new.py shows up). Without the middleman, then the inner's runfiles directory isn't updated (i.e new.py doesn't show up) (but the merged runfiles of the outer's is) when outer is built.

This particular use case failing might be specific to our Google build, but haven't checked. Google's python_stub_template.txt will look for a runfiles directory after resolving symlinks, while Bazel's looks for a runfiles directory before resolving symlinks. I think the net effect means, with Bazel, it see's the "outer.runfiles" in the path and uses that, while with Google's build, it "escapes" outside of "outer.runfiles" and then starts looking at the inner.runfiles directory directly.

bazelbuild / bazel