Open dmivankov opened 4 years ago
extra info
$ bazel build :main
$ bazel run --shell_executable=$(pwd)/bazel-bin/main :main2 -- "d e"
1: -c
2: SOME_PATH/execroot/__main__/bazel-out/k8-fastbuild/bin/main2 a b c 'd e'
also
BUILD file args=[" 'x' "] -> argv = ["x"]
bazel run -- " 'x' " -> argv = [" 'x' "]
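The splitting above can be reproduced outside Bazel. Python's shlex module follows roughly the same Bourne-shell tokenization rules (shlex is used here only as an illustration, not by Bazel itself):

```python
import shlex

# Each args entry undergoes Bourne-shell tokenization; Python's shlex
# approximates the same rules (quotes are consumed, whitespace splits).
print(shlex.split(" 'x' "))        # BUILD args=[" 'x' "] -> ['x']
print(shlex.split("a b c 'd e'"))  # -> ['a', 'b', 'c', 'd e']
```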
Looks like the issue is https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java#L178 calling https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java#L429, which has a ".tokenized()" call inside.
Not sure who owns RunfilesSupport, but this looks like a real logic bug.
To my reading, https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args specifically says that args
undergoes Bourne shell tokenization, which explains the splitting. I don't understand what the bug is.
May be just a documentation issue then https://docs.bazel.build/versions/master/be/common-definitions.html#sh-tokenization
Certain string attributes of some rules are split into multiple words according to the tokenization rules of the Bourne shell
while https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args
args | List of strings;
Intuitively, when you supply a list of strings, you don't expect split-and-recombine behavior; besides, sh-tokenization is documented as applying to string attributes, not to lists of strings.
If I recall correctly, this has come up a few times before. I agree that the behavior should be changed to avoid shell tokenization; however, this is a difficult transition. It's not enough to have a flag because there's no way to write args in a way that is compatible with and without shell tokenization.
One way to migrate is
args = ["--foo"] + tokenize(something) + ["abc"]
# or
args = ["--foo"] + tokenize(something1, something2) + ["abc"]
which would work regardless of the flag, given that the free-standing args contain no splits.
We also probably need a builtin quote(string_with_spaces_or_quotes): string -> string. So canonical arguments would be
args = ["--foo", "abc", quote(" x "), quote("'"), quote("\'x\'")] + tokenize("a \"b b\"") + tokenize("c d", "e f")
# meaning
argv = ["--foo", "abc", " x ", "'", "\'x\'", "a", "b b", "c", "d", "e", "f"]
and rejected arguments would be
args = ["a b", " x ", "\"x\"", "\'y\'", "\"x y\"", "a \"b\" c" ] + tokenize("a \" b")
Note that currently unquoting seems to do funny and unpredictable things
args = ["a", " \"b c\" ", " \"\"b\" c\" ", "'a ' ", "'d'", "\"e\"", "\"\"x\"\"", "\"'y'\"", "''\"''\"\"\"z\"\"\"''\"''", "''zz''", "'''xx'''"]
results in
1: a
2: b c
3: b c
4: a
5: d
6: e
7: x
8: 'y'
9: ''z''
10: zz
11: xx
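The per-entry tokenize-then-flatten behavior can be approximated with shlex for the simpler entries above (an illustration, not Bazel's actual code path):

```python
import shlex

# Bazel appears to tokenize each args entry separately and then flatten
# the results into one argv; shlex approximates this for simple cases.
entries = ["a", ' "b c" ', "'a ' ", "''zz''"]
argv = []
for entry in entries:
    argv.extend(shlex.split(entry))
print(argv)  # ['a', 'b c', 'a ', 'zz']
```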
I'll do some thinking about this. To me, a list of args should be passed 1:1 as the arg list of the called process. The whole concept of shell tokenization is wrong.
OTOH, I would guess there are a lot of shell genrules that might break if we change this, and Windows does not always parse well without quotes, so this is not a simple bug fix.
My hope is that we can devise a mechanism so that binaries can indicate that they parse args correctly, and thus no tokenization is needed. What you list in args is what you get.
@dmivankov, if I understand correctly, you're suggesting to add a flag that changes how Bazel parses args and simultaneously changes how quote and tokenize behave internally?
You didn't say that second part explicitly. If we don't change how quote and tokenize behave internally, then this would cause double-tokenization unless we can find a way to signal inline that a string should not be tokenized inside Bazel. However, the downside of this approach is that it would be confusing that tokenize doesn't actually tokenize unless the flag is set.
I think the most straightforward way to do this is to add a tag tokenize-args (alternatively, use a full attribute). This would support a migration (add tokenize-args on all rules, flip the flag to not tokenize by default, migrate all rules to explicitly tokenize using a separate method, and remove the tag).
tokenize-args, or even just an argv=[..] attribute that is never tokenized, are also valid options.
With tokenize/quote we'd either need them to produce a tagged string/list of strings that indicates whether it is wrapped (and into which of them) or not, or indeed make them behave differently depending on the flag. Technically we could also
As another data point, this is also present in the defines attribute of cc_* rules and friends.
It was added ostensibly for copts compatibility, but defines errors if you have >1 token, which makes this quite a bit less useful anyhow.
defines might be a whole lot easier to clean up, lending itself to having a per-package feature or something, IDK?
@aiuto how should we categorize this issue?
I don't know.
This is one of those things that does not fit any of the teams. It is execution semantics, but neither local nor remote.
So, either local-exec or xproduct, where some day we will re-triage it.
I think the default implementation should be the obvious one. Parameters in the list should be passed to the binary without any changes.
Doing this might break others who had to work around this. So anybody fixing the implementation should also put it behind an incompatible flag.
+1. But the obvious thing is that args should be passed as is. Going back to the examples, main2 should print
1: a
2: b c
What do you think should be the behavior of $(execpaths ...) with this flag? It would probably be more useful if that would still result in multiple arguments.
I think all args in a list should be passed as single strings and never retokenized by Blaze. So, if there are paths in execpaths that have, for example, spaces in them, they should be expanded in a way that preserves the spaces on invocation of the command line.
@aiuto I agree, but what about $(execpaths ...) resolving to multiple file paths (note the plural form)? In that case you would certainly want the result to be 'file1' 'file2' 'file3' on the command line rather than 'file1 file2 file3', but the $(execpaths ...) would still appear in a single args list entry.
For unrelated reasons, I have thought about rewriting the LocationExpander logic to return a CustomCommandLine instead of strings. That could solve this issue.
TBH, I'm not wrapping my head around that use case right now. That is, I don't have a mental model of when to use execpaths. I'm not sure I ever have used it. So, .... I don't know?
The primary use case appears to be complex scripts in genrules, e.g. https://github.com/bazelbuild/bazel/blob/df42879021d66974ddfb74640759d727ebdf8d76/scripts/packages/debian/BUILD#L141-L151 or https://github.com/protocolbuffers/protobuf/blob/7b42f1c08b534f9e43629284e22acce6b13e7136/ruby/BUILD.bazel#L155-L174.
Ideally these would be converted to proper binaries with command-line arguments and custom actions using e.g. Args.add_all(ctx.attr.debian_files, format_each="--debian-file=%s"). That would resolve quoting/tokenization issues and obviate the need for shell quoting. So maybe we could consider any use of execpaths etc. a code smell?
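As a hedged sketch of that conversion (names like _tool and srcs are placeholders, not taken from the linked BUILD files), a custom Starlark rule could pass files as explicit flags instead of interpolating $(execpaths ...) into a shell command:

```python
# Hypothetical Starlark rule sketch: one argv entry per file, no shell.
def _impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".out")
    args = ctx.actions.args()
    args.add("--output", out)
    # Emits one "--debian-file=<path>" argument per input file; spaces
    # in paths survive because no shell tokenization is involved.
    args.add_all(ctx.files.srcs, format_each = "--debian-file=%s")
    ctx.actions.run(
        executable = ctx.executable._tool,  # placeholder tool attribute
        arguments = [args],
        inputs = ctx.files.srcs,
        outputs = [out],
    )
    return [DefaultInfo(files = depset([out]))]
```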
Based on the discussion so far and ignoring defines for the moment, I could see this being a reasonable path forward:
1. Treat $(FOO) and $(location //foo:bar) as if they were tokens that don't contain any spaces.
2. Add a flag --incompatible_no_args_tokenization that, if flipped, makes it so that all args values are treated literally with no postprocessing, with the exception of those targets that are tagged with tokenize-args.
3. Tag with tokenize-args those targets that have args values that could possibly be subject to tokenization.
4. Document tokenize-args as a temporary catch-all and the tokenize function in Skylib as a permanent but not 100% compatible alternative.
@comius @lberki What do you think?
Maybe in addition to the tokenize-args tag there should be a no-tokenize-args tag so that people can opt out of tokenization eagerly, independent of the flag.
Or maybe tokenize-args should be a feature so that users can say repo(default_features = ['-tokenize_args']) in their REPO.bazel?
I recently wrote a Starlark rule that treats args literally: without $(location) expansion, without escaping being necessary (i.e. $$), and without environment variable expansion (at least not in Starlark), and I couldn't be happier.
Context: pytest, which requires a "main" wrapper calling pytest.main().
def _wrapper_pytest_impl(ctx):
    # store pytest args in an expanded template of wrapper_pytest.py
    output = ctx.actions.declare_file(ctx.attr.name)
    _flags = '\"' + '\",\n \"'.join(ctx.attr.args) + '\"'  # quoted
    _srcs = '\"' + '\",\n \"'.join([x.path for x in ctx.files.srcs]) + '\"'  # quoted
    ctx.actions.write(
        output = output,
        content = """# -*- coding: utf-8 -*-
# standard libraries
import os
import re
import sys

# third party libraries
import pytest

flags = [
""" + _flags + """
]
srcs = [
""" + _srcs + """
]

# n.b. single dollar signs in $ENV expansion only; no shell escaping required
# TODO(kjw): this regex sucks
expanded_flags = [re.sub(r'\\$(\\S+)', lambda mo: os.environ.get(mo.group(1), ''), x) for x in flags]

if __name__ == "__main__":
    sys.exit(pytest.main(expanded_flags + srcs + sys.argv[1:]))
""",
        is_executable = True,
    )
    return [DefaultInfo(files = depset([output]))]

wrapper_pytest = rule(
    implementation = _wrapper_pytest_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "args": attr.string_list(),
        "output": attr.output(),
    },
)
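A hypothetical BUILD usage of the rule above (target and file names are made up):

```python
# BUILD (hypothetical): args entries reach the generated wrapper verbatim.
wrapper_pytest(
    name = "pytest_main.py",
    srcs = ["test_foo.py"],
    args = ["-v", "-k=a and b"],  # no shell tokenization in this rule
)
```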
Caveat: I'm a starlark noob, so I'm sure that there are better ways (i.e. with an Args object) to do this.
Yes, this means that some our_py_test(args=...) usage looks funny, but it's correct if you think of Python's subprocess.Popen([cmd]) usage, i.e.:
our_py_test(args=["-k=a and b"])
Add in the complication that not all CLI parsers support single-character flags with equals (i.e. -k=). Preferably one should switch to the --flag= format, which typically is supported; otherwise you're left with ["-k", "a and b"], which is slightly harder to read, and people not used to strict interpretation of args will be confused.
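To make the two interpretations concrete (shlex again stands in for the tokenizer):

```python
import shlex

# Literal list semantics: one list entry is exactly one argv element,
# the same contract as subprocess.Popen(["pytest", "-k=a and b"]).
literal = ["-k=a and b"]

# Shell-tokenized semantics: the same entry splits into three words.
tokenized = shlex.split("-k=a and b")
print(tokenized)  # ['-k=a', 'and', 'b']

# Tokenization-proof fallback when a parser rejects -k= with equals:
fallback = ["-k", "a and b"]
```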
Description of the problem / feature request:
Arguments set by the args attribute of *_binary get split
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
BUILD:
main.c:
Testing:
What operating system are you running Bazel on?
Linux/NixOS
What's the output of bazel info release?
release 3.3.1- (@non-git)
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
Comes from NixOS packages: nix-env -iA nixpkgs.bazel
Any other information, logs, or outputs that you want to share?
Luckily the tokenizer treats ">/dev/null" as a string rather than a redirection. Could've been an injection vulnerability otherwise.
Also tested with recent git version 6bf44ba9339a59c852a5da5f72e14bdb87af6483