bazelbuild / bazel


*_binary args get split #12313

Open dmivankov opened 4 years ago

dmivankov commented 4 years ago

Description of the problem / feature request:

Arguments set by the args attribute of *_binary rules get split.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

BUILD:

cc_binary(
  name = "main",
  srcs = [":main.c"],
)
cc_binary(
  name = "main2",
  srcs = [":main.c"],
  args = ["a", "b c"], # "b c" gets split
)
cc_binary(
  name = "main3",
  srcs = [":main.c"],
  args = ["a", "\"b c\""], # excessive quoting is needed
)
cc_binary(
  name = "main4",
  srcs = [":main.c"],
  args = ["a \"b c\""], # we could say singleton lists are preferred and are splitted
)

main.c:

#include <stdio.h>
int main(int argc, char** argv) {
    for (int i = 1; i < argc; ++i) {
        printf("%d: %s\n", i, argv[i]);
    }
    return 0;
}

Testing:

$ bazel run :main -- a "b c"
1: a
2: b c
$ bazel run :main2
1: a
2: b
3: c
$ bazel run :main3
1: a
2: b c
$ bazel run :main4
1: a
2: b c

What operating system are you running Bazel on?

Linux/NixOS

What's the output of bazel info release?

release 3.3.1- (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Comes from NixOS packages nix-env -iA nixpkgs.bazel

Any other information, logs, or outputs that you want to share?

Luckily

cc_binary(
  name = "main5",
  srcs = [":main.c"],
  args = ["a \"b c\" >/dev/null"],
)

treats ">/dev/null" as string rather than redirection. Could've been an injection vulnerability otherwise

Also tested with recent git version 6bf44ba9339a59c852a5da5f72e14bdb87af6483

dmivankov commented 4 years ago

extra info

$ bazel build :main
$ bazel run --shell_executable=$(pwd)/bazel-bin/main :main2 -- "d e"
1: -c
2: SOME_PATH/execroot/__main__/bazel-out/k8-fastbuild/bin/main2 a b c 'd e'

also

BUILD file args=[" 'x' "] -> argv = ["x"]
bazel run -- " 'x' " -> argv = [" 'x' "]
dmivankov commented 4 years ago

Looks like the issue is https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java#L178 calling https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java#L429, which has a ".tokenized()" call inside.

tetromino commented 4 years ago

Not sure who owns RunfilesSupport, but this looks like a real logic bug.

jwnimmer-tri commented 4 years ago

To my reading, https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args specifically says that args undergoes Bourne shell tokenization, which explains the splitting. I don't understand what the bug is.

dmivankov commented 4 years ago

Maybe it's just a documentation issue then: https://docs.bazel.build/versions/master/be/common-definitions.html#sh-tokenization

Certain string attributes of some rules are split into multiple words according to the tokenization rules of the Bourne shell

while https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args

args | List of strings;

Intuitively, when you supply a list of strings, you don't expect split-and-combine-the-lists behavior; also, sh-tokenization is said to apply to string attributes, not to lists of strings.

ulfjack commented 4 years ago

If I recall correctly, this has come up a few times before. I agree that the behavior should be changed to avoid shell tokenization; however, this is a difficult transition. It's not enough to have a flag, because there's no way to write args so that it is compatible both with and without shell tokenization.

dmivankov commented 4 years ago

One way to migrate is

dmivankov commented 4 years ago

We also probably need a builtin quote(string_with_spaces_or_quotes): string -> string. So canonical arguments would be

args = ["--foo", "abc", quote(" x "), quote("'"), quote("\'x\'")] + tokenize("a \"b b\"") + tokenize("c d", "e f")
# meaning
argv = ["--foo", "abc", " x ", "'", "\'x\'", "a", "b b", "c", "d", "e", "f"]

and rejected arguments would be

args = ["a b", " x ", "\"x\"", "\'y\'", "\"x y\"", "a \"b\" c" ] + tokenize("a \" b")

Note that currently unquoting seems to do funny and unpredictable things

args = ["a", " \"b c\" ", " \"\"b\" c\" ", "'a ' ", "'d'", "\"e\"", "\"\"x\"\"", "\"'y'\"", "''\"''\"\"\"z\"\"\"''\"''", "''zz''", "'''xx'''"]

results in

1: a
2: b c
3: b c
4: a 
5: d
6: e
7: x
8: 'y'
9: ''z''
10: zz
11: xx
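
Incidentally, with the current behavior you can approximate the proposed quote() using bazel-skylib's shell.quote, which single-quotes a value so that the later tokenization turns it back into exactly one argv entry. A minimal sketch (assuming bazel-skylib is available; main6 is a made-up target reusing main.c from the report above):

load("@bazel_skylib//lib:shell.bzl", "shell")

cc_binary(
    name = "main6",
    srcs = [":main.c"],
    # shell.quote() wraps each value in single quotes, so Bazel's
    # Bourne-shell tokenization yields it back as a single token.
    args = [shell.quote(a) for a in ["a", "b c", " 'x' "]],
)

With today's behavior this prints the three entries unchanged, like main3 above but without hand-written escaping.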
aiuto commented 4 years ago

I'll do some thinking about this. To me, a list of args should be passed 1:1 as the arg list of the called process. The whole concept of shell tokenization is wrong here.
OTOH, I would guess there are a lot of shell genrules that might break if we change this, and Windows does not always parse well without quotes, so this is not a simple bug fix.

My hope is that we can devise a mechanism so that binaries can indicate that they parse args correctly, and thus no tokenization is needed. What you list in args is what you get.

ulfjack commented 4 years ago

@dmivankov, if I understand correctly, you're suggesting to add a flag that changes how Bazel parses args and simultaneously changes how quote and tokenize behave internally?

You didn't say that second part explicitly. If we don't change how quote and tokenize behave internally, then this would cause double-tokenization unless we can find a way to signal inline that a string should not be tokenized inside Bazel. However, the downside of this approach is that it would be confusing that tokenize doesn't actually tokenize unless the flag is set.

I think the most straightforward way to do this is to add a tag tokenize-args (alternatively, use a full attribute). This would support a migration (add tokenize-args on all rules, flip the flag to not tokenize by default, migrate all rules to explicitly tokenize using a separate method, and remove the tag).
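
As a sketch of what that migration step would look like on the original example (hypothetical: neither the tag nor the flag exists at the time of writing):

cc_binary(
    name = "main2",
    srcs = [":main.c"],
    args = ["a", "b c"],
    # Opts this target back into today's shell tokenization after the
    # default is flipped; removed once the args are migrated explicitly.
    tags = ["tokenize-args"],
)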

dmivankov commented 4 years ago

tokenize-args or even just argv=[..] that is never tokenized are also valid options.

With tokenize & quote we'd either need them to produce a tagged string/list of strings that indicates whether it is wrapped (and by which of them) or not, or indeed make them behave differently depending on the flag. Technically we could also

trybka commented 2 years ago

As another data point, this is also present in the defines attribute of cc_* rules and friends.

It was added ostensibly for copts compatibility, but defines errors out if an entry has more than one token, which makes this quite a bit less useful anyhow.
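
For illustration, based on the behavior described above (a sketch; lib is a made-up target and the exact error message may differ):

cc_library(
    name = "lib",
    srcs = ["lib.cc"],
    defines = [
        "FOO=\"a b\"",  # tokenizes to the single token FOO=a b, accepted
        "BAR=1 BAZ=2",  # tokenizes to two tokens, rejected with an error
    ],
)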

defines might be a whole lot easier to clean up, lending itself to a per-package feature or something, IDK?

gregestren commented 2 years ago

@aiuto how should we categorize this issue?

aiuto commented 2 years ago

I don't know.
This is one of those things that does not fit any of the teams. It is execution semantics, but neither local nor remote. So, either local-exec or xproduct, where some day we will re-triage it.

comius commented 1 year ago

I think the default implementation should be the obvious one. Parameters in the list should be passed to the binary without any changes.

Doing this might break others who had to work around this. So anybody fixing the implementation should also put it behind an incompatible flag.

aiuto commented 1 year ago

+1. But the obvious thing is that args should be passed as is. Going back to the examples, main2 should print

1: a
2: b c
fmeum commented 1 year ago

What do you think should be the behavior of $(execpaths ...) with this flag? It would probably be more useful if that would still result in multiple arguments.

aiuto commented 1 year ago

I think all args in a list should be passed as single strings and never retokenized by blaze. So, if there are paths in execpaths that have, for example, spaces in them, they should be expanded in a way that preserves the spaces on invocation of the command line.

fmeum commented 1 year ago

@aiuto I agree, but what about $(execpaths ...) resolving to multiple file paths (note the plural form)? In that case you would certainly want the result to be 'file1' 'file2' 'file3' on the command line rather than 'file1 file2 file3', but the $(execpaths ...) would still appear in a single args list entry.
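
A minimal sketch of that case (target and file names made up, reusing main.c from the report above):

filegroup(
    name = "data",
    srcs = ["one.txt", "two.txt"],
)

cc_binary(
    name = "main_files",
    srcs = [":main.c"],
    data = [":data"],
    # $(execpaths :data) expands to both paths in one space-separated
    # string; today's tokenization then splits it into two argv entries,
    # which is usually what you want here.
    args = ["$(execpaths :data)"],
)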

For unrelated reasons, I have thought about rewriting the LocationExpander logic to return a CustomCommandLine instead of strings. That could solve this issue.

aiuto commented 1 year ago

TBH, I'm not wrapping my head around that use case right now. That is, I don't have a mental model of when to use execpaths. I'm not sure I ever have used it. So, .... I don't know?

phst commented 10 months ago

The primary use case appears to be complex scripts in genrules, e.g. https://github.com/bazelbuild/bazel/blob/df42879021d66974ddfb74640759d727ebdf8d76/scripts/packages/debian/BUILD#L141-L151 or https://github.com/protocolbuffers/protobuf/blob/7b42f1c08b534f9e43629284e22acce6b13e7136/ruby/BUILD.bazel#L155-L174.

Ideally these would be converted to proper binaries with command-line arguments and custom actions using e.g. Args.add_all(ctx.attr.debian_files, format_each="--debian-file=%s"). That would resolve quoting/tokenization issues and obviate the need for shell quoting. So maybe we could consider any use of execpaths etc. a code smell?
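
A minimal sketch of that pattern (rule, attribute, and tool names are made up; the point is ctx.actions.args() with add_all and format_each, which avoids shell tokenization entirely):

def _package_impl(ctx):
    out = ctx.actions.declare_file(ctx.attr.name + ".deb")
    args = ctx.actions.args()
    args.add("--output", out)
    # Each input file becomes its own --debian-file=<path> argument;
    # no shell quoting or tokenization is involved at any point.
    args.add_all(ctx.files.debian_files, format_each = "--debian-file=%s")
    ctx.actions.run(
        executable = ctx.executable._packager,
        arguments = [args],
        inputs = ctx.files.debian_files,
        outputs = [out],
    )
    return [DefaultInfo(files = depset([out]))]

package_deb = rule(
    implementation = _package_impl,
    attrs = {
        "debian_files": attr.label_list(allow_files = True),
        "_packager": attr.label(
            default = "//tools:packager",
            executable = True,
            cfg = "exec",
        ),
    },
)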

fmeum commented 5 months ago

Based on the discussion so far and ignoring defines for the moment, I could see this being a reasonable path forward:

  1. Faithfully implement https://github.com/bazelbuild/bazel/blob/8ed7196aee2f312f131df58ea69f9d2b443d3f6a/src/main/java/com/google/devtools/build/lib/shell/ShellUtils.java#L90 in Starlark and make it available in bazel-skylib. Since it can't perform location expansion, it would treat $(FOO) and $(location //foo:bar) as if they were tokens that don't contain any spaces. (A simplified sketch of such a function follows after this list.)
  2. Add a new --incompatible_no_args_tokenization that, if flipped, makes it so that all args values are treated literally with no postprocessing, with the exception of those targets that are tagged with tokenize-args.
  3. Provide a migration script (buildozer command) that tags all targets with tokenize-args that have args values that could possibly be subject to tokenization.
  4. Flip the flag, offering tokenize-args as a catch-all temporary and the tokenize function in Skylib as a permanent but not 100% compatible alternative.
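
A simplified sketch of step 1 (it covers whitespace splitting, single/double quotes, and backslash escapes, but is not the exact ShellUtils semantics; names are illustrative):

def tokenize(s):
    tokens = []
    current = ""
    has_token = False  # distinguishes an empty quoted token from no token
    quote = ""         # "", "'" or '"'
    escaped = False
    for c in s.elems():
        if escaped:
            current += c
            escaped = False
        elif quote == "'":
            if c == "'":
                quote = ""
            else:
                current += c
        elif quote == '"':
            if c == '"':
                quote = ""
            elif c == "\\":
                escaped = True
            else:
                current += c
        elif c == "\\":
            has_token = True
            escaped = True
        elif c == "'" or c == '"':
            has_token = True
            quote = c
        elif c == " " or c == "\t":
            if has_token:
                tokens.append(current)
                current = ""
                has_token = False
        else:
            current += c
            has_token = True
    if quote != "" or escaped:
        fail("unterminated quotation or escape in: " + s)
    if has_token:
        tokens.append(current)
    return tokens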

@comius @lberki What do you think?

phst commented 1 month ago

Maybe in addition to the tokenize-args tag there should be a no-tokenize-args tag so that people can opt out eagerly of tokenization independent of the flag.

Or maybe tokenize-args should be a feature so that users can say repo(default_features = ['-tokenize_args']) in their REPO.bazel?

kjw-enf commented 2 weeks ago

I recently wrote a Starlark rule that treats args without $(location) expansion, without escaping being necessary (i.e. $$), and without environment variable expansion (at least not in Starlark), and I couldn't be happier.

Context: pytest, which requires a "main" wrapper calling pytest.main()

def _wrapper_pytest_impl(ctx):
    # store pytest args in an expanded template of wrapper_pytest.py
    output = ctx.actions.declare_file(ctx.attr.name)
    _flags = '\"' + '\",\n  \"'.join(ctx.attr.args) + '\"'  # quoted
    _srcs = '\"' + '\",\n  \"'.join([x.path for x in ctx.files.srcs]) + '\"'  # quoted
    ctx.actions.write(
        output = output,
        content = """# -*- coding: utf-8 -*-
# standard libraries
import os
import re
import sys
# third party libraries
import pytest
flags = [
  """ + _flags + """
]
srcs = [
  """ + _srcs + """
] 
# n.b. single dollar signs in $ENV expansion only; no shell escaping required
# TODO(kjw): this regex sucks
expanded_flags = [re.sub(r'\\$(\\S+)', lambda mo: os.environ.get(mo.group(1), ''), x) for x in flags]
if __name__ == "__main__":
    sys.exit(pytest.main(expanded_flags + srcs + sys.argv[1:]))
""",
        is_executable = True,
    )
    return [DefaultInfo(files = depset([output]))]

wrapper_pytest = rule(
    implementation = _wrapper_pytest_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "args": attr.string_list(),
        "output": attr.output(),
    },
)

Caveat: I'm a Starlark noob, so I'm sure there are better ways (e.g. with an Args object) to do this.

Yes, this means that some our_py_test(args=) usage looks funny, but it's correct if you think of Python subprocess.Popen([cmd]) usage, i.e.:

our_py_test(args=["-k=a and b"])

Add in the complication that not all CLI parsers support single-char flags with equals (i.e. -k=). Preferably one should switch to the --flag= format, which typically does work; otherwise you're left with ["-k", "a and b"], which is slightly harder to read, and people not used to a strict interpretation of args will be confused.
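
For concreteness, the two spellings being compared would look like this with the hypothetical our_py_test wrapper from above (target and file names made up):

our_py_test(
    name = "filter_eq",
    srcs = ["test_foo.py"],
    args = ["-k=a and b"],     # one literal argv entry, for parsers that accept -k=
)

our_py_test(
    name = "filter_split",
    srcs = ["test_foo.py"],
    args = ["-k", "a and b"],  # the fallback spelling described above
)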