bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.99k stars 4.03k forks source link

Timeout execution requirement doesn't work on starlark rules and genrule #12349

Open uri-canva opened 3 years ago

uri-canva commented 3 years ago

Description of the problem / feature request:

Support timeout execution requirement on starlark rules and genrules as a way to terminate stuck / hanging actions, so that their stderr / stdout can be printed out to aid debugging.

Feature requests: what underlying problem are you trying to solve with this feature?

Without such a feature when an action hangs, it can be hard to see the output it has printed so far, especially on non-interactive systems like CI. Another option might be to stream the output of the actions like requested in https://github.com/bazelbuild/bazel/issues/7213, but that would require running actions sequentially, so it could only used when reproducing the issue.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Create a starlark rule and a genrule that never terminate, set execution_requirements = {"timeout": "5"} on the starlark rule, and tags = ["timeout:5"] on both, then run the build with --experimental_allow_tags_propagation, note that neither rule is terminated.

What operating system are you running Bazel on?

Both macOS and Linux

What's the output of bazel info release?

release 3.3.0

Have you found anything relevant by searching the web?

https://github.com/bazelbuild/bazel/issues/7766 https://github.com/bazelbuild/bazel/issues/8830

ulfjack commented 3 years ago

Local execution already enforces action timeouts where specified. However, there is no default timeout. You can add a tag "timeout:" to enforce a timeout for local execution (I think). That might be enough to cover the use case you have. It would also not be too difficult to add a flag to Bazel to set a default timeout for all locally executed actions.

I think you're proposing to add a feature to ActionExecutionInactivityWatchdog to print the stdout/stderr of currently executing actions after a certain 'timeout'. I am not sure I would do that. It would add another unrelated mechanism that doesn't seem like it carries its own weight.

uri-canva commented 3 years ago

I see the timeout execution requirement in the source, but it doesn't seem to work. I tried execution_requirements = {"timeout": "5"} on a starlark action and tags = ["timeout:5"] or tags = ["timeout=5"] on both a starlark target and a genrule target, both with and without --experimental_allow_tags_propagation.

I couldn't find any documentation on it, am I using it wrong? I'll change the issue in the meantime.

uri-canva commented 3 years ago

@jin would you mind double checking the tags on this issue? Now that it's about making the timeout execution requirement work on starlark rules it might be a bug rather than a feature.

uri-canva commented 3 years ago

@ishikhman do you know if there's a known issue with the interaction between experimental_allow_tags_propagation and timeout?

ulfjack commented 3 years ago

That is odd. Do you have a simple reproduction that we could look at?

larsrc-google commented 3 years ago

Can you show a bit more of what you're doing that's not working? A rule definition or a build log? I just ran across the timeout in connection with multiplex workers, and it certainly looks like timeouts are not treated right in either worker - the timeout is set on the process, but for workers the process lives forever, and nobody seems to call the methods that check the timeout under normal circumstances.

meisterT commented 3 years ago

Please reply, if you have a repro case and we can reopen this issue.

sushain97 commented 3 years ago

@meisterT

Here's a minimal reproduction under remote execution (bazel client v3.0.0):

BUILD.bazel:

load("sleep.bzl", "sleep")
sleep(name = "sleep")

sleep.bzl:

def _sleep(ctx):
    out = ctx.actions.declare_file("out")
    ctx.actions.run_shell(
        command = "sleep 5; touch {}".format(out.path),
        outputs = [out],
        execution_requirements = {
            "timeout": "1",
        },
    )
    return [DefaultInfo(files = depset([out]))]

sleep = rule(
    _sleep,
)

The action succeeds:

$ ./bazel build --config=remote //:sleep --experimental_remote_grpc_log=/tmp/bazel-grpc.log
INFO: Invocation ID: c5454b1d-28f6-4ea6-92d1-154cf2ba30ba
INFO: Analyzed target //:sleep (1 packages loaded, 2 targets configured).
INFO: Found 1 target...
Target //:sleep up-to-date:
  bazel-bin/out
INFO: Elapsed time: 7.529s, Critical Path: 5.25s
INFO: 1 process: 1 remote.
INFO: Build completed successfully, 2 total actions

Of course, this would happen if our remote execution implementation was faulty.

However, there's simply no timeout field in the ExecuteRequest:

$ grpc call remotebuild-storage.service.envoy:10080 google.bytestream.ByteStream/Read "resource_name: 'blobs/a7dc37d0fe4be154eca041e062b16ea147bb8831d148ae9eb29ff48606e0cb18/139'"
connecting to remotebuild-storage.service.envoy:10080
Received initial metadata from server:
date : Wed, 17 Feb 2021 06:41:00 GMT
server : envoy
x-envoy-upstream-service-time : 1
data: "\nE\n@a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615\020\217\002\022B\n@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

Rpc succeeded with OK status
$ printf "\nE\n@a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615\020\217\002\022B\n@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" | protoc --decode_raw
1 {
  1: "a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615"
  2: 271
}
2 {
  1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}

Full GRPC log: https://gist.github.com/sushain97/366d9efa18a529fb2c62cb18117d1ce0

A naive inspection of the relevant code doesn't reveal an obvious bug:

However, there are many layers of abstractions here so I'm probably missing something.

Related, I wasn't able to find any documentation for build action timeouts, whether specified as tags or execution requirements. Happy to be corrected but otherwise, this seems like a gap to fill?

sventiffe commented 3 years ago

Grooming the backlog, is "more data needed" still accurate here? Else, Chi, would you be able to triage this issue, please?

uri-canva commented 3 years ago

I can also add more information if needed.

coeuvre commented 3 years ago

Remote execution doesn't support timeout execution requirement yet. I have the entry to revisit timeout behavior for remote executor on my TODO list and will address this during that effort.

uri-canva commented 3 years ago

It doesn't work on local execution either though, that's what this issue is about.

coeuvre commented 3 years ago

Can you please share a repro case with local execution?

sluongng commented 2 years ago

@coeuvre @ulfjack I am running into this problem right now where my user have a buggy / flaky genrule that could hang the build forever. I really want to have a global flag that enforce a certain timeout duration on all actions (especially build actions) if possible.

It seems like in this issue, the solution is deviating toward using exec_requirements. After having review the execution requirement code at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/actions/ExecutionRequirements.java;l=122;drc=a2e9f6e401c3fea92a6d515083553ffefdafc24a, Im pretty confident in saying that the TIMEOUT exec_requirement is currently only applied through src/main/java/com/google/devtools/build/lib/exec/StandaloneTestStrategy.java which only apply to test action and cannot be set via tags = ["timeout:<decimal>"]

Are you guys ok with changing this into a ParseableRequirement similar to CPU?


Here is a simple reproducible


> cat BUILD
genrule(
    name = "create_a",
    outs = ["a.txt"],
    cmd = """
    sleep infinity;
    touch $@
    """,
    tags = [
        "timeout:1",
    ],
)```
linzhp commented 1 year ago

I really want to have a global flag that enforce a certain timeout duration on all actions (especially build actions) if possible.

We need this flag too. We've seen some actions taking half an hour on a CI machine, while only taking less than half a minute on a local machine. We are still not sure what's going on in those CI machines, but it would really help if we can timeout a local action so we can restart the build as early, possibly on a different machine.

sbjorkle-qualcomm commented 5 months ago

Remote execution doesn't support timeout execution requirement yet. I have the entry to revisit timeout behavior for remote executor on my TODO list and will address this during that effort.

@coeuvre, what's the status of getting support for custom action timeouts in remote action execution requirements?