GitHub Actions step is hanging until timeout

Lastique commented 3 years ago

Describe the bug

At some point, a GitHub Actions step stops producing output and hangs until the workflow is terminated by timeout.

You can see the output stops being produced if you actively watch the job output in the browser. If you try opening an already hanging job in the browser, the step output is not displayed at all: the ">" sign next to the step is not visible and the icon next to is is a spinning yellow circle. The output becomes available once the workflow is terminated.

Here is an example of one such run. Here is the log with timestamps:

boost_log_gha_hang.txt.gz

I'm sure the processes running in the GHA step are not hanging (this is just a C++ compilation process, which does succeed normally).

This problem seems to be not unique to my setup, there are other reports as well. In my case it happens most often with macos-10.15 runner, but I've also seen it happen with ubuntu-20.04 runner.

To Reproduce

At the moment, this reproduces quite reliably by running the workflow on my Boost.Log repository (git revision boostorg/log@2f058c6d8c3867471d0973b6f57127b5076a4f7f). I intend to try working around the problem (e.g. by disabling the offending runner), so this may not stay reproducible in later revisions.

Expected behavior

The runner should not hang.

Runner Version and Platform

I'm using the free runners hosted by GitHub.

Lastique commented 3 years ago

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

phadej commented 2 years ago

I have seen this lately consitently, e.g.

https://github.com/haskell/aeson/runs/4149850593?check_suite_focus=true is one job where the step doesn't take long (compare to other compiler versions).

This is very frustrating, also knowing that a runner spins there for 6 hours not doing anything meaningful.

phadej commented 2 years ago

Another job https://github.com/haskell/aeson/runs/4178377747?check_suite_focus=true these seem to fail always at the same spot, which is good as the failure is somewhat deterministic. But I have no idea what it may be. I added a memory limit compiler could use (quite conservative), but that doesn't seem to help. Sometimes job succeed sometimes they don't. I still suspect that it is some limit jobs hit, and then they are lost.

phadej commented 2 years ago

I tried out my out-of-memory hypothesis: When the memory gets used the whole job becomes a lot slower and unresponsive:

When I removed the runtime system memory limit the job failed

https://github.com/haskell-CI/haskell-ci/runs/4180347133?check_suite_focus=true

In these cases the job however didn't timeout, but it was unresponsive and logs weren't streamed.

Can high memory usage cause slowdown of the job? Is there anyway to get analytics of how much resources job runners have used?

phadej commented 2 years ago

This problem started to happening quite a lot: e.g.

Previously these jobs didn't have any problems.

chris-griffin commented 2 years ago

One observation is that while the timed-out step will not show any log detail in the Action GUI (it will continue to render as in-progress/yellow), detailed logs of that step are available via the gear icon => "View raw logs".

nikola-jokic commented 2 years ago

Hi everyone,

I will close this issue now since @chris-griffin explained how can you view raw logs. Thank you @chris-griffin!

@phadej, if you are still seeing this issue, and if you are running your workflows on your self-hosted runner, could you double-check if you have enough computing power? It might be due to that. If you cannot find what is causing this issue, please submit a new issue, and all the relevant information so we can re-produce it and see what is causing you this trouble.

phadej commented 2 years ago

@nikola-jokic all jobs above were run on GitHub provided machines.

I don't agree that this issue is resolved. Jobs timeout for non-obvious reasons, and I cannot verify that raw logs would told the reason as they are not available anymore for those jobs. Bummer.

nikola-jokic commented 2 years ago

Hi @phadej, I can take a look, no problem. But while looking at your workflows, the logs are expired. That is why I asked if you might file an issue so we can take a look. Maybe you can suggest how to improve logging to help you with that issue as well?

jeacott1 commented 2 years ago

I'm having a lot of grief recently with this issue. mostly its hung at "Job is about to start running on the hosted runner: Hosted Agent (hosted)" - for hours!

nikola-jokic commented 2 years ago

Hi @jeacott1,

If the job is about to start, and is not started and hanging, the job is not landed on the Runner. It might be due to the resolved incident.

If the issue persist, please post your feedback on the GitHub Community Support Forum which is actively monitored? Using the forum ensures that we route your problem to the correct team. :relaxed:

adarkbandwidth commented 2 years ago

We are seeing this exact issue as well. I know this issue is closed, but no attempt was made to fix it here. Frustrating.

nikola-jokic commented 2 years ago

Hi @adarkbandwidth,

Did you post this issue to the GitHub Community Support Forum? Since the workflow did not land on the runner, the issue does not belong to this repository. That is the reason why it is closed.

Please, post your feedback to the Community Support Forum :relaxed:

bryanmacfarlane commented 2 years ago

There is likely not an bug/issue to fix.

As mentioned above, a good thing to do is set a timeout on the job. If it typically takes 5 minutes for example in normal conditions, setting a timeout of 20 will help free the concurrency slot (self hosted machine or cost if hosted) in the event of a hang.

Outside of getting the raw logs (which may or may not have more details - depends on the output of the process the step is running), you can also try to run on a self hosted VM (just for debugging) that you have access to (can log in and watch resources etc). It's a good debugging technique. You can even install the runner on your dev desktop. Note that might not match the resources of the hosted VM. If you want a VM that ~matches characteristics, you can get an azure DS2V2 VM with 2 cores.

There is a potential feature to get better visibility into what's happening on the hosted machine. Debug into the hosted machine/runner etc. we could also expose perf counters, get a dump of a process, etc. out of disk space detection is another potential issue / feature. That has been discussed in and is on a features back log but not bug level work. We could even have the runner detect no output after xxx time but even then it doesn't fix anything - it's just a more efficient timeout option.

A clarification of typical issues:

test hang / deadlock: the runner runs tests which runs your tests. if the tests deadlock, the test process just hangs, no output is created. the tests might not hang on your desktop but on another machine / VM with different I/O characteristics, it could expose the fragile code and deadlock. This could also happen in any process you supply for the runner to run. There is nothing the runner can do here. It's waiting for output and a return code.
resource issues OOM: the runner runs a step which runs the process / cmdlines you provide. if those processes use too much memory it can starve other processes on the machine. on linux we run your steps in a cgroup to ensure you can only use 80% of the memory and not starve the runner of resources. The linux OOM killer will kill your process and it will fail. On other OSes there isn't that capability and a runaway process using too much resources can actually starve the runner of memory, cpu, i/o etc
out of disk space: if the processes you supply in steps run the machine out of disk space, it can be challenging to figure out what's going on as the runner cannot even create logs to get them up to the service. As mentioned above, there is a potential feature to monitor resources and get better visibility. In this case, we could monitor when x% is reached and add a warning to the build. That falls under feature work mentioned above. Note that this won't lead to a hang that's mentioned here. The runner maintains a heartbeat (renews a lock) to the service and if it "goes away" the server will abandon the job. So that's likely not your issue.

Hope that helps clarifies many of the common issues, challenges and potential features.

adarkbandwidth commented 2 years ago

Thank you for this detailed response. We're focusing on working through these issues. We're not using a hosted runner so we can't observe behavior outside of the Github logs (which are missing in this case), but we can observe the outputs for some steps.

adarkbandwidth commented 2 years ago

FWIW, we realized that when we cancel the action, we do then get logs from the hosted runner, and we're able to troubleshoot from there. Our code is hanging—not GHA.

fringedgentian commented 1 year ago

We had this same issue and it was due to including apache commons 2.9 or later that caused it. We reverted to 2.7 and the test ran without issue.

implementation 'commons-io:commons-io:2.9.0'

prein commented 1 year ago

Would it be a good idea (if possible) to implement something like an inactivity / idle timeout in addition to the already implemented "general" timeout? Imagine a job that is expected to run for 20 minutes on average. I set the timeout to 25 min. Even if the job hangs in the first minute of execution it will be killed only after 25 minutes. If there is idle timeout that I can set to 5 min, then if the job hangs in 1st min of the run, it will be killed in the 6th minute saving 19 min in my example.

prein commented 1 year ago

Compare https://support.circleci.com/hc/en-us/articles/360007188574-Build-has-Hit-Timeout-Limit

MhdSadd commented 11 months ago

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

@Lastique how did you fix the hanging job without getting a log about why it might be hanging? I have a nestjs-digitalocean CI which hangs on the build command until it timeout, same thing as everyone, no log except for this in the raw log

2023-11-23T23:12:57.3718126Z Waiting for a runner to pick up this job...
2023-11-23T23:12:57.8616058Z Job is waiting for a hosted runner to come online.
2023-11-23T23:13:00.7732105Z Job is about to start running on the hosted runner: GitHub Actions 2 (hosted)

locally it builds fine on my Mac ios Ventura 13.6.1 (22G313)

Lastique commented 11 months ago

@Lastique how did you fix the hanging job without getting a log about why it might be hanging?

I don't remember the details at this point, but I was able to debug the problem. I think, the log was visible if I was monitoring the job from its start, or after it was cancelled.

That doesn't seem like your case though. In your case the job doesn't start for whatever reason.

Mac OS runners are a scarce resource on GHA, you do sometimes have to wait for a long time for a runner to become available. Again, I'm not sure if that's what happens in your case.

MhdSadd commented 11 months ago

@Lastique so my action is actually configured to run ubuntu-latest, and it was running fine before, recently though it starts the job and goes through the steps until it gets to the build step (this is the step it hangs on, until it timeout). The CI is meant to clone and deploy nestjs microservice via SSH to digitalocean.

MhdSadd commented 11 months ago

Turns out my own issue is memory-related, using Digitalocean's droplet to deploy several microservices, turns out each time we try to deploy the service and we're low on memory the OOM killer terminates the process but the action keeps trying until it times out. Increasing our resources and adding a swap memory fixed it. My question though is, is there not a better way for the action to notify us after the trials or throw some sort of out of memory error in its log.

talregev commented 9 months ago

Still happen.

dta-bdai commented 1 week ago

This problem has happened to us recently when running bazel build inside a docker container. Github runner image: Ubuntu 22.04, version 20241006.1.0.

The build finished successfully, the container and its network were cleaned up properly, but it took a very long time for Github to realize the completion. If I build just a part of our repo, it takes 14m10s to finish, but the yellow circle keeps spinning for another 18m until it turns green (Total wait time is ~33m). If I build the whole repo, it keeps spinning and fails due to timeout (60m), although the build succeeded much earlier.

I have made sure bazel terminated gracefully with bazel shutdown and used ps -aux to confirm. I've also used tmate to ssh into the container and made sure no network nor disk activities happen before the container was cleaned up. I have also tried to limit bazel resources to 50% RAM with only 2 jobs. Nothing works.

Attached are some screenshots related to the issue we're facing. We’ve exhausted our troubleshooting ideas and would greatly appreciate any assistance. Thank you!

Job completed, container stopped, but wheel was still spining:

Verify no strange processes were running. Bazel was shut down:

Job finished in 14m10s, but we had to wait for another 18m for Github to notice:

actions / runner

GitHub Actions step is hanging until timeout #1326

Runner Version and Platform