bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Bazel does not notice compilation action finished and waits for it forever #15094

Open konste opened 2 years ago

konste commented 2 years ago

Description of the problem / feature request:

Occasionally (for me, about once every day or two) a Bazel build gets into a state where it runs forever and never finishes. When this happens, the console shows something like the following:

[17,773 / 17,796] Compiling .../tabui/main/calculationeditor/qtdialogs/CalculationDialogWidget.cpp; 4150s local
[17,773 / 17,796] Compiling .../tabui/main/calculationeditor/qtdialogs/CalculationDialogWidget.cpp; 4153s local
[17,773 / 17,796] Compiling .../tabui/main/calculationeditor/qtdialogs/CalculationDialogWidget.cpp; 4159s local
[17,773 / 17,796] Compiling .../tabui/main/calculationeditor/qtdialogs/CalculationDialogWidget.cpp; 4160s local

Note the ridiculously high number of seconds the compilation has supposedly been running. Looking at the produced object files, I can tell that compilation finished long ago (the object file is produced and correct), but somehow Bazel does not notice that it is done and waits for it forever. Another observation: when this happens, one must hit Ctrl+C three times to terminate the Bazel server. After the first and second time it just continues incrementing the seconds.

As you can imagine, it is a major bug because it causes CI machines to occasionally hang on such runaway builds. I wonder if there is a way to set a timeout on the compilation action to at least mitigate it and avoid a complete hang.
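
Until there is a proper fix, one possible stopgap on the CI side is an outer watchdog around the whole build command. The following is only a minimal sketch in Python and not a Bazel feature: `run_with_timeout` is a hypothetical helper name, and the time limit is an assumption to be tuned per project. It kills only the process it started, so a hung Bazel server may still need to be cleaned up separately.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s.

    Hypothetical helper: on timeout, subprocess.run kills the child process it
    started (not the whole process tree) and raises TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

A CI step could call, for example, `run_with_timeout(["bazel", "build", "//..."], 2 * 60 * 60)` and treat a `None` result as a failed, hung build.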

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We could not reproduce this problem on demand.

What operating system are you running Bazel on?

We run this build on Mac, Linux and Windows, but the bug seems to be specific to Windows.

What's the output of bazel info release?

5.0.0

Have you found anything relevant by searching the web?

I found issue https://github.com/bazelbuild/bazel/issues/4216, which manifested itself in exactly the same way, but that bug was specific to Linux and was fixed in 2017 after a Linux kernel update. I don't know how relevant it may be.

Any other information, logs, or outputs that you want to share?

I preserved java.log when it happened today and shared it on my OneDrive here. You can see plenty of errors at the end of the log, but I believe most of them were caused by me hitting Ctrl+C three times to stop the build.

meteorcloudy commented 2 years ago

It would be really helpful if you could provide a minimal reproducible case; otherwise it's almost impossible to diagnose this issue, which might not even be a Bazel bug.

konste commented 2 years ago

We have not seen this problem since we switched from Bazel 5.0.0 to 5.1.0, and we are keeping our fingers crossed that it somehow took care of it. On the other hand, we have only been on 5.1.0 for a short time, so we are keeping an eye on it.

I understand that it is hard to diagnose because it is hard to reproduce. There is no minimal repro, or any repro at all. It just randomly happens on large builds. The only idea I have is to take a process memory dump when it is in that state and then try to analyse it.

We checked that when Bazel gets into this infinite wait state, no compiler processes are running, so the process Bazel is waiting on is definitely gone.

sventiffe commented 2 years ago

@konste did you observe the issue happening again since switching to 5.1.0?

konste commented 2 years ago

@sventiffe Unfortunately yes - just once, yesterday. So I can tell that the frequency of this issue has dropped dramatically. On 5.0.0 we were seeing it every day. On 5.1.0 we got it once in two weeks, with about the same number of builds per day.

gregoryT5 commented 2 years ago

As another data point, I have also observed this issue, only on Windows, with Bazel 5.0.0, 4.2.0, and 4.0.0.

We also build for Linux using the same Bazel versions, and I've never seen this issue occur there. I don't have anything specific to reproduce it either, as it almost always gets stuck on a different build target (no observable pattern, other than "sometimes happens on large build").

meisterT commented 2 years ago

Can you please capture a bazel stack trace the next time this happens (e.g. with jstack or with pressing Ctrl+\)?

konste commented 2 years ago

First of all, I can confirm that the frequency of the issue dropped tenfold, maybe more, between Bazel 5.0.0 and Bazel 5.1.0. This is a good thing in general, but it makes getting a repro even more problematic.

Unfortunately it still happens (about once a week for us), and I would really appreciate some guidance on how to collect as much useful information as possible when we get a repro. Remember, it is on Windows.

One thing which comes to mind is to take a full process memory dump when it happens. This should provide the most data, but the size of the file would be about 20 GB in our case and I'm not even sure how to send it over.

meisterT commented 2 years ago

@konste not sure how time-consuming it is, but bisecting the commits between 5.0 and 5.1 to find the one that helped reduce this tenfold would probably tell us where to look.

konste commented 2 years ago

Two considerations prevent me from doing it right now:

  1. I anticipate such bisecting would take a long time, simply because it takes about a day to get a repro on 5.0.0 and a week on 5.1.0.
  2. Considering the issue is not fully fixed in 5.1.0, such bisecting may narrow in on a completely irrelevant change which just happened to change some timing and make the race condition occur less frequently.

Manual analysis of the changes between 5.0.0 and 5.1.0 only flags this one as a possible culprit though I don't understand how it is related.

meisterT commented 2 years ago

I wonder why you think this commit is relevant for this bug?

In any case, next time you observe it please do Ctrl+\ or jstack <bazel-server-pid> to capture a stack trace.

konste commented 2 years ago

The issue seems to be with Bazel not noticing a child process exit, and the issue I linked mentions ProcessHandle, which can be used for such detection. It is as vague as that.

Regarding Ctrl+\, are you sure it works on Windows? Which process is it going to capture the stack trace from? Where do I find it after it is captured? You don't have to answer if you don't immediately know - I will research it myself.

meisterT commented 2 years ago

Right, Ctrl+\ doesn't seem to work on Windows, but jstack with the server pid does. It spits out the stack trace to the console, so redirect it to a file. This will contain info about what Bazel is doing and hopefully help us to figure out where it hangs.
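
For anyone who wants to script the capture on a CI machine, here is a rough Python sketch of the "run jstack and redirect it to a file" step. It assumes `jstack` from a JDK is on the PATH and that the Bazel server PID is already known (for example from Task Manager); `capture_stack_trace` and its parameters are hypothetical names chosen for illustration.

```python
import subprocess


def capture_stack_trace(server_pid, out_path, jstack_cmd=("jstack",)):
    """Run `jstack <server_pid>` and write its console output to out_path.

    Assumptions: jstack is on the PATH and server_pid is the Bazel server's
    PID. jstack_cmd can be overridden, e.g. with a full path to jstack.exe.
    Returns the exit code of the jstack process.
    """
    cmd = [*jstack_cmd, str(server_pid)]
    with open(out_path, "w", encoding="utf-8") as out:
        # Send both stdout and stderr to the file so error messages are kept too.
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT)
    return result.returncode
```

Calling it a few times in a row, with different output file names, gives several dumps of the same hang to compare.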

konste commented 2 years ago

@meisterT Here is the jstack output at the time of the repro: jstack1.txt. The server process minidump is here.

konste commented 2 years ago

Second repro: jstack2.txt

konste commented 2 years ago

Third repro: jstack3.txt

WilliamVenner commented 2 years ago

Same issue on 5.1.1

gregoryT5 commented 1 year ago

@konste, are you by any chance using MSVC to compile?

konste commented 1 year ago

Yes, naturally. Although I have not seen this issue happening for quite some time.

gregoryT5 commented 1 year ago

Ok, so this might be relevant for you in case it comes up again. I was running into a problem where vctip.exe (aka "Microsoft® VC compiler and tools experience improvement data uploader") was becoming an unstoppable zombie process. It gets launched by cl.exe for some reason, and sometimes would just get into this bad state.

Other people have encountered this problem in the past and found that it worked to simply delete vctip.exe. So I tried doing that about three months ago, and haven't seen the issue since.
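
If it helps anyone automate this on build agents, below is a small Python sketch that looks for vctip.exe under a given directory. It is an illustration only: `disable_vctip` is a made-up helper name, the directory to scan is whatever your MSVC install root happens to be, and it renames the file instead of deleting it so the change can be undone.

```python
from pathlib import Path


def disable_vctip(root, dry_run=True):
    """Find every vctip.exe under root; rename each one unless dry_run is True.

    Hypothetical helper. Renaming to 'vctip.exe.disabled' is a reversible
    stand-in for deleting the file. Returns the sorted list of paths found.
    """
    found = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.lower() == "vctip.exe"
    )
    if not dry_run:
        for p in found:
            p.rename(p.with_name(p.name + ".disabled"))
    return found
```

Running it with the default `dry_run=True` first only lists the matches, which is a reasonable check before changing anything in a compiler installation.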

konste commented 1 year ago

Aha! Thank you @gregoryT5 for the tip! I have noticed occasional zombification of vctip.exe, typically when I try to delete the folder and cannot because it is locked by this process frozen in there. For this reason I routinely delete vctip.exe when I see it and that's probably why I stopped seeing this issue with Bazel. The more you know...

aaomidi commented 2 months ago

I think I'm running into this on GitHub Actions CI builders, on Linux (ARM specifically) builds for Go, with Bazel 7.3.0.