Closed laszlocsomor closed 6 years ago
/cc @philwo
@laszlocsomor Nice screenshot. Thanks! @cushon FYI
Unless someone has an idea just from looking at the screenshot or code, we might have to run @kohsuke's File Leak Detector (http://file-leak-detector.kohsuke.org/) on the worker process and hope to get a repro. It should pinpoint where in the code that file handle is opened, which hopefully also tells us why it's not closed.
Thanks for the suggestion Philipp! I haven't seen this bug since, but leaving it at P2 because if it does show up, I'd like to deal with it.
I believe this issue has already been fixed by https://github.com/bazelbuild/bazel/commit/6a4247b10f5cf040c1a7176498bef69c75b1b286, where we explicitly specify which handles to inherit when creating process with JNI code and use that JNI function to create Java worker.
Thanks! Let's close it then.
I'm still seeing this very frequently (on multiple windows machines) on bazel 0.10.0. However, only when trying to link .so's for Android NDK APKs.
For example:
ERROR: xxx/BUILD.bazel:17:1: failed to delete output files before executing action: C:/dev/tmp/_bazel_pascal/csxukb-c/execroot/xxx/bazel-out/android-arm64-v8a-fastbuild/bin/xxx/nativedeps/xxx/libxxx.so (Permission denied)
@laszlocsomor Maybe this is caused by some android toolchain not releasing the handler?
@meteorcloudy : Without knowing more, I can only guess. Luckily though, we can also find it out! Read on.
@pmuetschard : Could you please download Process Explorer and the next time you see this error, use Process Explorer to search for processes holding open that file? You can do so by pressing Ctrl+F (in Process Explorer) to bring up the search window, entering the basename of the file (e.g. "libxxx.so"), then finding the process(es) that hold that file.
Unfortunately, process explorer would not find anything. The thing is, re-running bazel after a failure, usually allows it to "get past" the .so that failed. So it looks more like a race condition. In our project, we build quite a few libs ~20, and so it would then fail on another one if I just re-run bazel build
. I tried using -k
and searching for an so as soon as the error is printed, while bazel was still doing more work, but alas, no luck - it's as if whatever holds onto the file, releases it as soon as the failure happens. In my runs with -k
it would fail on more than one .so, too. We also extract the .so files and "re-package" them by using the extracted file as input to another android_binary
rule, so I estimate that there are nearly 100 points where this could fail. Since most of them pass, this indeed does look much more like a race than a systemic issue.
Something to note: this only happens on "incremental" builds. I.e., I've not seen this error happen on our CI build servers, which start from a clean checkout each time. However, I have seen this on multiple different Windows machines.
I tend to eagerly suspect the AntiVirus or Bit9/Carbon Black, but in this case I believe it's neither.
The fact that you only saw this error on incremental builds but not clean ones, and that Bazel shows the "failed to delete output files..." error when it fails to delete outputs from the previous build, could be explained by a background process still holding the file open. This could be a persistent worker (perhaps JavaBuilder) that holds on to the output file, either because it doesn't close the file at all, or it doesn't close it in a timely manner.
I need to find a way to repro this locally, that's what I'll do next.
@pmuetschard : Is there anything else you notice, like any patterns in the kind of changes after which you see the errors? For example, do you change dependencies of rules, add/remove rules; or do you see the error already when you do minor edits in source files and want to rebuild?
I see it for all of the above. Naturally, I see it most often for changes that cause all of the libs to be rebuilt.
I also don't think it's the AntiVirus, because I see it, too on a machine that has no anti-virus and I use a custom %TMP%
, which is excluded from Windows 10's malware scanner.
FYI, I see this when building GAPID.
Any updates on this? I see this very frequently. Always with Android native libraries. Below is a session. Note that I did nothing in between bazel invocation. Simply ran it again each time it failed:
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-arm64-v8a-fastbuild/bin/_nativedeps/daba7a8824506e39f38af05e8f9311cd.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 66.205s, Critical Path: 49.77s
INFO: 400 processes, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-arm64-v8a-fastbuild/bin/gapii/apk/nativedeps/gapii/libgapii.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 7.697s, Critical Path: 5.14s
INFO: 12 processes, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-x86-fastbuild/bin/_nativedeps/077af9bd475f3d8e6a7aac4cd66821f3.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 46.286s, Critical Path: 43.75s
INFO: 24 processes, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-x86-fastbuild/bin/gapii/apk/nativedeps/gapii/libgapii.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 4.668s, Critical Path: 2.24s
INFO: 4 processes, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-armeabi-v7a-fastbuild/bin/_nativedeps/1b627f32a9110058c42c051ea5f0a1be.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 61.789s, Critical Path: 59.14s
INFO: 10 processes, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
ERROR: C:/dev/gapid/gapii/apk/BUILD.bazel:17:1: failed to delete output files before executing action: C:/users/pascal/_bazel_pascal/cmrueb7c/execroot/gapid/bazel-out/android-armeabi-v7a-fastbuild/bin/gapii/apk/nativedeps/gapii/libgapii.so (Permission denied)
Target //:pkg failed to build
INFO: Elapsed time: 4.479s, Critical Path: 1.89s
INFO: 1 process, local.
FAILED: Build did NOT complete successfully
C:\dev\gapid>bazel build pkg
INFO: Analysed target //:pkg (0 packages loaded).
INFO: Found 1 target...
Target //:pkg up-to-date:
<stuff>
INFO: Elapsed time: 11.864s, Critical Path: 9.38s
INFO: 37 processes, local.
INFO: Build completed successfully, 39 total actions
Sorry, no updates. I tried building GAPID again today with Bazel 0.14.1, ran into some problems that I had to fix (e.g. badly configured PATH, wrong Android SDK installed) and that took up my time. I'll try looking again next Monday.
FYI, https://github.com/bazelbuild/bazel/pull/5520 will make file deletion more robust and be able to delete readonly files (together with a follow-up PR that makes use of the new code).
Those, in combination with some recent commits that make Bazel eagerly close files (https://github.com/bazelbuild/bazel/commit/59f17d6e0550bf63a0b6ef182e2d63474e058ede, https://github.com/bazelbuild/bazel/commit/09d20311d982606093ed881d779bb05a5ee70ed3, https://github.com/bazelbuild/bazel/commit/f11c6bcab87ee8927e23a23d2300900a0922616d), will hopefully solve this problem.
The JavaBuilder persistent worker sometimes holds on to files in the output tree preventing subsequent builds from overwriting those files.
This happens spuriously, but when it hits, the only solution is to kill the worker process.
The process in Process Explorer: