Open moroten opened 6 months ago
@moroten Can you please create a minimal reproducible case so that we can debug into this more easily?
I suspect this happens when a download is allowed due to memory pressure. I can give it a try, probably next week.
I tried aborting downloads from the repository server side and also sending invalid .tar.gz files, but that does not trigger the code path.
Looking at the profile, I can see that there is a minor GC finishing 5 ms before each of the cancelled Starlark functions. What in the GC is triggering the cancellation? What is actually cancelled and how?
--experimental_worker_for_repo_fetching was enabled by default in Bazel 7.1.0, which explains why we haven't seen it previously.
There are no memory pressure tests with --experimental_worker_for_repo_fetching in the code base and I did not create any either in the end.
Looking more at the profiles above, it looks like the download manager is not cancelled when the repository fetch is cancelled. The fetch even starts after the caller has been cancelled: https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/downloader/DownloadManager.java#L130
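To illustrate the suspicion above, here is a minimal sketch using plain java.util.concurrent. It is not Bazel's actual code: OrphanedDownloadDemo, fetchPool and downloadPool are hypothetical names, and the assumption is only that the fetch and the download run on separate executors. Cancelling the outer future interrupts the thread waiting on the download, but the download task itself keeps running unless it is cancelled explicitly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class OrphanedDownloadDemo {
  /** Returns true if the inner "download" still completes after the outer "fetch" is cancelled. */
  static boolean downloadSurvivesFetchCancel() throws Exception {
    // Hypothetical stand-ins for the repo-fetching worker and the downloader's own threads.
    ExecutorService fetchPool = Executors.newSingleThreadExecutor();
    ExecutorService downloadPool = Executors.newSingleThreadExecutor();
    CountDownLatch downloadStarted = new CountDownLatch(1);
    CountDownLatch letDownloadFinish = new CountDownLatch(1);
    CountDownLatch downloadDone = new CountDownLatch(1);
    try {
      Future<Boolean> fetch = fetchPool.submit(() -> {
        Future<Boolean> download = downloadPool.submit(() -> {
          downloadStarted.countDown();
          letDownloadFinish.await(); // stands in for network I/O
          downloadDone.countDown();
          return true;
        });
        // Only this wait is interrupted when the fetch is cancelled;
        // nothing here propagates the cancellation to `download`.
        return download.get();
      });

      downloadStarted.await();
      fetch.cancel(true);            // cancel the caller
      letDownloadFinish.countDown(); // the download is still free to proceed
      return downloadDone.await(5, TimeUnit.SECONDS);
    } finally {
      fetchPool.shutdownNow();
      downloadPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("download completed after cancel: " + downloadSurvivesFetchCancel());
  }
}
```

If the real code behaves like this sketch, a cancelled fetch could leave a download writing into the repository directory while a retry starts.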
The handling of CancellationException talks about memory pressure as the reason, in which case a recursive reattempt is made. Is this the reason why all the //external: ... -> _http_archive_impl -> download_and_extract calls end up in new threads?
https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java#L192-L200
I don't see why, because the code below together with result = workerFuture.get(); looks okay.
https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java#L147-L161
@Wyverald Do you know from where the CancellationException is raised?
Thanks for the debugging work! And sorry for the delay -- I can answer this question specifically:
@Wyverald Do you know from where the CancellationException is raised?
On high memory pressure, this logic triggers: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/skyframe/HighWaterMarkLimiter.java;drc=d8c27bfcd37a74dfbf1bdb9a1e3df13af8360a01;l=97
Which eventually calls close() on the state object: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/RepoFetchingSkyKeyComputeState.java;drc=d8c27bfcd37a74dfbf1bdb9a1e3df13af8360a01;l=92
Which would then cause a CancellationException to be thrown when we call future.get() here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java;drc=11f0620ffa5a33ebe8a90f8ccb3a71a661806a45;l=185
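The last step of that chain can be sketched with the standard library alone. This is a hedged illustration, not the Bazel implementation: CancelDemo is a hypothetical class, and the direct cancel(true) call is assumed to stand in for whatever close() on the state object ends up doing to the worker future. Once a future has been cancelled, a caller blocked in or entering get() sees a CancellationException.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancelDemo {
  /** Returns the name of the exception observed by the caller of get(). */
  static String cancelAndGet() throws Exception {
    ExecutorService worker = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    try {
      Future<String> workerFuture = worker.submit(() -> {
        started.countDown();
        Thread.sleep(60_000); // stands in for a long-running repository fetch
        return "fetched";
      });

      started.await();
      // Assumption: this is the effect of close() being called under memory pressure.
      workerFuture.cancel(true);

      try {
        workerFuture.get();
        return "no exception";
      } catch (CancellationException e) {
        return "CancellationException";
      }
    } finally {
      worker.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(cancelAndGet()); // prints "CancellationException"
  }
}
```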
@moroten Could you check whether this is fixed by https://github.com/bazelbuild/bazel/pull/22748?
Unfortunately, it did not work (see #22748).
Description of the bug:
Occasionally, we observe that download_and_extract fails.
The attached profile shows that the download first fails and then all 16 download threads retry in parallel, so some of them cannot clean up properly. All processes 1691-1738 look the same: lots of retries.
We have in other cases also observed temporary download directories inside a fetched external repository (temp1234...), sometimes one and sometimes ten directories. This results in a cache miss when using a glob which picks up the temporary download directories.
Which category does this issue belong to?
External Dependency
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
It is a race condition. I haven't been able to reproduce it locally, but I see in the logs that it happens daily.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.1.0rc2 (patched)
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
Some internal patches unrelated to repository handling.
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No
Any other information, logs, or outputs that you want to share?
No response