bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.04k stars 4.03k forks source link

Multithreaded download_and_extract (http_archive) buggy on retry #21773

Open moroten opened 6 months ago

moroten commented 6 months ago

Description of the bug:

Occationally, we observe that download_and_extract fails with

ERROR: /.../WORKSPACE:40:13: fetching http_archive rule //external:redacted: java.io.IOException: /.../external/redacted (Directory not empty)

Looking at the attached profile, shows that the download first fails and then all 16 download threads are retrying in parallel, resulting in some of them not being able to clean up properly. All processes 1691-1738 look the same, lots of retries.

download_and_extract_zoom download_manager

We have in other cases also observed temporary download directories inside a fetched external repository (temp1234...), sometimes one and sometimes ten directories. This results in cache miss when using a glob which picks up the temporary download directories.


.../external/redacted$ ls -la
drwxr-xr-x    Mar 18 10:52 .
drwxr-xr-x    Mar 18 10:55 ..
-rwxr-xr-x    Mar 18 10:52 BUILD.bazel
drwxr-xr-x    Mar 18 10:52 something
drwxr-xr-x    Mar 18 10:50 temp9248233320309694695
-rwxr-xr-x    Mar 18 10:52 WORKSPACE
.../external/redacted$ stat temp9248233320309694695
  File: temp9248233320309694695
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 16384054    Links: 2
Access: (0755/drwxr-xr-x)  Uid: ( )   Gid: ( )
Access: 2024-03-18 10:52:47.218611936 +0100
Modify: 2024-03-18 10:50:04.623888785 +0100
Change: 2024-03-18 10:50:04.623888785 +0100
Birth: 2024-03-18 10:50:04.623888785 +0100
.../external/redacted$ stat .
  File: .
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 16384333    Links: 4
Access: (0755/drwxr-xr-x)  Uid: ( )   Gid: ( )
Access: 2024-03-18 10:52:47.214611868 +0100
Modify: 2024-03-18 10:52:47.210611802 +0100
Change: 2024-03-18 10:52:47.210611802 +0100
Birth: 2024-03-18 10:49:38.239446336 +0100

Which category does this issue belong to?

External Dependency

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

It is a race condition, haven't been able to reproduce locally but I see in the logs that it happens daily.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.1.0rc2 (patched)

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Some internal patches unrelated to repository handling.

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

No response

meteorcloudy commented 6 months ago

@moroten Can you please create a minimal reproducible case so that we can debug into this more easily?

moroten commented 6 months ago

I suspect this comes when a download is allowed due to memory pressure. I can make a try, probably next week.

moroten commented 5 months ago

I tried to abort downloads from repository server side and also send invalid .tar.gz files, but that does not trigger the code path.

Looking at the profile, I can see that there is a minor GC finishing 5 ms before each of the cancelled Starlark functions. What in the GC is triggering the cancellation? What is actually cancelled and how?

--experimental_worker_for_repo_fetching was enabled by default in Bazel 7.1.0 which explains why we haven't seen it previously.

There are no memory pressure tests with --experimental_worker_for_repo_fetching in the code base and I did not create any either in the end.

Looking more at the profiles above, it looks like the download manager is not cancelled when the repository fetch is cancelled. The fetch even starts after the caller has been cancelled: https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/downloader/DownloadManager.java#L130

The handling of CancellationException talks about memory pressure as the reason in which case a recursive reattempt is made. Is this the reason why all the //external: ... -> _http_archive_impl -> download_and_extract ends up in new threads? https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java#L192-L200

I don't see why because the code below together with result = workerFuture.get(); looks okay. https://github.com/bazelbuild/bazel/blob/9b39ccaa33069c9f5688bef477abcd75e4378f04/src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java#L147-L161

moroten commented 5 months ago

@Wyverald Do you know from where the CancellationException is raised?

Wyverald commented 5 months ago

Thanks for the debugging work! And sorry for the delay -- I can answer this question specifically:

@Wyverald Do you know from where the CancellationException is raised?

On high memory pressure, this logic triggers: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/skyframe/HighWaterMarkLimiter.java;drc=d8c27bfcd37a74dfbf1bdb9a1e3df13af8360a01;l=97

Which eventually calls close() on the state object: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/RepoFetchingSkyKeyComputeState.java;drc=d8c27bfcd37a74dfbf1bdb9a1e3df13af8360a01;l=92

Which would then cause a CancellationException to be thrown when we call future.get() here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java;drc=11f0620ffa5a33ebe8a90f8ccb3a71a661806a45;l=185

fmeum commented 3 months ago

@moroten Could you check whether this is fixed by https://github.com/bazelbuild/bazel/pull/22748?

moroten commented 3 months ago

Unfortunately, it did not work (see #22748).