bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Spurious breakage in Gerrit CI after upgrading from 7.0.0rc2 to 7.0.0rc3 #20161

Open davido opened 11 months ago

davido commented 11 months ago

Description of the bug:

Gerrit Code Review is in the process of upgrading to Bazel 7.0.0.

All was fine after the upgrade to 7.0.0rc2, see: [1].

However, after upgrading to 7.0.0rc3, we started to see this breakage on our CI:

https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-chrome-latest/40214/console

INFO: Invocation ID: 93ef2f32-774e-40ce-b58d-24dd7a30b758
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /home/jenkins/workspace/Gerrit-verifier-chrome-latest/gerrit/.bazelrc:
  'build' options: --noenable_bzlmod --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --incompatible_strict_action_env --announce_rc
Computing main repo mapping: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //tools/maven:gen_api_install (1 packages loaded, 0 targets configured)
Analyzing: target //tools/maven:gen_api_install (1 packages loaded, 0 targets configured)
[0 / 1] [Prepa] BazelWorkspaceStatusAction stable-status.txt
INFO: Analyzed target //tools/maven:gen_api_install (1 packages loaded, 1 target configured).
[368 / 527] Executing genrule @jgit//org.eclipse.jgit:jgit; 1s remote-cache, linux-sandbox
[369 / 527] [Prepa] Compiling Java headers external/jgit/org.eclipse.jgit.ssh.apache/libssh-apache-hjar.jar (53 source files)
ERROR: /home/jenkins/workspace/Gerrit-verifier-chrome-latest/gerrit/java/com/google/gerrit/jgit/BUILD:3:13: Compiling Java headers java/com/google/gerrit/jgit/libjgit-hjar.jar (1 source file) failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: cf3b2439c36619f2b6aaadddc55f15ddfd0c96566d22c1c507823ca74ac09732/127311204 for bazel-out/k8-fastbuild/bin/external/rules_java_builtin/toolchains/platformclasspath.jar
ERROR: /home/jenkins/workspace/Gerrit-verifier-chrome-latest/gerrit/java/com/google/gerrit/jgit/BUILD:3:13: Building java/com/google/gerrit/jgit/libjgit.jar (1 source file) failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: cf3b2439c36619f2b6aaadddc55f15ddfd0c96566d22c1c507823ca74ac09732/127311204 for bazel-out/k8-fastbuild/bin/external/rules_java_builtin/toolchains/platformclasspath.jar
Target //tools/maven:gen_api_install failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.088s, Critical Path: 4.95s
INFO: 24 processes: 23 internal, 1 linux-sandbox.
ERROR: Build did NOT complete successfully
bazelisk failed to build gen_api_install. Use VERBOSE=1 for more info
Build step 'Execute shell' marked build as failure
Finished: FAILURE

If I downgrade to 7.0.0rc2, the build is successful again: [1]

[1] https://gerrit-review.googlesource.com/c/gerrit/+/391534

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I cannot currently reproduce the problem locally ;-(

This command is invoked on the CI:

  tools/maven/api.sh install

It creates a shell script and invokes it to publish the Plugin API artifacts to the local Maven repository.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.0.0rc3

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

All is fine on Bazel 7.0.0rc2. I am unable to reproduce the problem locally and thus cannot bisect.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum commented 11 months ago

Could you test with --noreuse_sandbox_directories? That's my best guess without a bisect.

davido commented 11 months ago

> Could you test with --noreuse_sandbox_directories? That's my best guess without a bisect.

Unfortunately, with this option the error is still present. Also, after I downgraded to 7.0.0rc2 (from 7.0.0rc3), it is still failing.

davido commented 11 months ago

I also added the "-s" option, which produced this verbose output:

https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-chrome-latest/40258/console

[...]
# Configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042
# Execution platform: @local_config_platform//:host
SUBCOMMAND: # //java/com/google/gerrit/git/testing:testing [action 'Building java/com/google/gerrit/git/testing/libtesting.jar (3 source files)', configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042, execution platform: @local_config_platform//:host, mnemonic: Javac]
(cd /home/jenkins/.cache/bazel/_bazel_jenkins/67bba20af71044f1eb598ecb44098f26/execroot/gerrit && \
  exec env - \
    LC_CTYPE=en_US.UTF-8 \
    PATH=/home/jenkins/.cache/bazelisk/downloads/bazelbuild/bazel-7.0.0rc2-linux-x86_64/bin:/usr/lib/jvm/java-11-openjdk-amd64/bin:/usr/lib/jvm/java-11-openjdk-amd64/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  external/remotejdk21_linux/bin/java '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.model=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.processing=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.resources=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.lang=ALL-UNNAMED' '-Dsun.io.useCanonCaches=false' -XX:-CompactStrings -Xlog:disable '-Xlog:all=warning:stderr:uptime,level,tags' -jar external/remote_java_tools/java_tools/JavaBuilder_deploy.jar @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/git/testing/libtesting.jar-0.params @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/git/testing/libtesting.jar-1.params)
# Configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042
# Execution platform: @local_config_platform//:host
SUBCOMMAND: # //java/com/google/gerrit/jgit:jgit [action 'Building java/com/google/gerrit/jgit/libjgit.jar (1 source file)', configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042, execution platform: @local_config_platform//:host, mnemonic: Javac]
(cd /home/jenkins/.cache/bazel/_bazel_jenkins/67bba20af71044f1eb598ecb44098f26/execroot/gerrit && \
  exec env - \
    LC_CTYPE=en_US.UTF-8 \
    PATH=/home/jenkins/.cache/bazelisk/downloads/bazelbuild/bazel-7.0.0rc2-linux-x86_64/bin:/usr/lib/jvm/java-11-openjdk-amd64/bin:/usr/lib/jvm/java-11-openjdk-amd64/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  external/remotejdk21_linux/bin/java '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.model=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.processing=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.resources=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.lang=ALL-UNNAMED' '-Dsun.io.useCanonCaches=false' -XX:-CompactStrings -Xlog:disable '-Xlog:all=warning:stderr:uptime,level,tags' -jar external/remote_java_tools/java_tools/JavaBuilder_deploy.jar @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/jgit/libjgit.jar-0.params @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/jgit/libjgit.jar-1.params)
# Configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042
# Execution platform: @local_config_platform//:host
SUBCOMMAND: # //java/com/google/gerrit/acceptance/config:config [action 'Building java/com/google/gerrit/acceptance/config/libconfig.jar (7 source files) and running annotation processors (AutoAnnotationProcessor, AutoValueProcessor, AutoOneOfProcessor)', configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042, execution platform: @local_config_platform//:host, mnemonic: Javac]
(cd /home/jenkins/.cache/bazel/_bazel_jenkins/67bba20af71044f1eb598ecb44098f26/execroot/gerrit && \
  exec env - \
    LC_CTYPE=en_US.UTF-8 \
    PATH=/home/jenkins/.cache/bazelisk/downloads/bazelbuild/bazel-7.0.0rc2-linux-x86_64/bin:/usr/lib/jvm/java-11-openjdk-amd64/bin:/usr/lib/jvm/java-11-openjdk-amd64/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  external/remotejdk21_linux/bin/java '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.model=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.processing=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.resources=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.lang=ALL-UNNAMED' '-Dsun.io.useCanonCaches=false' -XX:-CompactStrings -Xlog:disable '-Xlog:all=warning:stderr:uptime,level,tags' -jar external/remote_java_tools/java_tools/JavaBuilder_deploy.jar @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/acceptance/config/libconfig.jar-0.params @bazel-out/k8-fastbuild/bin/java/com/google/gerrit/acceptance/config/libconfig.jar-1.params)
# Configuration: f5d72005e5d4b70683fdbd12ff2cbfb779fc730d4f37f289f17efea5d0e4d042
# Execution platform: @local_config_platform//:host
ERROR: /home/jenkins/workspace/Gerrit-verifier-chrome-latest/gerrit/java/com/google/gerrit/jgit/BUILD:3:13: Compiling Java headers java/com/google/gerrit/jgit/libjgit-hjar.jar (1 source file) failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 5cb087fa259562b09dfdb79380f82501849de07f77ea3eb52941303af7532e7e/138756716 for bazel-out/k8-fastbuild/bin/external/rules_java_builtin/toolchains/platformclasspath.jar
ERROR: /home/jenkins/.cache/bazel/_bazel_jenkins/67bba20af71044f1eb598ecb44098f26/external/jgit/org.eclipse.jgit.http.server/BUILD:5:13: Building external/jgit/org.eclipse.jgit.http.server/libjgit-servlet-class.jar (35 source files) failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 5cb087fa259562b09dfdb79380f82501849de07f77ea3eb52941303af7532e7e/138756716 for bazel-out/k8-fastbuild/bin/external/rules_java_builtin/toolchains/platformclasspath.jar
Target //tools/maven:gen_api_install failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 8.404s, Critical Path: 7.11s
INFO: 24 processes: 23 internal, 1 linux-sandbox.
ERROR: Build did NOT complete successfully
bazelisk failed to build gen_api_install. Use VERBOSE=1 for more info
Build step 'Execute shell' marked build as failure
Finished: FAILURE

fmeum commented 11 months ago

@tjgq Do you have an idea?

fmeum commented 11 months ago

@bazel-io flag

tjgq commented 11 months ago

@davido Do I understand it correctly that you're building with a disk cache, but not with a remote cache? Is this build clean or incremental? Do you have any sort of process that removes entries from the disk cache between builds?

keertk commented 11 months ago

@bazel-io fork 7.0.0

coeuvre commented 11 months ago

From the CI log, it seems like you are using remote cache and these errors were caused by remote cache eviction. Can you check whether adding flag --experimental_remote_cache_eviction_retries=5 resolves the issue?

meisterT commented 11 months ago

@coeuvre how can this happen just with the local disk cache? Race between multiple workers?

coeuvre commented 11 months ago

I think they are using remote cache. The flag was passed with env:

[EnvInject] - Injecting as environment variables the properties content 
BAZEL_OPTS=--remote_cache=https://gerrit-ci.gerritforge.com/cache

Also, xxx remote cache hit indicates remote cache. For disk cache it would be xxx disk cache hit.

davido commented 11 months ago

First of all we are using a combination of RBE and local build.

Some stuff we can only test locally. The failing part is built locally on GCP-machines.

We have both options, disk cache and remote cache, see, e.g.

BAZEL_OPTS=--remote_cache=https://gerrit-ci.gerritforge.com/cache

However, we have this hidden logic on the CI side to take remote cache out of the picture, if .bazelversion file was changed:

if git show --diff-filter=AM --name-only --pretty="" HEAD | grep -q .bazelversion
then
  export BAZEL_OPTS=""
fi

This is the part of the CI that was failing:

https://gerrit.googlesource.com/gerrit-ci-scripts/+/refs/heads/master/jenkins/gerrit-bazel-build.sh#35

bazelisk build $BAZEL_OPTS plugins:core release api

@lucamilanesio Are you aware of any cache evictions on the remote cache side recently?

davido commented 11 months ago

So, to verify that the remote cache contributes to the problem, I upgraded the Bazel version (again) from 7.0.0rc2 to 7.0.0rc3 and uploaded a new patch set (22). As explained in my previous comment, this skips remote cache usage, and the verification was successful: [1].

I'm going to remove the changes in .bazelversion and add the option --experimental_remote_cache_eviction_retries=5, as suggested by @coeuvre .

[1] https://gerrit-review.googlesource.com/c/gerrit/+/387837/22

davido commented 11 months ago

@coeuvre, adding the --experimental_remote_cache_eviction_retries option fixed the build.

tjgq commented 11 months ago

@davido Can you confirm whether entries can spuriously disappear from your disk and/or remote cache in between builds? If they can, then you must use --experimental_remote_cache_eviction_retries, possibly in conjunction with --experimental_remote_cache_lease_extension. Otherwise, there might be a bug in Bazel.
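
As a rough illustration of how the two flags named above could be combined, here is a minimal sketch. It assumes the CI keeps passing its options through the BAZEL_OPTS variable shown earlier in this thread; the retry count of 5 is the value suggested above, not a recommendation, and the final line only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: the remote cache URL and the BAZEL_OPTS variable are taken from
# this thread; combining the flags this way is an assumption.
BAZEL_OPTS="--remote_cache=https://gerrit-ci.gerritforge.com/cache"
# Retry the build if a blob turns out to have been evicted from the cache.
BAZEL_OPTS="$BAZEL_OPTS --experimental_remote_cache_eviction_retries=5"
# Optionally try to keep referenced blobs alive while the build is running.
BAZEL_OPTS="$BAZEL_OPTS --experimental_remote_cache_lease_extension"
export BAZEL_OPTS

# Print the command line the CI would then run.
echo "bazelisk build $BAZEL_OPTS plugins:core release api"
```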

davido commented 11 months ago

@lucamilanesio Can you help answer @tjgq's question?

meteorcloudy commented 11 months ago

Since it's still unclear if this is a Bazel bug, I'll remove this bug as a release blocker for 7.0. Closing https://github.com/bazelbuild/bazel/issues/20175.

davido commented 11 months ago

@meteorcloudy Agreed. Let's close this then as not an issue.

lucamilanesio commented 10 months ago

> @davido Can you confirm whether entries can spuriously disappear from your disk and/or remote cache in between builds?

They cannot disappear from the local disk; however, once a day, during the remote cache cleanups, they can be removed remotely. The step that is failing, though, did not use any remote cache: how is it possible that Bazel would assume that the cache is remote if there isn't a remote cache configured?

It looks like the local cache "remembers" that it was fed by a remote cache, because the previous step actually used a remote cache for the initial build.

> If they can, then you must use --experimental_remote_cache_eviction_retries, possibly in conjunction with --experimental_remote_cache_lease_extension. Otherwise, there might be a bug in Bazel.

Well, but that isn't the case, as mentioned above.

If I add the remote cache URL to the .bazelrc to make sure that it is always used in all invocations, the problem disappears. Has something changed in the remote cache management between Bazel 7.0.0-rc2 and 7.0.0-rc3?

davido commented 10 months ago

Reopening the issue, as we are seeing this on Gerrit CI again and this downstream issue with priority 0 was filed: 1.

Excerpt from downstream issue:

The build steps that are executed for the validation are:

#0
export BAZEL_OPTS=--remote_cache=https://gerrit-ci.gerritforge.com/cache
#1
bazelisk build $BAZEL_OPTS plugins:core release api
#2
tools/maven/api.sh install
#3
tools/eclipse/project.py --bazel bazelisk

Only the first build command above uses the remote cache; the subsequent commands don't, and they started to fail consistently on Gerrit CI after the bump of the Bazel version from 7.0.0-rc2 to 7.0.0-rc3.

The second command: tools/maven/api.sh is here: 2, and is actually running this build command (without remote-cache usage):

bazelisk build //tools/maven:gen_api_install

Which is failing with this error now:

com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 892c651b04360ae932e9843f7d2233e4476e5f60dd835a865fb49bf7a48f6e66/230925 for bazel-out/k8-fastbuild/bin/external/sshd-sftp/jar/_ijar/jar/sshd-sftp/jar/sshd-sftp-2.10.0-ijar.jar
Target //tools/maven:gen_api_install failed to build

@coeuvre @tjgq @meteorcloudy @fmeum Any clue what is going on here and how can we further track it down?

In fact, passing --experimental_remote_cache_eviction_retries=5 helps, but isn't it wrong to use this as a workaround for a build command that shouldn't use the remote cache in the first place?

Also note, that if we pass the remote cache option to all three build commands above, they all succeed.

So, in both cases (with and without remote cache): we are using repository cache and disk cache, as part of the .bazelrc:

--repository_cache=~/.gerritcodereview/bazel-cache/repository --disk_cache=~/.gerritcodereview/bazel-cache/cas

^^^ Can it be somehow related?

davido commented 10 months ago

I can reproduce the issue locally now. As assumed, the problem is related to the disk cache.

Here are the steps:

  1. Install the remote cache https://github.com/buchgr/bazel-remote
  2. I used the Docker image with these commands:
$ docker pull buchgr/bazel-remote-cache
$ docker run -u 1000:1000 -v /path/to/cache/dir:/data \
    -p 9090:8080 -p 9092:9092 buchgr/bazel-remote-cache \
    --max_size 5
  3. Build gerrit@HEAD, currently on Bazel release 7.0.0, using the remote cache. Note that the disk cache is used as well:
$ bazelisk build --remote_cache=http://server:9090 plugins:core release api
  4. Wipe out the disk cache. Note that the disk cache specified in the gerrit/.bazelrc file is located in ~/.gerritcodereview/bazel-cache/cas
$ rm -rf ~/.gerritcodereview/bazel-cache/cas/
  5. Build Gerrit without using the remote cache:
davido@localhost:~/projects/gerrit (master %>)$ tools/eclipse/project.py --bazel bazelisk
INFO: Invocation ID: 6084a97c-1b8d-4850-bcb1-f37c2f84fa37
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'info' from /home/davido/projects/gerrit/.bazelrc:
  Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'info' from /home/davido/projects/gerrit/.bazelrc:
  Inherited 'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Invocation ID: 3e774f39-267e-4841-8a37-b1e2890edb39
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:main_classpath_collect (10 packages loaded, 182 targets configured).
INFO: Found 1 target...
Target //tools/eclipse:main_classpath_collect up-to-date:
  bazel-bin/tools/eclipse/main_classpath_collect.runtime_classpath
INFO: Elapsed time: 1.093s, Critical Path: 0.81s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Invocation ID: 578b1a90-ad9f-478b-98f4-20818be06888
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:autovalue_classpath_collect (0 packages loaded, 7 targets configured).
INFO: Found 1 target...
Target //tools/eclipse:autovalue_classpath_collect up-to-date:
  bazel-bin/tools/eclipse/autovalue_classpath_collect.runtime_classpath
INFO: Elapsed time: 1.111s, Critical Path: 0.69s
INFO: 2 processes: 2 internal.
INFO: Build completed successfully, 2 total actions
INFO: Invocation ID: 0c23b3fe-303d-4076-97c6-488fbf009f94
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=147
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  Inherited 'common' options: --noenable_bzlmod
INFO: Reading rc options for 'build' from /home/davido/projects/gerrit/.bazelrc:
  'build' options: --workspace_status_command=python3 ./tools/workspace_status.py --repository_cache=~/.gerritcodereview/bazel-cache/repository --action_env=PATH --disk_cache=~/.gerritcodereview/bazel-cache/cas --java_language_version=17 --java_runtime_version=remotejdk_17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17 --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --incompatible_strict_action_env --announce_rc
INFO: Analyzed target //tools/eclipse:classpath (0 packages loaded, 1 target configured).
ERROR: /home/davido/projects/gerrit/proto/testing/BUILD:4:14: Generating proto_library //proto/testing:test_proto failed: Failed to fetch blobs because they do not exist remotely.: 3 errors during bulk transfer:
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 74c97c32ccbc58b7d77ca61e6ec0d576d9f47173b3360c4f31e73a265162cd1f/4388096 for bazel-out/k8-opt-exec-ST-13d3ddad9198/bin/external/com_google_protobuf/protoc
Target //tools/eclipse:classpath failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.845s, Critical Path: 0.28s
INFO: 9 processes: 4 internal, 2 linux-sandbox, 3 worker.
ERROR: Build did NOT complete successfully

lucamilanesio commented 10 months ago

Good catch @davido, I truly believe that Bazel keeps some local record in the disk cache that it was populated from a remote source. When you no longer specify the remote source in the subsequent commands, Bazel blows up with the error you've shown, which is misleading because it isn't really a network transfer problem at all.

I wrongly assumed that we had issues with our remote cache storage, but that wasn't the case.

coeuvre commented 10 months ago

Thanks for the repro! I am looking into the issue now.

coeuvre commented 10 months ago

I understand the issue now. Since 7.0.0, Bazel uses --remote_download_toplevel by default which means intermediate outputs will not be downloaded during the build.

Looking at the error builds in the CI, the scenario might be:

  1. In the first build with both disk and remote cache, Bazel hit the remote cache but didn't download, e.g., bazel-out/k8-fastbuild/bin/external/sshd-sftp/jar/_ijar/jar/sshd-sftp/jar/sshd-sftp-2.10.0-ijar.jar due to --remote_download_toplevel. Both disk cache and Bazel's output tree are not populated with this file. However, the action result is downloaded and stored in the disk cache.
  2. In a following build with disk cache only, Bazel can hit the local disk cache for the action result. But when Bazel needs to download the output file (because it's an input to downstream actions), it cannot download it from the disk cache. So CacheNotFoundException.

For the repro, wiping out the disk cache could also trigger the error for the same reason: Bazel didn't download the outputs during the last build, so when it needs an output now and fails to "download" it from the disk cache, it reports CacheNotFoundException.
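
The scenario above can be mimicked with a toy model. This is only a sketch under stated assumptions: the ac/ and cas/ directory names, the action name, and the digest are hypothetical stand-ins, and nothing here reflects Bazel's real on-disk format or code. It just shows how an action result can be present locally while the blob it names is not.

```shell
#!/bin/sh
# Toy model, NOT Bazel's implementation: ac/ stands in for cached action
# results and cas/ for the stored output blobs. All names are hypothetical.
cache=$(mktemp -d)
mkdir -p "$cache/ac" "$cache/cas"

# Build 1 (remote + disk cache, outputs not downloaded): only the action
# result, which records the output's digest, ends up in the local cache.
digest="abc123"   # hypothetical placeholder digest
printf '%s\n' "$digest" > "$cache/ac/compile_action"

# Build 2 (disk cache only): look up the action result, then try to fetch the
# blob it refers to from the local store.
fetch_output() {
    wanted=$(cat "$cache/ac/$1") || return 2   # no action result at all
    if [ -f "$cache/cas/$wanted" ]; then
        echo "hit: $wanted"
    else
        echo "Missing digest: $wanted"
        return 1
    fi
}

result=$(fetch_output compile_action)
status=$?
echo "$result (exit status $status)"
```

In this model, storing the blob during the first build (the analogue of downloading all outputs) makes the second lookup succeed.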

Internally, Bazel does keep some references to the disk or remote cache, because when building with --remote_download_[toplevel|minimal], Bazel won't download some of the outputs. It only remembers the metadata so that the outputs can be re-downloaded later.

From the CI setup, it seems that you want to populate the disk cache using remote cache during the first build. If so, I would suggest setting --remote_download_all for the first build. Otherwise, --experimental_remote_cache_eviction_retries is the right flag for this issue.
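
A minimal sketch of what that suggestion might look like for the CI steps quoted earlier in this thread. The command lines and cache URL come from the thread; the DRY_RUN switch and the run helper are hypothetical additions so the sketch prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Sketch only: commands are copied from this thread; DRY_RUN and run() are
# hypothetical helpers for illustration.
REMOTE_CACHE="--remote_cache=https://gerrit-ci.gerritforge.com/cache"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"     # print the command instead of running it
    else
        "$@"
    fi
}

# First build: use the remote cache and download all outputs, so that later
# builds without the remote cache can find them locally.
run bazelisk build $REMOTE_CACHE --remote_download_all plugins:core release api

# Subsequent steps build without the remote cache.
run tools/maven/api.sh install
run tools/eclipse/project.py --bazel bazelisk
```

Setting DRY_RUN=0 would execute the commands for real; whether the later steps then need any extra flags depends on the setup and is not confirmed here.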

coeuvre commented 10 months ago

This is more like a documentation issue, not a real bug in Bazel. Downgrading the priority.

lucamilanesio commented 10 months ago

> This is more like a documentation issue, not a real bug in Bazel. Downgrading the priority.

Should this be considered a breaking change in Bazel 7 compared to 6? I guess the default behaviour has changed in a non-backward compatible way. Thanks for the suggestions, I am adding the --remote_download_all in the initial build so that all the remote resources needed are loaded locally.

That doesn't impact our build time because we always start the build with a pre-warmed Docker image that has an initial build completed. I have actually noticed that the image built was very small compared to the previous releases, which means that a lot of data was not stored anymore locally.

I agree to downgrading to a P2.

coeuvre commented 10 months ago

> Should this be considered a breaking change in Bazel 7 compared to 6?

Yes, it's a breaking change. It is highlighted in the release notes: https://blog.bazel.build/2023/12/11/bazel-7-release.html#build-without-the-bytes-bwob, we probably should've made it more clear that it's a breaking change.

xiemotongye commented 9 months ago

I'm trying to upgrade to Bazel 7.0 in our iOS project. Everything works fine with Bazel 6.3.2.

But when I upgraded Bazel to 7.0, I ran into the same issue. As mentioned above, it seems that this problem occurs when both disk and remote cache are used. But I'm pretty sure I'm not using a disk cache or RBE.

Here is the output:

'build' options: --verbose_failures --announce_rc --apple_platform_type=ios --show_progress_rate_limit=5 --output_filter=^$ --ios_minimum_os=11.0 --macos_minimum_os=12.0 --host_macos_minimum_os=12.0 --use_top_level_targets_for_symlinks --incompatible_strict_action_env --define=apple.compress_ipa=true --experimental_cc_implementation_deps --experimental_guard_against_concurrent_changes --profile=bazel-profile --experimental_objc_include_scanning --experimental_remote_cache_compression --features=oso_prefix_is_pwd --features=layering_check --features=swift.skip_function_bodies_for_derived_files --features=swift.minimal_deps --features=swift.layering_check --features=swift.module_map_no_private_headers --remote_timeout=100s --reuse_sandbox_directories --spawn_strategy=local --genrule_strategy=local
INFO: Reading rc options for 'build' from /Volumes/workspace/grunner/builds/Hwyyfv8c/0/ios/loktar/ci.bazelrc:
  'build' options: --objc_enable_binary_stripping --objc_generate_linkmap --strip=always --apple_generate_dsym --remote_local_fallback --local_cpu_resources=HOST_CPUS*.9 --features=swift.use_explicit_swift_module_map --remote_cache=http://my-remote-cache.co/ios
INFO: Found applicable config definition build:strict in file /Volumes/workspace/grunner/builds/Hwyyfv8c/0/ios/loktar/rules.bazelrc: --copt=-Werror
Computing main repo mapping: 
Loading: 
Loading: 0 packages loaded
Analyzing: target //srcs:app (0 packages loaded, 0 targets configured)
Analyzing: target //srcs:app (0 packages loaded, 0 targets configured)
[0 / 1] [Prepa] BazelWorkspaceStatusAction stable-status.txt
INFO: Analyzed target //srcs:app (0 packages loaded, 0 targets configured).
[9,975 / 27,591] AssetCatalogCompile srcs/app-intermediates/xcassets; 4s local ... (55 actions, 1 running)
[17,013 / 30,781] AssetCatalogCompile srcs/app-intermediates/xcassets; 9s local ... (55 actions, 1 running)
[23,414 / 33,172] AssetCatalogCompile srcs/app-intermediates/xcassets; 14s local ... (49 actions, 1 running)
[25,897 / 33,172] AssetCatalogCompile srcs/app-intermediates/xcassets; 19s local ... (48 actions, 1 running)
[28,437 / 33,172] AssetCatalogCompile srcs/app-intermediates/xcassets; 24s local ... (45 actions, 1 running)
[30,608 / 33,172] AssetCatalogCompile srcs/app-intermediates/xcassets; 29s local ... (44 actions, 1 running)
[32,933 / 33,172] AssetCatalogCompile srcs/app-intermediates/xcassets; 34s local ... (49 actions, 1 running)
ERROR: /Volumes/workspace/grunner/builds/Hwyyfv8c/0/ios/loktar/srcs/BUILD:601:16: SwiftStdlibCopy srcs/app-intermediates/swiftlibs failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: f5f2f1aa89a7d08abd93a7b1a2a21a6621b01a93314b40360c5bd1c44e6e2cb3/271080288 for bazel-out/ios_arm64-opt-ios-arm64-min11.0-applebin_ios-ST-ae93c8b2d27f/bin/srcs/app_bin
ERROR: /Volumes/workspace/grunner/builds/Hwyyfv8c/0/ios/loktar/srcs/BUILD:601:16: SwiftStdlibCopy srcs/app-intermediates/swiftlibs_for_swiftsupport failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: f5f2f1aa89a7d08abd93a7b1a2a21a6621b01a93314b40360c5bd1c44e6e2cb3/271080288 for bazel-out/ios_arm64-opt-ios-arm64-min11.0-applebin_ios-ST-ae93c8b2d27f/bin/srcs/app_bin
Target //srcs:app failed to build

Passing --remote_download_all worked for me, but --experimental_remote_cache_eviction_retries=5 did not. I believe it is related to BwoB (Build without the Bytes), but I have no idea why it happens when no disk cache is in use.

Additional notes: I'm using a no-remote tag in my top-level target:

ios_application(
    name = "app",
    ...
    tags = ["no-remote"],
)

Passing --experimental_remote_downloader_local_fallback also helps.
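
For reference, the workarounds reported in the comments above could be collected in a .bazelrc roughly as follows. This is only a sketch of the flags named in this thread, not a fix for the underlying bug; whether each one helps in a given setup is an assumption to verify locally.

```
# Workarounds reported in this thread; they sidestep the "Missing digest"
# error rather than fix its cause.

# Download all remote outputs instead of building without the bytes.
build --remote_download_all

# Reported above as also helping.
build --experimental_remote_downloader_local_fallback

# Reported above as NOT helping in the iOS case; left commented out.
# build --experimental_remote_cache_eviction_retries=5
```

Downloading all outputs gives up the bandwidth savings of BwoB, so it is probably best treated as a temporary measure.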

fmeum commented 5 months ago

@coeuvre Just ran into this with bazel run -c opt //src/java_tools/buildjar/java/com/google/devtools/build/java/turbine:turbine_benchmark --disk_cache=some/path, which worked in the past and only uses --disk_cache internally. It changes the value to a special directory it creates and then reproducibly runs into the "Missing digest" error. This seems like more than a documentation issue.

luispadron commented 5 months ago

Just a +1: I'm seeing a similar issue:

11:01:10 ERROR: Foo/BUILD.bazel:11:15: Compiling Foo.c failed: unable to finalize action: Missing digest: <number>/<number> for bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-<sha>/bin/path/to/Foo.d

Our setup is a bit different, though, as we're testing with 7.1.1 and:

How can we have issues downloading here since BwtB is disabled?