bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

[Bazel CI] Tests are slow and flaky on macOS arm64 #23726

Closed meteorcloudy closed 1 month ago

meteorcloudy commented 1 month ago

//src/test/shell/bazel:bazel_bootstrap_distfile_tar_test and //src/test/shell/bazel:bazel_determinism_test are extremely slow: https://buildkite.com/bazel/google-bazel-presubmit/builds/84154#01922292-cca6-4e56-939b-05a5c7b59da1

Flaky tests are very frequent: https://buildkite.com/bazel/google-bazel-presubmit/builds/84157#019222c1-653c-4a75-bdf7-7f76d5937946

//src/test/py/bazel:cc_import_test                                        FLAKY, failed in 1 out of 2 in 69.9s
  Stats over 2 runs: max = 69.9s, min = 55.5s, avg = 62.7s, dev = 7.2s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/py/bazel/cc_import_test/test_attempts/attempt_1.log
//src/test/shell/integration:config_stripped_outputs_test                 FLAKY, failed in 1 out of 2 in 126.8s
  Stats over 2 runs: max = 126.8s, min = 78.5s, avg = 102.6s, dev = 24.2s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/config_stripped_outputs_test/test_attempts/attempt_1.log
//src/test/shell/bazel:bazel_rules_java_override_test                     FLAKY, failed in 2 out of 3 in 52.4s
  Stats over 3 runs: max = 52.4s, min = 8.7s, avg = 31.2s, dev = 17.9s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/bazel_rules_java_override_test/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/bazel_rules_java_override_test/test_attempts/attempt_2.log
//src/test/shell/bazel:build_files_test                                   FLAKY, failed in 2 out of 3 in 49.8s
  Stats over 3 runs: max = 49.8s, min = 25.1s, avg = 41.6s, dev = 11.6s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/build_files_test/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/build_files_test/test_attempts/attempt_2.log
//src/test/shell/bazel:bazel_coverage_java_jdk21_toolchain_released_test FAILED in 3 out of 3 in 248.6s
  Stats over 3 runs: max = 248.6s, min = 199.0s, avg = 221.1s, dev = 20.6s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/bazel_coverage_java_jdk21_toolchain_released_test/test.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/bazel_coverage_java_jdk21_toolchain_released_test/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/bazel/bazel_coverage_java_jdk21_toolchain_released_test/test_attempts/attempt_2.log
//src/test/shell/integration:test_test                                   FAILED in 3 out of 3 in 216.0s
  Stats over 3 runs: max = 216.0s, min = 158.5s, avg = 191.3s, dev = 24.2s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/test_test/test.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/test_test/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/test_test/test_attempts/attempt_2.log
//src/test/java/com/google/devtools/build/lib/rules/config:ConfigRulesTests PASSED in 34.3s
  Stats over 5 runs: max = 34.3s, min = 17.8s, avg = 25.3s, dev = 7.0s
//src/test/shell/integration:bazel_sandboxed_worker_test                 FAILED in 6 out of 7 in 146.6s
  Stats over 7 runs: max = 146.6s, min = 98.7s, avg = 125.7s, dev = 15.4s
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_1_of_3/test.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_1_of_3/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_1_of_3/test_attempts/attempt_2.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_2_of_3/test.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_2_of_3/test_attempts/attempt_1.log
  /private/var/tmp/_bazel_buildkite/00e02099ed8d75d374b9c12be02eaf4c/execroot/_main/bazel-out/darwin_arm64-fastbuild/testlogs/src/test/shell/integration/bazel_sandboxed_worker_test/shard_2_of_3/test_attempts/attempt_2.log
//src/test/java/com/google/devtools/build/lib/query2/engine:AllTests     PASSED in 16.9s
  Stats over 10 runs: max = 16.9s, min = 6.2s, avg = 11.1s, dev = 3.1s
//src/test/shell/bazel/remote:remote_execution_test                      FAILED in 9 out of 12 in 256.6s

meteorcloudy commented 1 month ago

Seeing "gRPC server failed to bind to IPv4 and IPv6 localhosts on port 0: [IPv4] Failed to bind to address /127.0.0.1:0" in the test log again. This may be related to https://github.com/bazelbuild/bazel/issues/20743
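
For anyone who wants to check whether an affected VM can bind to loopback at all, here is a minimal Python sketch. It is not part of Bazel or its test suite, and the helper names `try_bind_loopback` and `check_loopback` are made up for illustration. It only asks the OS for an ephemeral port (port 0) on 127.0.0.1 and ::1; it does not go through gRPC or Netty, so a successful bind here would not rule out a problem in those layers.

```python
import socket


def try_bind_loopback(family, address):
    """Try to bind a TCP socket to port 0 on the given loopback address.

    Returns the OS-assigned port number on success, or an error string
    on failure (hypothetical helper, for illustration only).
    """
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            # Port 0 asks the OS to pick any free ephemeral port.
            sock.bind((address, 0))
            return sock.getsockname()[1]
    except OSError as exc:
        return f"bind failed: {exc}"


def check_loopback():
    """Check both loopback addresses mentioned in the error message."""
    return {
        "IPv4": try_bind_loopback(socket.AF_INET, "127.0.0.1"),
        "IPv6": try_bind_loopback(socket.AF_INET6, "::1"),
    }


if __name__ == "__main__":
    for name, result in check_loopback().items():
        print(f"[{name}] {result}")
```

Running this on a VM that shows the failure and on one that does not could help separate an OS or network configuration problem from a library-level one.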

meteorcloudy commented 1 month ago

@fweikert Do you know if there is any potential infrastructure change that could cause this?

meteorcloudy commented 1 month ago

The issue seems to be reproducible on some VMs, so it might be an infrastructure issue.

meteorcloudy commented 1 month ago

I will not dig deeper since https://github.com/bazelbuild/bazel/commit/355b000accfff4ea29876a46321308fe9422a9d1 mitigated the issue. We probably need to upgrade the gRPC and Netty versions and hope that helps: https://github.com/bazelbuild/bazel/issues/22719

meteorcloudy commented 1 month ago

First appearance seems to be: https://github.com/bazelbuild/bazel/commit/f64cdea68bb08717ea83c61d6c1567edb21c132c presubmit: https://buildkite.com/bazel/google-bazel-presubmit/builds/84129#_ postsubmit: https://buildkite.com/bazel/bazel-bazel/builds/29385

At its parent commit: presubmit: https://buildkite.com/bazel/google-bazel-presubmit/builds/84137 postsubmit: https://buildkite.com/bazel/bazel-bazel/builds/29383

meteorcloudy commented 1 month ago

Found an even earlier flaky build: https://buildkite.com/bazel/google-bazel-presubmit/builds/84128 which might rule out https://github.com/bazelbuild/bazel/commit/f64cdea68bb08717ea83c61d6c1567edb21c132c

Wyverald commented 1 month ago

Do any of these fixes need to be cherry-picked back to 7.4.0 and/or 8.0.0?

meteorcloudy commented 1 month ago

https://github.com/bazelbuild/bazel/commit/efa030314ec5f7340d770b6ae9736a9e24d29ee3 should be backported to 8.0.0

meteorcloudy commented 1 month ago

@bazel-io fork 8.0.0

iancha1992 commented 3 weeks ago

A fix for this issue has been included in Bazel 8.0.0 RC2. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.0.0rc2. Thanks!