bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Bazel CI: RBE builds broken #12920

Closed: meteorcloudy closed this issue 3 years ago

meteorcloudy commented 3 years ago

rules_docker: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1889#8859d532-0524-4468-a0ad-310325e56399
rules_go: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1889#30ce89e0-766c-47a4-8daf-9b4827421fbe
rules_nodejs: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1889#54f552d8-9843-47be-aad2-a1025d28c1b9

All are failing with a similar build error:

(11:49:29) ERROR: /var/lib/buildkite-agent/builds/bk-docker-6srk/bazel-downstream-projects/rules_docker/tests/container/BUILD:775:16: GZIP tests/container/new_alpine_linux_ppc64le_image_oci_go_join_layers-layer.tar.gz failed: (Exit 34): UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
load_shed
java.io.IOException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
load_shed
    at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$handleStatus$5(GrpcCacheClient.java:244)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:192)
    at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:179)
    at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:124)
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:760)
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:83)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
    at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR

Culprit finder says a previously working Bazel commit no longer works: https://buildkite.com/bazel/culprit-finder/builds/1123

/cc @coeuvre

meteorcloudy commented 3 years ago

Oh, we also see breakage with Bazel 4.0.0: https://buildkite.com/bazel/rules-go-golang/builds/2726#_

So it looks like there were some recent RBE backend changes?

coeuvre commented 3 years ago

Looks similar to #12363.

coeuvre commented 3 years ago

cc @bergsieker

bergsieker commented 3 years ago

Can you provide an approximate time of onset?

I don't think there's anything that has changed in RBE that would cause this, but I believe that the GFEs have been rolling out new load shedding code this week. It's strange that we don't have any indication of failures in RBE's tests, though.

bergsieker commented 3 years ago

We believe that we have identified the proximate cause (GFE rollout). We have initiated a flag-flip rollout to disable the new behavior. The rollout is mostly complete for North America, and will continue to roll out globally over the next day or so.

Can you confirm that these builds have gone back to green?

Also, do these builds run on GCP machines?

coeuvre commented 3 years ago

> Can you confirm that these builds have gone back to green?

These builds have gone back to green.

> Also, do these builds run on GCP machines?

Yes, the worker-pool is projects/bazel-untrusted/instances/default_instance.