bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.3k stars 4.09k forks source link

CredentialHelper change broke unix domain socket support #16171

Closed srago closed 2 years ago

srago commented 2 years ago

We use a unix domain socket for our remote_executor and remote_cache endpoints, because we use a side-car proxy to handle SPIFFE mTLS negotiation. This stopped working with the release of bazel 5.3.0. With 5.3.0, bazel fails before issuing its first GetCapabilitiesRequest. The error message was "ERROR: Failed to query remote execution capabilities: UNAUTHENTICATED: Failed computing credential metadata." When enabling verbose failures, the stack trace pointed to a failed precondition because the host portion of the URI was null. I played around with the code and got it working again, but the fix isn't production-quality -- it doesn't handle the case where the remote cache endpoint differs from the remote executor endpoint, and might need different credentials. Because we're using unix domain sockets, this shouldn't affect us at all. Anyway, consider the following patch a place to start the conversation about what the correct fix might look like.

index a7d793ad56..d7db02bd6f 100644
--- a/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java
@@ -1055,6 +1055,22 @@ public final class RemoteModule extends BlazeModule {
       AuthAndTLSOptions authAndTlsOptions,
       RemoteOptions remoteOptions)
       throws IOException {
+
+    // TODO: this is a temporary hack.  Because remoteCache and remoteExecutor could be different, they might need
+    // different credentials.  If the endpoints used to access remote services are unix domain sockets, credential
+    // helpers aren't needed -- the service is running on the same machine.
+    //
+    // Also note that URI("unix://...") will replace "unix" with "https" for some reason that escapes me, but that
+    // makes it tricky to use.  Somewhere (not here because of the checks in the try clause below), that was
+    // happening, because by the time findCredentialHelper() is called, these URIs have been changed from "unix://"
+    // to "https://".  According to the bazel documentation, the "unix:" scheme is supposed to disable TLS.
+    if (remoteOptions.remoteCache != null && Ascii.toLowerCase(remoteOptions.remoteCache).startsWith("unix:/")) {
+           return null;
+    }
+    if (remoteOptions.remoteExecutor != null && Ascii.toLowerCase(remoteOptions.remoteExecutor).startsWith("unix:/")) {
+           return null;
+    }
+
     Credentials credentials =
         GoogleAuthUtils.newCredentials(
             credentialHelperEnvironment, commandLinePathFactory, fileSystem, authAndTlsOptions);
fmeum commented 2 years ago

@bazel-io flag

tjgq commented 2 years ago

Sorry about that. This corner of Bazel is poorly tested; I guess something like this was bound to happen.

When enabling verbose failures, the stack trace pointed to a failed precondition because the host portion of the URI was null

Could you please share the full stack trace and the exact values of --remote_executor and --remote_cache you're using?

srago commented 2 years ago

Here's the command line and stack trace. The line numbers might be off a bit -- I recompiled bazel from the head and added a couple printf statements.

$ bazel build --remote_cache=unix:///home/coder/.cache/fax/3a162c244f546dff591bc0d335e9c174/jail.sock --remote_executor=unix:///home/coder/.cache/fax/3a162c244f546dff591bc0d335e9c174/jail.sock :all --verbose_failures
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 55dd5c7b-43ea-4a6f-a6cf-0c05ce656dbe
ERROR: java.io.IOException: io.grpc.StatusRuntimeException: UNAUTHENTICATED: Failed computing credential metadata
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:90)
        at com.google.devtools.build.lib.remote.RemoteModule.verifyServerCapabilities(RemoteModule.java:196)
        at com.google.devtools.build.lib.remote.RemoteModule.beforeCommand(RemoteModule.java:492)
        at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:378)
        at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:232)
        at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
        at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
        at io.grpc.Context$1.run(Context.java:566)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: UNAUTHENTICATED: Failed computing credential metadata
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
        at build.bazel.remote.execution.v2.CapabilitiesGrpc$CapabilitiesBlockingStub.getCapabilities(CapabilitiesGrpc.java:215)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$0(RemoteServerCapabilities.java:85)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.lambda$withChannelBlocking$2(ReferenceCountedChannel.java:85)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.lambda$withChannel$4(ReferenceCountedChannel.java:108)
        at io.reactivex.rxjava3.internal.operators.single.SingleUsing.subscribeActual(SingleUsing.java:59)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:85)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onSuccess(SingleFlatMap.java:112)
        at io.reactivex.rxjava3.internal.operators.single.SingleMap$MapSingleObserver.onSuccess(SingleMap.java:65)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnDispose$DoOnDisposeObserver.onSuccess(SingleDoOnDispose.java:84)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnError$DoOnError.onSuccess(SingleDoOnError.java:52)
        at io.reactivex.rxjava3.internal.operators.observable.ObservableSingleSingle$SingleElementObserver.onComplete(ObservableSingleSingle.java:110)
        at io.reactivex.rxjava3.internal.observers.DeferredScalarDisposable.complete(DeferredScalarDisposable.java:85)
        at io.reactivex.rxjava3.subjects.AsyncSubject.subscribeActual(AsyncSubject.java:233)
        at io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
        at io.reactivex.rxjava3.internal.operators.observable.ObservableSingleSingle.subscribeActual(ObservableSingleSingle.java:36)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnError.subscribeActual(SingleDoOnError.java:35)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDoOnDispose.subscribeActual(SingleDoOnDispose.java:38)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleMap.subscribeActual(SingleMap.java:35)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback.onSuccess(SingleFlatMap.java:85)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onSuccess(SingleCreate.java:68)
        at com.google.devtools.build.lib.remote.grpc.TokenBucket$1.onNext(TokenBucket.java:79)
        at io.reactivex.rxjava3.internal.util.NotificationLite.accept(NotificationLite.java:247)
        at io.reactivex.rxjava3.subjects.BehaviorSubject$BehaviorDisposable.test(BehaviorSubject.java:507)
        at io.reactivex.rxjava3.subjects.BehaviorSubject$BehaviorDisposable.emitFirst(BehaviorSubject.java:468)
        at io.reactivex.rxjava3.subjects.BehaviorSubject.subscribeActual(BehaviorSubject.java:224)
        at io.reactivex.rxjava3.core.Observable.subscribe(Observable.java:13176)
        at com.google.devtools.build.lib.remote.grpc.TokenBucket.lambda$acquireToken$0(TokenBucket.java:64)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate.subscribeActual(SingleCreate.java:40)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleDefer.subscribeActual(SingleDefer.java:43)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.core.Single.blockingGet(Single.java:3644)
        at com.google.devtools.build.lib.remote.ReferenceCountedChannel.withChannelBlocking(ReferenceCountedChannel.java:85)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.lambda$get$1(RemoteServerCapabilities.java:84)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:125)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:114)
        at com.google.devtools.build.lib.remote.RemoteServerCapabilities.get(RemoteServerCapabilities.java:82)
        ... 10 more
Caused by: java.lang.NullPointerException
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:889)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperProvider.findCredentialHelper(CredentialHelperProvider.java:74)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials$CredentialHelperCacheLoader.load(CredentialHelperCredentials.java:132)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials$CredentialHelperCacheLoader.load(CredentialHelperCredentials.java:114)
        at com.github.benmanes.caffeine.cache.LocalLoadingCache.lambda$newMappingFunction$2(LocalLoadingCache.java:141)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2413)
        at java.base/java.util.concurrent.ConcurrentHashMap.compute(Unknown Source)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2411)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:54)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials.getRequestMetadataFromCredentialHelper(CredentialHelperCredentials.java:90)
        at com.google.devtools.build.lib.authandtls.credentialhelper.CredentialHelperCredentials.getRequestMetadata(CredentialHelperCredentials.java:74)
        at com.google.auth.Credentials.blockingGetToCallback(Credentials.java:112)
        at com.google.auth.Credentials$1.run(Credentials.java:98)
        at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:740)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:149)
        ... 57 more

ERROR: Failed to query remote execution capabilities: UNAUTHENTICATED: Failed computing credential metadata
tjgq commented 2 years ago

Thanks, this is helpful. It appears that new URI("unix:///foo/bar).getHost() == null and new URI("unix:/foo/bar").getHost() == null, but new URI("unix://foo/bar").getHost() == "foo". I wonder if using one or two slashes after unix: instead of three is a valid workaround. Hopefully the gRPC library isn't too picky...

That being said, I totally agree that we shouldn't crash here! I think the right fix is to replace the precondition check in findCredentialHelper with an early exit if the host part is empty/null. That way we don't have to make any assumptions about URI schemes. I'll put a fix together on Monday.

srago commented 2 years ago

Regardless of the number of slashes, it used to work with versions of bazel older than 5.3. I've used it on 5.2.0, 5.1.0, 5.0.0, and 4.2.2 to name a few. But I'll look into seeing how it behaves when we have only 1 slash and let you know what I find.

srago commented 2 years ago

Yep, regardless of the number of slashes, it fails at the same precondition check. (I tried 1, 2, and 3.) Also, for unix domain sockets, there is no host -- the entire remainder of the string is the pathname of the socket file, so getHost() probably should return null when the scheme is "unix:".

Wyverald commented 2 years ago

@tjgq do you think this warrants a patch release?

tjgq commented 2 years ago

@Wyverald It's a regression that affects existing users and there doesn't seem to be a workaround for it, so probably yes? (I don't know what's our patch release policy.)

@srago I'd like to go with #16184 as the fix. Could you please confirm that it works for you?

cc @Yannic - until #16185 is fixed, we should manually verify that a UDS proxy works when making changes to credential helpers.

Wyverald commented 2 years ago

@bazel-io fork 5.3.1

srago commented 2 years ago

Confirmed #16184 fixes the problem with Unix domain sockets. Thanks for the quick response time!