CDCgov / trusted-intermediary

Bringing together healthcare providers by reducing the connection burden.
Apache License 2.0
11 stars 5 forks source link

🐞 - Intermittent permission failures retrieving secrets from KeyVault #1386

Open basiliskus opened 5 days ago

basiliskus commented 5 days ago

Bug

Describe the Bug

We're getting an intermittent exception in TI staging when RS sends a message to TI and TI tries to authenticate the token received with its public key. We don't know why sometimes it happens, and sometimes not.

Here's the stacktrace for one that happened on 2024-10-04T18:40:50.4613617Z UTC:

gov.hhs.cdc.trustedintermediary.wrappers.SecretRetrievalException: Not able to retrieve secret trusted-intermediary-public-key-stg from Azure
at gov.hhs.cdc.trustedintermediary.external.azure.AzureSecrets.getKey(AzureSecrets.java:50)
at gov.hhs.cdc.trustedintermediary.auth.AuthRequestValidator.retrievePublicKey(AuthRequestValidator.java:66)
at gov.hhs.cdc.trustedintermediary.auth.AuthRequestValidator.isValidAuthenticatedRequest(AuthRequestValidator.java:50)
at gov.hhs.cdc.trustedintermediary.external.javalin.DomainsRegistration.authenticateRequest(DomainsRegistration.java:158)
at gov.hhs.cdc.trustedintermediary.external.javalin.DomainsRegistration.processRequest(DomainsRegistration.java:140)
at gov.hhs.cdc.trustedintermediary.external.javalin.DomainsRegistration.lambda$createHandler$4(DomainsRegistration.java:128)
at io.javalin.router.Endpoint.handle(Endpoint.kt:52)
at io.javalin.router.ParsedEndpoint.handle(ParsedEndpoint.kt:15)
at io.javalin.http.servlet.DefaultTasks.HTTP$lambda$9$lambda$7$lambda$6(DefaultTasks.kt:52)
at io.javalin.http.servlet.JavalinServlet.handleTask(JavalinServlet.kt:99)
at io.javalin.http.servlet.JavalinServlet.handleSync(JavalinServlet.kt:64)
at io.javalin.http.servlet.JavalinServlet.handle(JavalinServlet.kt:50)
at io.javalin.http.servlet.JavalinServlet.service(JavalinServlet.kt:30)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)
at io.javalin.jetty.JavalinJettyServlet.service(JavalinJettyServlet.kt:52)
at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:529)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1580)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1553)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
at org.eclipse.jetty.server.Server.handle(Server.java:563)
at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: com.azure.core.exception.ResourceModifiedException: Status code 403, "{"error":{"code":"Forbidden","message":"The user, group or application 'appid=d75a1e3a-b57b-4b9d-a719-6a8c95ced08d;oid=ec77be10-b1cb-4063-9936-27c21ef8cca0;iss=https://sts.windows.net/9ce70869-60db-44fd-abe8-d2767077fc8f/' does not have secrets get permission on key vault 'ti-key-vault-stg;location=eastus'. For help resolving this issue, please see https://go.microsoft.com/fwlink/?linkid=2125287","innererror":{"code":"AccessDenied"}}}"
at com.azure.security.keyvault.secrets.SecretAsyncClient.mapGetSecretException(SecretAsyncClient.java:430)
at com.azure.security.keyvault.secrets.SecretClient.callWithMappedException(SecretClient.java:1027)
at com.azure.security.keyvault.secrets.SecretClient.getSecretWithResponse(SecretClient.java:359)
at com.azure.security.keyvault.secrets.SecretClient.getSecret(SecretClient.java:296)
at gov.hhs.cdc.trustedintermediary.external.azure.AzureSecrets.getKey(AzureSecrets.java:47)
... 46 common frames omitted

Impact

Please describe the impact this bug is causing to you.

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected Behavior

A clear and concise description of what you expected to happen.

Screenshots

If applicable, add screenshots to help explain your problem.

Logs

If applicable, please attach logs to help describe your problem.

Version

The version or git commit sha of the application that the bug found on.

Additional Context

Add any other context about the problem here.

GilmoreA6 commented 2 days ago

dependency azure keyvault secrets

update merged to main: Sat Sep 28 10:45:50 2024 +0000 errors in staging: Fri Sep 23 4:16 UTC

update deployed to production: Mon, 30 Sep 2024 21:17:16 UTC errors appear in production logs: Mon 30 Sep 2024 22:22:25. UTC

reviewed logs back to 09/07/2024 and confirmed these are the first dates the errors appear. Due to the timing it does not look like this is an issue with this particular dependency as the errors appear before the dependency is updated

somesylvie commented 16 hours ago

More description of what's happening:

Query to find logs: AppServiceConsoleLogs | where ResultDescription contains "Not able to retrieve secret"

The logs themselves are being created by a live slot (you can see that in the resource ID of the log), but the resource mentioned within the log that isn't able to access the secret is a pre-live slot

Permissions for the live slot/main web app to access the key vault: https://github.com/CDCgov/trusted-intermediary/blob/main/operations/template/key.tf#L56

somesylvie commented 15 hours ago

Which host this happens on does change over time, but the errors don't interweave - so both errors on 9/23 were on ln0sdlwk0003CL, then all the errors from 9/27-9/30 were on ln1sdlwk0000QN, all but the last error on 10/4 were on ln0sdlwk0002SL, and the last error on 10/4 through 10/8 was on ln0sdlwk0002SL. There was a deploy on 10/4 that might have happened between the two errors where the impacted host changed, but 1) the terraform part of that deploy didn't touch either the web app or the keyvault and 2) there have been plenty of other deploys in the meantime

Flexion RS config went into prod on 9/23 around 3pm eastern, and there was a TI deploy to prod later that day, but the prod errors didn't start until 9/30 - so it could be something that went in with the 9/30 deploy or something that's independent of deploys

GilmoreA6 commented 12 hours ago

~~Logs suggest the same request may be trying to be acted upon by multiple instances where one succeeds and another fails. Logs from 10/7 6:15 PM UTC show that both ln1sdlwk0000QN and ln0sdlwk0002Y1 received the POST from RS to the auth/token endpoint with ln0sdlwk0002Y1 returning a "Bad authentication service config" message first, resulting in a 500 response to RS and ln1sdlwk0000QN successfully generating a token and returning a 200 response but after RS had already received the 500 response.~~

somesylvie commented 11 hours ago

Gilmore's note above suggests there may be something weird going on with the load balancer

GilmoreA6 commented 11 hours ago

update: my theory above is incorrect, there are multiple requests coming from RS, one instance is succeeding and one is failing randomly, most of the time both succeed.

somesylvie commented 11 hours ago

When we send 200 responses from the token endpoint, it looks like it's often the cached key, so we confirmed that the-key-we're-looking-for does still exist in staging and is spelled properly

somesylvie commented 9 hours ago

We turned on diagnostic logs for the internal TI key vault, which will show more data on each attempt to access a secret. In both staging and internal, we're seeing as many 401/403 calls as 200s in the key vault metrics - this makes us suspect that the key vault client tries an un-authed request first and then an authed one, not sure if that would have anything to do with the errors we're seeing in the application

We also noted that the exception coming from the Azure SDK is a ResourceModifiedException which seems odd for a read operation

We have some suspicions around concurrency limits (since the error mostly happens when there are two simultaneous calls), but Azure's docs don't back up that suspicion