basiliskus opened 5 days ago
Labels: dependency, azure keyvault secrets
- Update merged to main: Sat Sep 28 10:45:50 2024 +0000
- Errors in staging: Fri Sep 23 4:16 UTC
- Update deployed to production: Mon, 30 Sep 2024 21:17:16 UTC
- Errors appear in production logs: Mon 30 Sep 2024 22:22:25 UTC
Reviewed logs back to 09/07/2024 and confirmed these are the first dates the errors appear. Given the timing, this does not look like an issue with this particular dependency, since the errors appear before the dependency was updated.
More description of what's happening:
Query to find logs: `AppServiceConsoleLogs | where ResultDescription contains "Not able to retrieve secret"`
The logs themselves are being created by a live slot (you can see that in the log's resource ID), but the resource mentioned within the log that can't access the secret is a pre-live slot
Permissions for the live slot/main web app to access the key vault: https://github.com/CDCgov/trusted-intermediary/blob/main/operations/template/key.tf#L56
Which host this happens on changes over time, but the errors don't interweave:

- both errors on 9/23 were on ln0sdlwk0003CL
- all the errors from 9/27-9/30 were on ln1sdlwk0000QN
- all but the last error on 10/4 were on ln0sdlwk0002SL
- the last error on 10/4 through 10/8 was on ln0sdlwk0002SL

There was a deploy on 10/4 that might have happened between the two errors where the impacted host changed, but 1) the terraform part of that deploy didn't touch either the web app or the keyvault, and 2) there have been plenty of other deploys in the meantime
Flexion RS config went into prod on 9/23 around 3pm eastern, and there was a TI deploy to prod later that day, but the prod errors didn't start until 9/30 - so it could be something that went in with the 9/30 deploy or something that's independent of deploys
~~Logs suggest the same request may be trying to be acted upon by multiple instances where one succeeds and another fails. Logs from 10/7 6:15 PM UTC show that both ln1sdlwk0000QN and ln0sdlwk0002Y1 received the POST from RS to the auth/token endpoint with ln0sdlwk0002Y1 returning a "Bad authentication service config" message first, resulting in a 500 response to RS and ln1sdlwk0000QN successfully generating a token and returning a 200 response but after RS had already received the 500 response.~~
Gilmore's note above suggests there may be something weird going on with the load balancer
Update: my theory above is incorrect. There are multiple requests coming from RS; one instance succeeds and one fails randomly, and most of the time both succeed.
When we send 200 responses from the token endpoint, it's often using the cached key, so we confirmed that the-key-we're-looking-for does still exist in staging and is spelled correctly
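One reason the caching matters here: if successful responses are served from a cache, the underlying Key Vault call is skipped entirely, so an intermittent fetch failure would only surface on cold instances or after expiry - consistent with the errors moving between hosts over time. A minimal sketch of that caching pattern (hypothetical names, not the actual TI code):

```python
import time

# Sketch of the suspected behavior: a successful fetch populates a cache,
# and later 200 responses are served from it without touching Key Vault.
class SecretCache:
    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch          # function(name) -> secret value
        self._ttl = ttl_seconds
        self._entries = {}           # name -> (value, expiry)

    def get(self, name):
        entry = self._entries.get(name)
        if entry and entry[1] > time.monotonic():
            return entry[0]          # cache hit: no Key Vault round trip
        value = self._fetch(name)    # cache miss: hits Key Vault, can fail
        self._entries[name] = (value, time.monotonic() + self._ttl)
        return value
```

Under this pattern, only the instance whose cache is cold (or expired) actually exercises the failing Key Vault path, which would explain why errors cluster on one host at a time.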
We turned on diagnostic logs for the internal TI key vault, which will show more data on each attempt to access a secret. In both staging and internal, we're seeing as many 401/403 calls as 200s in the key vault metrics. This makes us suspect that the key vault client tries an unauthenticated request first and then an authenticated one; we're not sure whether that has anything to do with the errors we're seeing in the application.
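The equal counts of 401s and 200s would be consistent with challenge-based authentication, where the client's first request is deliberately unauthenticated, the service replies 401 with a WWW-Authenticate challenge, and the client retries with a token - in which case the 401s are expected noise rather than real failures. A sketch of that flow against a stub transport (hypothetical names; an assumption about the SDK's behavior, not confirmed from its source):

```python
# Each logical secret read would produce one 401 (the unauthenticated probe)
# followed by one 200 (the authenticated retry) in the vault's metrics.
def get_secret_with_challenge(send, name, get_token):
    status, headers, body = send(name, auth=None)       # unauthenticated probe
    if status == 401 and "WWW-Authenticate" in headers:
        token = get_token(headers["WWW-Authenticate"])  # scope comes from the challenge
        status, headers, body = send(name, auth=token)  # authenticated retry
    return status, body
```

If this is what's happening, the 401s in the metrics wouldn't correlate with the application-level "Not able to retrieve secret" errors at all.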
We also noted that the exception coming from the Azure SDK is a ResourceModifiedException, which seems odd for a read operation
We have some suspicions around concurrency limits (since the error mostly happens when there are two simultaneous calls), but Azure's docs don't back up that suspicion
Bug
Describe the Bug
We're getting an intermittent exception in TI staging when RS sends a message to TI and TI tries to authenticate the received token with its public key. We don't know why it happens sometimes and not others.
Here's the stacktrace for one that happened on 2024-10-04T18:40:50.4613617Z UTC:
Impact
Please describe the impact this bug is causing to you.
To Reproduce
Expected Behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Logs
If applicable, please attach logs to help describe your problem.
Version
The version or git commit sha of the application that the bug found on.
Additional Context
Add any other context about the problem here.