Azure / azure-workload-identity

Azure AD Workload Identity uses Kubernetes primitives to associate managed identities for Azure resources and identities in Azure Active Directory (AAD) with pods.
https://azure.github.io/azure-workload-identity
MIT License
298 stars 95 forks source link

failed to get token error on one azwi-proxy container from one of the PODs (out of 3) #650

Closed paulomarquesc closed 1 year ago

paulomarquesc commented 1 year ago

I'm working on adopting Azure Workload Identity and from time to time the container azwi-proxy seems to be unable to obtain tokens, here is the log from the azwi-proxy container on the failing POD (anfgenevaactionssreapiworker-5c66c488c6-4t527):

I1129 15:42:20.604970       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
E1129 15:42:50.607679       1 proxy.go:125] proxy "msg"="failed to get token" "error"="failed to acquire token: unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration\": context deadline exceeded"

This is intermittent but when it gets into this state the application running that depends on Managed Identity authentication starts failing due to the error above, also, when this happens it is not on all PODs, you can see the logs from other two (out 3 PODS) azwi-proxy container, where they seem to still be cached:

pmarques@pmarques-dev [ ~/git/AzureNetapp-GenevaActions-SRE-API ]$ kubectl logs anfgenevaactionssreapiworker-5c66c488c6-ztdrt azwi-proxy -f
I1118 22:14:43.700723       1 proxy.go:97] proxy "msg"="starting the proxy server" "port"=8081 "userAgent"="azure-workload-identity/proxy/v0.14.0 (linux/amd64) 0198270/2022-10-20-21:17"
I1118 22:14:43.752357       1 proxy.go:173] proxy "msg"="received readyz request" "method"="GET" "uri"="/readyz"
I1118 22:14:50.816573       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01"
I1118 22:14:50.817217       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:14:50.991554       1 proxy.go:129] proxy "msg"="successfully acquired token" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:14:54.692395       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01"
I1118 22:14:54.693135       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:14:54.948651       1 proxy.go:129] proxy "msg"="successfully acquired token" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"

pmarques@pmarques-dev [ ~/git/AzureNetapp-GenevaActions-SRE-API ]$ kubectl logs anfgenevaactionssreapiworker-5c66c488c6-v6ptx azwi-proxy -f
I1118 22:16:41.327611       1 proxy.go:97] proxy "msg"="starting the proxy server" "port"=8081 "userAgent"="azure-workload-identity/proxy/v0.14.0 (linux/amd64) 0198270/2022-10-20-21:17"
I1118 22:16:41.403870       1 proxy.go:173] proxy "msg"="received readyz request" "method"="GET" "uri"="/readyz"
I1118 22:16:50.359903       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01"
I1118 22:16:50.361570       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:16:50.798925       1 proxy.go:129] proxy "msg"="successfully acquired token" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:16:54.316535       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01"
I1118 22:16:54.317374       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1118 22:16:54.653885       1 proxy.go:129] proxy "msg"="successfully acquired token" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"

Here is the output of the get pods and we can see that the one that is trying to obtain the new token is the one that keeps restarting:

pmarques@pmarques-dev [ ~/git/AzureNetapp-GenevaActions-SRE-API ]$ kubectl get pods
NAME                                            READY   STATUS    RESTARTS        AGE
anfgenevaactionssreapi-769c58cc9-mqnlq          2/2     Running   0               10d
anfgenevaactionssreapi-769c58cc9-pr5hv          2/2     Running   0               10d
anfgenevaactionssreapi-769c58cc9-x2g6q          2/2     Running   0               10d
anfgenevaactionssreapiworker-5c66c488c6-4t527   2/2     Running   98 (105m ago)   10d
anfgenevaactionssreapiworker-5c66c488c6-v6ptx   2/2     Running   0               10d
anfgenevaactionssreapiworker-5c66c488c6-ztdrt   2/2     Running   0               10d
aramase commented 1 year ago

E1129 15:42:50.607679 1 proxy.go:125] proxy "msg"="failed to get token" "error"="failed to acquire token: unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration\": context deadline exceeded"

@paulomarquesc context deadline exceeded indicates there is a network issue from pod to that endpoint. As this is happening only a single node, it might be a network issue on that node. You can validate connectivity by exec into the sidecar container and trying to connect to that endpoint.

paulomarquesc commented 1 year ago

Hi @aramase, is there any other log I check? Because while inside the sidecar, I can do a wget on login.microsoftonline.com without any issues:

root [ / ]# wget login.microsoftonline.com
--2022-11-29 17:51:04--  http://login.microsoftonline.com/
Resolving login.microsoftonline.com... 40.126.26.132, 20.190.154.136, 20.190.154.138, ...
Connecting to login.microsoftonline.com|40.126.26.132|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://login.microsoftonline.com:443/ [following]
--2022-11-29 17:51:04--  https://login.microsoftonline.com/
Connecting to login.microsoftonline.com|40.126.26.132|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.office.com/login# [following]
--2022-11-29 17:51:05--  https://www.office.com/login
Resolving www.office.com... 13.107.6.156, 2620:1ec:a92::156
Connecting to www.office.com|13.107.6.156|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=4765445b-32c6-49b0-83e6-1d93765276ca&redirect_uri=https%3A%2F%2Fwww.office.com%2Flandingv2&response_type=code%20id_token&scope=openid%20profile%20https%3A%2F%2Fwww.office.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce=638053410651413416.N2QzM2ZkOTctNmM2Yy00M2Q0LWI0Y2UtYmVmNDNlOTg3MDUxOTcxM2QwZjMtYWJkOS00OTA4LTg3MDktMTRmM2Q1MWU2ZGEx&ui_locales=en-US&mkt=en-US&state=EDrpOA6KU_msgBqSvCCZRwfL_63LNJoZ5nBTFbzHFO0tQKcy849tkTP3LEdt8pTs73IeUueAKloKBuCmFdWvbXN3yE2DWdVKE_mwSZth3PluqCR4a8TdqvtsM2EubLXHhj7cifCC39dJLYi0T-Fgprbma6vrMtKMoVjRDUi5VQoWBDmrPWzwGecb9ut5WdHVMH5PpODVrJoaleedsog5bM1e0Sx9_jatJp7wTv5jNnDb5zuYtQJi3PHTbBtx6dwlUgzLqFSPEq4oro4YbC9I7w&x-client-SKU=ID_NETSTANDARD2_0&x-client-ver=6.16.0.0 [following]
--2022-11-29 17:51:05--  https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=4765445b-32c6-49b0-83e6-1d93765276ca&redirect_uri=https%3A%2F%2Fwww.office.com%2Flandingv2&response_type=code%20id_token&scope=openid%20profile%20https%3A%2F%2Fwww.office.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce=638053410651413416.N2QzM2ZkOTctNmM2Yy00M2Q0LWI0Y2UtYmVmNDNlOTg3MDUxOTcxM2QwZjMtYWJkOS00OTA4LTg3MDktMTRmM2Q1MWU2ZGEx&ui_locales=en-US&mkt=en-US&state=EDrpOA6KU_msgBqSvCCZRwfL_63LNJoZ5nBTFbzHFO0tQKcy849tkTP3LEdt8pTs73IeUueAKloKBuCmFdWvbXN3yE2DWdVKE_mwSZth3PluqCR4a8TdqvtsM2EubLXHhj7cifCC39dJLYi0T-Fgprbma6vrMtKMoVjRDUi5VQoWBDmrPWzwGecb9ut5WdHVMH5PpODVrJoaleedsog5bM1e0Sx9_jatJp7wTv5jNnDb5zuYtQJi3PHTbBtx6dwlUgzLqFSPEq4oro4YbC9I7w&x-client-SKU=ID_NETSTANDARD2_0&x-client-ver=6.16.0.0
Connecting to login.microsoftonline.com|40.126.26.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 201363 (197K) [text/html]
Saving to: 'index.html'

index.html                            100%[======================================================================>] 196.64K  --.-KB/s    in 0.001s

2022-11-29 17:51:05 (146 MB/s) - 'index.html' saved [201363/201363]

And it also works fine following the whole URI (https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration):

root [ / ]# wget https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration
--2022-11-29 17:53:07--  https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration
Resolving login.microsoftonline.com... 20.190.154.139, 20.190.154.16, 20.190.154.18, ...
Connecting to login.microsoftonline.com|20.190.154.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1753 (1.7K) [application/json]
Saving to: 'openid-configuration'

openid-configuration                  100%[======================================================================>]   1.71K  --.-KB/s    in 0s

2022-11-29 17:53:07 (880 MB/s) - 'openid-configuration' saved [1753/1753]

root [ / ]# cat openid-configuration
{"token_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/token","token_endpoint_auth_methods_supported":["client_secret_post","private_key_jwt","client_secret_basic"],"jwks_uri":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/discovery/v2.0/keys","response_modes_supported":["query","fragment","form_post"],"subject_types_supported":["pairwise"],"id_token_signing_alg_values_supported":["RS256"],"response_types_supported":["code","id_token","code id_token","id_token token"],"scopes_supported":["openid","profile","email","offline_access"],"issuer":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0","request_uri_parameter_supported":false,"userinfo_endpoint":"https://graph.microsoft.com/oidc/userinfo","authorization_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/authorize","device_authorization_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/devicecode","http_logout_supported":true,"frontchannel_logout_supported":true,"end_session_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/logout","claims_supported":["sub","iss","cloud_instance_name","cloud_instance_host_name","cloud_graph_host_name","msgraph_host","aud","exp","iat","auth_time","acr","nonce","preferred_username","name","tid","ver","at_hash","c_hash","email"],"kerberos_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/kerberos","tenant_region_scope":"WW","cloud_instance_name":"microsoftonline.com","cloud_graph_host_name":"graph.windows.net","msgraph_host":"graph.microsoft.com","rbac_url":"https://pas.windows.net"}

And now it seems to be working again:

I1129 17:31:13.901033       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
I1129 17:31:14.532131       1 proxy.go:129] proxy "msg"="successfully acquired token" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"

This seems not well stable, I never ran into this type of issue using the deprecated AAD container solution.

Another thing that is puzzling me is that why this solution is using the global AAD endpoint instead of the regional ones (e-stsr)

aramase commented 1 year ago

This seems not well stable, I never ran into this type of issue using the deprecated AAD container solution.

If it's working now, that could mean intermittent network issue in the node. The old AAD Pod Identity project relied on getting tokens from IMDS and not directly from AAD endpoint. IMDS is not used in workload identity federation. Also, the proxy is just a migration solution. Long term, if you use the latest SDK, you don't need to use the proxy sidecar.

Another thing that is puzzling me is that why this solution is using the global AAD endpoint instead of the regional ones (e-stsr)

ests-r currently only supports 1st party apps. There are plans to support it for 3P Apps. Also, that would be done internally in the SDK and not specifically in the proxy. So whenever that's supported in the MSAL SDK with 3P Apps, the proxy will use it.

paulomarquesc commented 1 year ago

Thanks @aramase, is there a way I can set a liveness probe (or similar mechanism) in this sidecar so k8s can recycle it if it is having these issues?

paulomarquesc commented 1 year ago

@aramase , something is not right with the azwi-proxy, I redeployed a few times, everything was fine, then another sidecar started to act in the same way:

I1201 15:47:23.282305       1 proxy.go:107] proxy "msg"="received token request" "method"="GET" "uri"="/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=9c825a3f-4493-42cb-80ba-746320a5d8e7&resource=https%3A%2F%2Fvault.azure.net"
E1201 15:47:53.284700       1 proxy.go:125] proxy "msg"="failed to get token" "error"="failed to acquire token: unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/v2.0/.well-known/openid-configuration\": context deadline exceeded"

I'm one of the dev owners of a 1st party Azure service and using AKS with Mariner 2.0 as my nodes, Mariner 2.0 as the image of my app and azwi-proxy cannot just keep failing to reach out to login.microsoft.com to obtain MSI credentials, the team providing this solution needs to go further in the investigation and not simply come with 'your side car is having network issues'.

aramase commented 1 year ago

@paulomarquesc If you have steps to consistently repro this, i'll be happy to take a look. If you have access to an environment where this is happening, please feel free to share access on team.

the team providing this solution needs to go further in the investigation and not simply come with 'your side car is having network issues'.

If the issue is happening intermittently and only on some worker nodes it could be a temporary blip on the node. We're running the tests as part of the CI and nightly runs where we haven't observed similar issues.

As I mentioned in the previous comment, the sidecar is only a temporary solution. If your application uses the latest SDK (https://azure.github.io/azure-workload-identity/docs/topics/language-specific-examples/azure-identity-sdk.html), there is no requirement for the sidecar. By dropping the sidecar proxy usage, your workload will directly make calls to login.microsoft.com.

paulomarquesc commented 1 year ago

To close this thread, I converted the first application based on go using the examples provided in the repo and it worked fine using the webhook instead of the sidecar. Thanks @aramase.