Closed dimbleby closed 2 years ago
@dimbleby Thank you for reporting the issue.
This certainly looks like the caching bug as you've confirmed the identity with the same client id exists on the underlying VMSS. Tagging @rkammara12 from the IMDS team.
Thanks. Is there anything we can do to recover this? Or would it be helpful to leave it in this state for debugging?
This is a test cluster, fortunately, but this looks like it would be an ugly bug to hit in production.
@dimbleby It would be helpful to leave it in that state. All the caching issue at IMDS are fixed and deployed, @aramase is there a way to access the cluster and node details so that I can verify if there is a new caching issue.
We are facing the same issue. The identity is assigned successfully to the VMSS (checked in the Azure portal), but the NMI is logging the following lines:
clientID in request: ##### REDACTED #####, **** has been matched with azure identity ****
matched identityType:0 clientid:##### REDACTED ##### resource:https://management.core.windows.net/
failed to get service principal token for pod: **** , error: failed to refresh token, error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"User-assigned managed identity not found on this resource. To request tokens for this identity, please assign it first. Please see aka.ms/ManagedIdentityNotFound for more details."}
status (403) took 93572306 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.0.0.48
@rkammara12 This error looks slightly different from identity not found
. Do you know why this could be happening and what this error User-assigned managed identity not found on this resource
means?
@PSanetra says the identity with the clientID is assigned to the underlying VMSS.
The error is different and observed when the identity information for that resource (in this case VMSS) is not available at MIRP. Can you please provide o/p of "az vmss show" ?
Facing the same issue since yesterday. I've already confirmed that the User Assigned Managed Identity is getting assigned to the underlying VMSS, and it seems to be the correct one in the logs starting with ab12.
I0722 05:42:33.714813 1 server.go:196] status (403) took 63547724 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.240.0.14
W0722 05:43:03.648887 1 server.go:340] parameter resource cannot be empty
I0722 05:43:03.648914 1 server.go:196] status (400) took 49601 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.240.0.14
I0722 05:43:03.650814 1 standard.go:73] no clientID or resourceID in request. default/trading-77944db546-pr98j has been matched with azure identity default/trading
I0722 05:43:03.650988 1 standard.go:179] matched identityType:0 clientid:ab12##### REDACTED #####1fd9 resource:https://database.windows.net
E0722 05:43:03.719474 1 server.go:382] failed to get service principal token for pod: default/trading-77944db546-pr98j, error: failed to refresh token, error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"Identity not found"}
Is there anything I can provide that helps?
Output of az vmss show:
{
"additionalCapabilities": null,
"automaticRepairsPolicy": null,
"doNotRunExtensionsOnOverprovisionedVMs": false,
"extendedLocation": null,
"hostGroup": null,
"id": "/subscriptions/b5766ca7-b234-4caa-b863-db525c6843fc/resourceGroups/RG-Hymon-dev-eu-backend-nodes/providers/Microsoft.Compute/virtualMachineScaleSets/aks-backend-40778374-vmss",
"identity": {
"principalId": null,
"tenantId": null,
"type": "UserAssigned",
"userAssignedIdentities": {
"/subscriptions/b5766ca7-b234-4caa-b863-db525c6843fc/resourceGroups/RG-HYMON-DEV-EU-Backend-NODES/providers/Microsoft.ManagedIdentity/userAssignedIdentities/hymon-dev-EU-agentpool": {
"clientId": "8d524b0d-43b7-49de-b0e7-7ef96b7ec4f1",
"principalId": "9bd93702-6f7d-4861-85aa-963693eb523f"
},
"/subscriptions/b5766ca7-b234-4caa-b863-db525c6843fc/resourcegroups/RG-HYMON-DEV-EU-Backend-NODES/providers/Microsoft.ManagedIdentity/userAssignedIdentities/azurepolicy-hymon-dev-eu": {
"clientId": "24d9d17a-4005-4334-8d87-6b67625379aa",
"principalId": "a3501c7a-3593-4e26-a2da-3ce04e154c76"
},
"/subscriptions/b5766ca7-b234-4caa-b863-db525c6843fc/resourcegroups/RG-HYMON-DEV-EU-Backend-NODES/providers/Microsoft.ManagedIdentity/userAssignedIdentities/omsagent-hymon-dev-eu": {
"clientId": "c0e78e41-e427-4ba5-afad-19d48d4455e6",
"principalId": "19a6dae2-22dc-4c09-9455-054b706501e6"
},
"/subscriptions/b5766ca7-b234-4caa-b863-db525c6843fc/resourcegroups/rg-hymon-dev-eu-backend/providers/microsoft.managedidentity/userassignedidentities/trading-dev-eu": {
"clientId": "4b9f1732-e259-46f9-8a1d-cbb8113a493c",
"principalId": "ab121f9e-402d-44eb-9c85-a0cccd751fd9"
}
}
},
"location": "westeurope",
"name": "aks-backend-40778374-vmss",
"orchestrationMode": null,
"overprovision": false,
"plan": null,
"platformFaultDomainCount": 1,
"provisioningState": "Succeeded",
"proximityPlacementGroup": null,
"resourceGroup": "RG-Hymon-dev-eu-backend-nodes",
"scaleInPolicy": null,
"singlePlacementGroup": false,
"sku": {
"capacity": 1,
"name": "Standard_B2ms",
"tier": "Standard"
},
<REDACTED>
ab12
is the principal ID. Have you tried using 4b9f1732-e259-46f9-8a1d-cbb8113a493c
in your AzureIdentity
?
Shame on me.. Thanks for pointing me on that one.
I've just encountered this and wasted a few hours.
Probably I found a clue:
When the new pod was created and ended up with the token failure, mic yielded the following output:
(Background: the pod in question was the very first application that made use of the identity in the cluster, and that's why mic was trying to provision the identity here. The situation should look different from the original report, but the same should happen when the token gets to its expiry and refresh takes place.)
I0730 13:31:07.992397 1 stats.go:183] ** stats collected **
I0730 13:31:07.992456 1 stats.go:162] Pod listing: 44.2µs
I0730 13:31:07.992466 1 stats.go:162] AzureIdentity listing: 5µs
I0730 13:31:07.992471 1 stats.go:162] AzureIdentityBinding listing: 11.2µs
I0730 13:31:07.992474 1 stats.go:162] AzureAssignedIdentity listing: 1.2µs
I0730 13:31:07.992478 1 stats.go:162] System: 115.9µs
I0730 13:31:07.992482 1 stats.go:162] CacheSync: 4.2µs
I0730 13:31:07.992495 1 stats.go:162] Cloud provider GET: 31.918717ms
I0730 13:31:07.992499 1 stats.go:162] Cloud provider PATCH: 47.724182782s
I0730 13:31:07.992503 1 stats.go:162] AzureAssignedIdentity creation: 19.493211ms
I0730 13:31:07.992507 1 stats.go:162] AzureAssignedIdentity update: 0s
I0730 13:31:07.992510 1 stats.go:162] AzureAssignedIdentity deletion: 0s
I0730 13:31:07.992515 1 stats.go:170] Number of cloud provider PATCH: 1
I0730 13:31:07.992519 1 stats.go:170] Number of cloud provider GET: 1
I0730 13:31:07.992523 1 stats.go:170] Number of AzureAssignedIdentities created in this sync cycle: 1
I0730 13:31:07.992526 1 stats.go:170] Number of AzureAssignedIdentities updated in this sync cycle: 0
I0730 13:31:07.992529 1 stats.go:170] Number of AzureAssignedIdentities deleted in this sync cycle: 0
I0730 13:31:07.992533 1 stats.go:162] Find AzureAssignedIdentities to create: 0s
I0730 13:31:07.992537 1 stats.go:162] Find AzureAssignedIdentities to delete: 0s
I0730 13:31:07.992540 1 stats.go:162] Total time to assign or update AzureAssignedIdentities: 47.798850229s
I0730 13:31:07.992544 1 stats.go:162] Total: 47.799110429s
I0730 13:31:07.992547 1 stats.go:212] *********************
This implies that the MSI assignment to VMSS had been completed on around 13:31:07.
On the other hand, nmi yielded the following:
I0730 13:29:39.275599 1 standard.go:80] clientID in request: bf65##### REDACTED #####256d, default/http-test has been matched with azure identity default/pods-default
I0730 13:29:39.275782 1 standard.go:179] matched identityType:0 clientid:bf65##### REDACTED #####256d resource:https://vault.azure.net
E0730 13:29:39.669452 1 server.go:392] failed to get service principal token for pod: default/http-test, error: failed to refresh token, error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"User-assigned managed identity not found on this resource. To request tokens for this identity, please assign it first. Please see aka.ms/ManagedIdentityNotFound for more details."}
Obviously the request for the token was performed on 13:29:39, before the assignment to the undelying node was completed.
EDIT: I inserted a 180 seconds sleep on start of the application and it did the trick.
@moriyoshi can you please provide the cluster details, resourceId of the VMSS and resourceId of the Identity to quickly confirm.
@rkammara12 I don't think I'm allowed to paste that kind of information here (as being an IC), but I should be able to open a new case, provide those there, and inform you the case number. Does it work for you?
@aramase can you please help getting the information? can we ask @moriyoshi to create a support ticket?
@rkammara12 Yeah, support ticket is the only way to get this information. @moriyoshi Could you create a support ticket with the details?
@aramase @rkammara12 Sure, I'm going to make that happen shortly.
@aramase @rkammara12 The case number is 2108030060002550. Hope this works.
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
still a live issue
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
Still no reason to believe that this has gone away (nor that inactivity will ever cause it to go away).
@dimbleby Are you still facing the identity not found
issue in your cluster? If yes, could you open a support ticket with the resource id info that has the error? We can then forward it to the IMDS team for investigation.
@aramase we have ICM 245096638 open, if you know who to poke to get this moving I'd be grateful!
(We are not hitting it right now; but have hit the same thing again since raising the initial report.)
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
not stale
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
I guess this is my life now
Hey @dimbleby. This is not a pod identity issue and AFAIK you have a support ticket internally for the IMDS team. Tagging @rkammara12 for any updates on the issue.
If you are comfortable that this is not a pod identity issue then please by all means close this out on those grounds.
I'd much rather that a maintainer made a decision, though, than that a bot close this just because I wasn't paying attention.
We're experiencing the same problem.
@pdefreitas What is the problem? Could you provide logs from the NMI pod? Also, could you share the az vmss identity show
output to confirm the identity being used exists on the vmss and the client id matches?
@rkammara12 Are there new issues the IMDS team is currently tracking for "Identity not found"?
Finding exact same issue with aad-pod-identity. My experience is that the issue "might" go away when retrying for ~5 minutes.
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
ghjk
@dimbleby Could you open a separate github issue with the error logs and timestamp?
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
ack
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
yes
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
We still experience this issue sometimes on aad-pod-identity 1.7.4:
Today we have again a lot of the following logs in one of our nmi pods:
E0118 08:45:12.104881 1 server.go:378] failed to get service principal token for pod: **REDACTED_POD**, error: failed to refresh token, error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"User-assigned managed identity not found on this resource. To request tokens for this identity, please assign it first. Please see aka.ms/ManagedIdentityNotFound for more details."}
I0118 08:45:12.104911 1 server.go:192] status (403) took 102038922 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.0.3.30
I0118 08:45:17.108261 1 standard.go:80] clientID in request: **REDACTED_CLIENT_ID**, **REDACTED_POD** has been matched with azure identity **REDACTED_IDENTITY**
I0118 08:45:17.108317 1 standard.go:179] matched identityType:0 clientid:**REDACTED_CLIENT_ID** resource:https://ossrdbms-aad.database.windows.net
E0118 08:45:17.211192 1 server.go:378] failed to get service principal token for pod: **REDACTED_POD**, error: failed to refresh token, error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"User-assigned managed identity not found on this resource. To request tokens for this identity, please assign it first. Please see aka.ms/ManagedIdentityNotFound for more details."}
I0118 08:45:17.211226 1 server.go:192] status (403) took 103376251 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.0.3.30
I0118 08:45:22.217911 1 standard.go:80] clientID in request: **REDACTED_CLIENT_ID**, **REDACTED_POD** has been matched with azure identity **REDACTED_IDENTITY**
I0118 08:45:22.217955 1 standard.go:179] matched identityType:0 clientid:**REDACTED_CLIENT_ID** resource:https://ossrdbms-aad.database.windows.net
[...]
The logs say that the identity is not assigned to the vmss, but the portal and following command azure cli command show it as assigned:
$ az vmss show -g MY-RESOURCE-GROUP -n THE-VMSS --query 'identity.userAssignedIdentities'
{
"/subscriptions/**REDACTED_SUBSCRIPTION_ID**/resourcegroups/**REDACTED_RESOURCE_GROUP**/providers/microsoft.managedidentity/userassignedidentities/**REDACTED_IDENTITY_NAME**": {
"clientId": "**REDACTED_CLIENT_ID**",
"principalId": "**REDACTED_PRINCIPAL_ID**"
},
[...]
}
Is it actually possible to find changes of the user assigned identity assignments to a vmss in any activity log in azure? I could not find these kind of logs when I assigned and unassigned a identity manually.
The error happened during a new deployment of the affected pod, using the userAssignedIdentity, lasted for about 20 minutes and then resolved itself. The old deployment (using a StatefulSet with a single replica) was running fine with the same identity.
@rkammara12 Could you help with the above questions and the error message from IMDS?
@PSanetra can you please help providing the resourceId of the User Assigned Identity and VMSS to check it further. There is an issue in the system where the artifacts on Azure host containing Identity Metadata are not updated which results in delays for getting tokens.
@rkammara12 Can we send you that information to some microsoft email address?
This issue is stale because it has been open 14 days with no activity. Please comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 21 days with no activity. Feel free to re-open if you are experiencing the issue again.
not stale
Have you
Describe the bug
We have an AKS cluster in which we run Jobs that execute in the context of a managed identity, using aad-pod-identity. This has been working just fine for the last 8 weeks, but as of 24 hours or so ago has started failing.
NMI logs look like this:
where the identity
7781##### REDACTED #####3951
is indeed the one that we would like to be using - and it does exist.Steps To Reproduce
Unclear!
Expected behavior
NMI should be able to find and use the identity.
AAD Pod Identity version
1.7.1
Kubernetes version
1.19.7
Additional context
The third of these is the identity that NMI is failing to find.
Restarting the aad-pod-identity pods did not help.
The most similar existing issues that I found were #681 / #606, which were "a caching bug in instance metadata server" - looks as though eventually resolved by a fix rolled out on Azure?
Could something like this be happening again? How do we tell whether this is a bug in the operator or in the platform?
Thanks!