Open ondrejhlavacek opened 1 year ago
I found the following error. It looks like something is wrong with your DiskEncryptionSet and Key Vault:
I0609 08:34:48.621070 1 azure_armclient.go:291] Received error in WaitForAsyncOperationCompletion: 'Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired" Message="Error while creating storage object https://xxx.z43.blob.storage.azure.net/bzrwqstxnzbf/abcd Target: '/subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a'." Target="36"'
E0609 08:34:48.621262 1 controllerserver.go:440] Attach volume /subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a to instance aks-eck-29741929-vmss000010 failed with Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired" Message="Error while creating storage object https://xxx.z43.blob.storage.azure.net/bzrwqstxnzbf/abcd Target: '/subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a'." Target="36"
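For anyone triaging similar controller logs, here is a rough sketch of pulling the ARM error code and the affected disk name out of lines like the two above. The regular expressions are assumptions based only on these two quoted lines, not on any documented log format of the Azure Disk CSI driver, and the sample line is shortened (the resource group is replaced with `rg`).

```python
import re

# Assumed patterns, inferred from the log lines quoted in this thread.
CODE_RE = re.compile(r'Code="([^"]+)"')
DISK_RE = re.compile(r"/providers/Microsoft\.Compute/disks/([\w-]+)")
# Leading severity letter, MMDD date and time, as seen in the quoted lines.
HEADER_RE = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")


def summarize(line):
    """Return severity, time, error code and disk name, or None if no Code= is present."""
    code = CODE_RE.search(line)
    if code is None:
        return None
    header = HEADER_RE.match(line)
    disk = DISK_RE.search(line)
    return {
        "severity": header.group(1) if header else None,
        "time": header.group(3) if header else None,
        "code": code.group(1),
        "disk": disk.group(1) if disk else None,
    }


# Shortened copy of the attach failure line above (hypothetical resource group "rg").
sample = (
    "E0609 08:34:48.621262 1 controllerserver.go:440] Attach volume "
    "/subscriptions/xxx/resourceGroups/rg/providers/Microsoft.Compute/disks/"
    "pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a to instance "
    "aks-eck-29741929-vmss000010 failed with Retriable: true, RawError: "
    'Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired"'
)
print(summarize(sample))
```

Grouping the results by `code` and `disk` should make it easier to see whether only one PVC hits `KeyVaultAccessTokenCannotBeAcquired` or whether other disks in the cluster are affected too.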
That is interesting, all PVCs in the cluster are using CMK (via Disk Encryption Set) and other PVCs were moved without any issue. Where did you find this error?
I found the error in the logs of the Azure Disk CSI driver controller, which is managed by AKS.
Action required from @Azure/aks-pm
Did the key get deleted by any chance?
Issue needing attention of @Azure/aks-leads
Describe the bug When rotating nodes with the ECK operator (a stateful set with PVCs) during a node pool upgrade, a new pod on a new node gets stuck in the Pending state. The disk is mounted to the node correctly, but nothing happens after that. More details are in the screenshots.
I have performed the same operation on multiple clusters (in the same region) without any issues. The only difference with this cluster is that it uses a Disk Encryption Set with a key in another tenant. However, other pods with PVCs on the affected cluster rotated correctly, so I don't suspect the encryption.
To Reproduce Not sure
Expected behavior The stateful set starts new pods on the new nodes and attaches their volumes.
Screenshots Overview of affected pod events
Full verbatim of some of the logs
Node events
CSI driver logs
Mounts in the CSI driver container (the PVC is not mounted)
Volume is attached to the node
The node is in failed state (if I drain the node to move the pod to a new node the same happens to new node as well)
Node status. The node is in Running state until the pod is scheduled to the instance. Then it is in Updating state and finishes in Failed. Happened to 4 nodes so far with this pod/statefulset. If I do not schedule the pod on the node it will stay in Running state.
Environment (please complete the following information):