Azure / azure-sdk-tools

Tools repository leveraged by the Azure SDK team.
MIT License

Investigate Expired Managed Service Identity for `livevalidatePPE` #6799

Closed by scbedd 1 year ago

scbedd commented 1 year ago

The resource in question

The usual symptom of an expired identity or associated credential is that the cluster can't pull images to spin up new pods.

We need to investigate why the pulls are failing when none of the credentials have expired.

scbedd commented 1 year ago

Talked with @weshaggard a bit today. The way our AKS setup works is that there are virtual machine scale sets that the AKS cluster manages automatically. We opened those up, then checked under Identity.

The core of the issue for PPE is that the agent pools had lost the agentpool identity that was supposed to be assigned to them. Because of this, the pools couldn't talk to the ACR while spinning up. Re-adding the identity, then waiting through a bit of a crash loop, got everything working again.
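
For next time, here is a rough sketch of how the "is the agentpool identity still attached?" check could be scripted instead of clicking through the portal. It is not something we ran as part of this fix. It assumes the input is the identity block of a scale set in the shape I'd expect `az vmss identity show` to print (a `userAssignedIdentities` map keyed by resource ID), and every name and resource ID below is a made-up placeholder.

```python
import json
from typing import Optional


def has_identity(vmss_identity: Optional[dict], expected_resource_id: str) -> bool:
    """Return True if the expected user-assigned identity is attached.

    vmss_identity is the parsed identity block of a scale set. It is None
    or empty when the scale set has no identity at all.
    """
    assigned = (vmss_identity or {}).get("userAssignedIdentities") or {}
    # Azure resource IDs are case-insensitive, so compare in lowercase.
    expected = expected_resource_id.lower()
    return any(key.lower() == expected for key in assigned)


# Hypothetical resource ID of the agentpool identity (placeholder values).
AGENTPOOL_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/MC_example_rg/providers"
    "/Microsoft.ManagedIdentity/userAssignedIdentities/example-agentpool"
)

# Assumed shape of a healthy scale set: the identity is in the map.
healthy = json.loads(json.dumps({
    "type": "UserAssigned",
    "userAssignedIdentities": {AGENTPOOL_ID: {"clientId": "x", "principalId": "y"}},
}))

# Assumed shape of the broken state described above: the map is empty.
broken = {"type": "None", "userAssignedIdentities": {}}

print(has_identity(healthy, AGENTPOOL_ID))  # identity present
print(has_identity(broken, AGENTPOOL_ID))   # identity missing
```

If the check comes back false for a pool, the manual fix is the same as above: re-add the identity to the scale set and give the pods some time to stop crash looping.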