Azure / azure-workload-identity

Azure AD Workload Identity uses Kubernetes primitives to associate managed identities for Azure resources and identities in Azure Active Directory (AAD) with pods.
https://azure.github.io/azure-workload-identity

Best Practice to protect node Metadata endpoint when using AZWI #676

Open primeroz opened 1 year ago

primeroz commented 1 year ago

I am testing Azure Workload Identity, which is working as expected, but I am facing an issue I am not sure how to handle.

Is there a best practice for handling this situation?

With the now deprecated AAD Pod Identity, all requests to the metadata server were intercepted, so a pod was not able to assume the node identity (or a role not assigned to it), but with the current implementation of AZWI there is nothing preventing this.

At the moment my only idea is to

I want to understand if this is how the setup is supposed to look, or if I am missing something obvious :)

thanks

aramase commented 1 year ago

With the now deprecated AAD Pod Identity, all requests to the metadata server were intercepted, so a pod was not able to assume the node identity (or a role not assigned to it), but with the current implementation of AZWI there is nothing preventing this.

@primeroz With workload identity federation you don't need to assign the identity to the underlying VMSS. The reason the identities exist on the VMSS today is that components use a user-assigned managed identity to get a token without workload identity federation. AAD Pod Identity tried to solve that problem by intercepting the connections, but that doesn't work for host-network pods. With workload identity federation, your token request is sent directly to AAD, and if you don't have access to the service account or managed identity, you won't get a token. The workload identity model is more secure because it allows using user-assigned managed identities without assigning them to the nodes. The goal is for every component that uses a user-assigned managed identity to eventually switch to workload identity federation.
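
For context, the per-workload setup is roughly the following (a sketch; the names, namespace, client ID, and tenant ID are placeholders, and with webhook versions before v1.0 the `azure.workload.identity/use` label goes on the service account instead of the pod):

```yaml
# Service account annotated with the client ID of the user-assigned managed
# identity (which must have a federated identity credential for this service
# account configured on the Azure side).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-workload
  namespace: my-namespace
  annotations:
    azure.workload.identity/client-id: "<USER_ASSIGNED_CLIENT_ID>"
    azure.workload.identity/tenant-id: "<TENANT_ID>"   # optional
---
# Pod opting in to the mutating webhook, which projects the service account
# token and injects the AZURE_* environment variables for the SDK to use.
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  namespace: my-namespace
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: my-workload
  containers:
    - name: app
      image: <YOUR_IMAGE>
```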

Is there a best practice for handling this situation?

If you don't need all pods to access IMDS, you can set up a network policy to block access.
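
For example, an egress policy along these lines (a sketch, assuming your CNI enforces NetworkPolicy; the namespace is a placeholder) blocks the IMDS endpoint for all pods in a namespace while leaving other egress open:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-imds
  namespace: my-namespace
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # the IMDS endpoint
```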

primeroz commented 1 year ago

Thanks for the info.

you don't need to assign the identity to the underlying VMSS.

I would be happy to do this, but many of the core workloads I need to run don't seem to support WI yet, for example

so I need to run those workloads on a set of nodes with an attached identity and use the AZURE_CREDENTIALS_FILE setting to get them to authenticate.

At the moment I have just relocated those workloads onto the control plane nodes so that my worker nodes have no identity assigned to them.
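
For reference, the scheduling part of that is just a node selector plus a toleration for the control plane taint, roughly like this (a sketch assuming kubeadm-style control plane labels and taints; names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-controller
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-controller
  template:
    metadata:
      labels:
        app: some-controller
    spec:
      # Pin the controller to control plane nodes, which keep a node identity.
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: controller
          image: <CONTROLLER_IMAGE>
```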

Does this make sense to you? Or am I missing something so that I can also run the control plane nodes with no identity and have those workloads authenticate to Azure somehow?

thanks

primeroz commented 1 year ago

Hi @aramase, I just wanted to check whether my reply made sense.

I am still wondering if my current solution of forcing the workloads that need to authenticate with Azure and do not support WI onto the control plane nodes is the way to go, or whether there is something else I can do.

thanks :pray:

aramase commented 1 year ago

I am still wondering if my current solution of forcing the workloads that need to authenticate with Azure and do not support WI onto the control plane nodes is the way to go, or whether there is something else I can do.

@primeroz I would recommend checking with each individual project to see if it can be run on only a subset of nodes. For instance, CSI drivers are node-local processes and need to run on every single node where the disk needs to be attached. If you run those only on control plane nodes, workloads running on non-control-plane nodes that need a PV will fail.

primeroz commented 1 year ago

Thanks. There are already issues on the Azure CSI drivers (for disk and files) to get AZWI support, so :crossed_fingers:

CSI drivers are node-local processes and need to run on every single node where the disk needs to be attached. If you run those only on control plane nodes

The way the disk driver, at least, works is by having a CSI controller that interacts with the cloud to manage volumes, attachments and so on (and that is the one I am forcing onto the control plane), plus a DaemonSet node-local component that does not need to talk to the cloud, so it does not need any permissions.

The reason I am asking this question, though, is that I can't be the only one facing this issue with the many controllers that need to authenticate to Azure, so I am wondering if I am missing something.

The more I look at it, the more I feel like this is the only way to run those controllers (like the CSI controller or the cloud controller) until they support AZWI... but maybe it should be documented somewhere?