Azure / azure-container-networking

Azure Container Networking Solutions for Linux and Windows Containers
MIT License
377 stars 240 forks source link

Race when creating azure-vnet lock directory with permissions #2818

Open QxBytes opened 5 months ago

QxBytes commented 5 months ago

What happened: On the first boot, no CNI binary is on the node, and so k8s creates the /var/run/azure-vnet directory with 0755 permissions automatically because it is a mount part of the azure-cns daemonset. Then the CNI is deployed. The /var/run directory is not preserved between reboots. Then, when the VM reboots, the CNI binary may run before k8s creates the /var/run/azure-vnet directory. When the CNI binary runs first, it creates the directory with 0644 permissions. This causes permission denied errors for the cns. Even if k8s creates/mounts the /var/run/azure-vnet directory later, it will see it already exists and won't recreate the directory with the 0755 permissions.

What you expected to happen:
The CNI binary should create the directory with 0755 permissions.

How to reproduce it:
Reboot the VM with the cns capabilities security context dropping all capabilities (so it doesn't bypass permission checks). There is a chance that the azure-cns pod will get stuck in crash loop backoff.

Orchestrator and Version (e.g. Kubernetes, Docker):

Operating System (Linux/Windows):

Kernel (e.g. uanme -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):

Anything else we need to know?: [Miscellaneous information that will assist in solving the issue.]

rbtr commented 5 months ago
rbtr commented 5 months ago

https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods sounds like it would let the k8s chmod the directory when the CNS Pod mounts it as a Volume?

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

QxBytes commented 3 months ago

Node Rebooting Test added in: #2901

jpayne3506 commented 3 months ago

Backport PR can be created for release/v1.5, but the validate package within release/v1.4 needs to be updated to consume the work. Will leave this open until both have been merged

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

rbtr commented 2 months ago

@QxBytes looks like this was fully backported, is it good to resolve?

jpayne3506 commented 2 months ago

2901 did not get backported to release/v1.4 as the testing relies on the validation package being updated. Leaving this open until that work is complete and all E2E within this release train can validate Linux properly.

https://github.com/Azure/azure-container-networking/blob/73a0919606d6e7f1614e2d61ae176df705a44ae9/.pipelines/singletenancy/aks/e2e-step-template.yaml#L62-L75

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] commented 1 month ago

Issue closed due to inactivity.