Instance-per-Pod Admission Webhook (IPP) creates an IaaS instance per Kubernetes Pod to mitigate potential container breakout attacks.
Unlike Kata Containers, IPP can even mitigate CPU vulnerabilities when bare-metal instances (e.g. EC2 `i3.metal`) are used.
- Cluster Autoscaler must be enabled.
- NodeRestriction Admission Controller (or an equivalent) must be enabled. Without NodeRestriction or an equivalent, IPP is not useful, because a compromised node can run privileged Pods on other nodes using the kubelet's credentials. NodeRestriction is enabled by default on typical clusters, including Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). However, it is not enabled on Azure Kubernetes Service (AKS), as of December 2019.
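On self-managed clusters, you can check whether NodeRestriction is enabled by inspecting the API server's admission plugin flag (a sketch; where this flag is set depends on your distribution, e.g. a static Pod manifest for kubeadm clusters):

```
kube-apiserver ... --enable-admission-plugins=NodeRestriction,...
```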
Tested on Google Kubernetes Engine (GKE).
IPP Admission Webhook is implemented using Cluster Autoscaler, Tolerations, Node Affinity, and Pod Anti-Affinity.
See #2 for the design.
Create a GKE node pool with the following configuration:
- Node label: `ipp` = `"true"`
- Node taint: `ipp` = `"true"` (`NO_SCHEDULE` mode)

If you choose to use other label and taint names, you need to modify the YAML in Step 2 accordingly.
Non-GKE clusters should work as well, but not tested.
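As a sketch, such a node pool can be created with `gcloud` (the pool and cluster names here are hypothetical; adjust the autoscaling bounds to your environment):

```
gcloud container node-pools create ipp-pool \
  --cluster=my-cluster \
  --node-labels=ipp=true \
  --node-taints=ipp=true:NoSchedule \
  --enable-autoscaling --min-nodes=0 --max-nodes=10
```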
Install IPP Admission Webhook:

```
docker build -t $IMAGE . && docker push $IMAGE
./ipp.yaml.sh $IMAGE | kubectl apply -f -
```

You can review the YAML before running `kubectl apply`.
Note that the YAML contains `Secret` resources.
Create Pods with various `ipp-class` labels, e.g.:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  labels:
    app: foo
    ipp-class: class0
spec:
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
        ipp-class: class0
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
```
IPP Admission Webhook automatically translates the Pod manifests as follows:

```yaml
apiVersion: v1
kind: Pod
...
spec:
  tolerations:
  - effect: NoSchedule
    key: ipp
    operator: Equal
    value: "true"
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: ipp
            operator: In
            values:
            - "true"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: ipp-class
            operator: NotIn
            values:
            - class0
        topologyKey: kubernetes.io/hostname
...
```
Pods with different `ipp-class` label values are never colocated on the same node.
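The mutation above can be sketched in Python as a function over the Pod manifest (a simplified model for illustration, not the webhook's actual implementation; label and taint names follow the defaults above):

```python
def mutate_pod(pod: dict, ipp_class: str) -> dict:
    """Inject the toleration, node affinity, and pod anti-affinity that
    confine a Pod of the given ipp-class to its own dedicated node."""
    spec = pod.setdefault("spec", {})
    # Tolerate the ipp=true:NoSchedule taint on the dedicated node pool.
    spec.setdefault("tolerations", []).append({
        "effect": "NoSchedule",
        "key": "ipp",
        "operator": "Equal",
        "value": "true",
    })
    affinity = spec.setdefault("affinity", {})
    # Only schedule onto nodes labeled ipp=true.
    affinity["nodeAffinity"] = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [
                    {"key": "ipp", "operator": "In", "values": ["true"]}
                ]
            }]
        }
    }
    # Repel Pods whose ipp-class differs, per node (hostname topology).
    affinity["podAntiAffinity"] = {
        "requiredDuringSchedulingIgnoredDuringExecution": [{
            "labelSelector": {
                "matchExpressions": [
                    {"key": "ipp-class", "operator": "NotIn",
                     "values": [ipp_class]}
                ]
            },
            "topologyKey": "kubernetes.io/hostname",
        }]
    }
    return pod
```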
When the existing node set is not sufficient to satisfy the scheduling constraint, the Cluster Autoscaler automatically adds a node. On GKE, creating a node takes about a minute.
The Cluster Autoscaler also automatically removes idle nodes. On GKE, a node is removed after it has been idle for about 10 minutes.
If it doesn't work as expected, check the log from the IPP Admission Webhook:

```
kubectl logs -f --namespace=ipp-system deployments/ipp
```
To uninstall IPP Admission Webhook:

```
kubectl delete mutatingwebhookconfiguration ipp
kubectl delete namespace ipp-system
```
IPP Admission Webhook does not provide any guarantee for the actual Pod scheduling.
The current implementation of IPP Admission Webhook relies on Pod Anti-Affinity, which doesn't really scale:

> Unfortunately, the current implementation of the affinity predicate in scheduler is about 3 orders of magnitude slower than for all other predicates combined, and it makes CA hardly usable on big clusters.

https://github.com/kubernetes/autoscaler/blob/6ab78a85e19d55bd9c0ff1cb9f9f588a46522d6e/cluster-autoscaler/FAQ.md#what-are-the-service-level-objectives-for-cluster-autoscaler

For large clusters, we should also support an affinity-less mode, which would explicitly call the IaaS API to create and remove dedicated IaaS instances. Actually, an early release of IPP Admission Webhook (v0.0.1) was implemented that way.
IPP Admission Webhook does not mutate DaemonSet Pods, so that system daemon Pods can be colocated with IPP Pods.
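One way a mutating webhook can skip DaemonSet Pods is to inspect the Pod's `ownerReferences` (a sketch of the general idea, not necessarily how IPP implements it):

```python
def is_daemonset_pod(pod: dict) -> bool:
    """Return True if the Pod is controlled by a DaemonSet
    (i.e. a system daemon Pod that should not be mutated)."""
    refs = pod.get("metadata", {}).get("ownerReferences", [])
    return any(
        r.get("kind") == "DaemonSet" and r.get("controller")
        for r in refs
    )
```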