devfile / devworkspace-operator

Allow for configuring the webhook deployment from global DWOC; increase webhook replica count to 2 #1281

Open AObuchow opened 4 days ago

AObuchow commented 4 days ago

What does this PR do?

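This PR makes the webhook deployment configurable (replicas, nodeSelector, tolerations) from the global DWOC and raises the webhook replica count to 2. Because the webhook server is shared by all devworkspaces, these options take effect only when they are specified in the global DWOC.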

Additionally, since the devworkspace-controller-manager is responsible for creating the webhook deployment, the devworkspace-controller-manager pod must be terminated (and automatically re-created by the deployment) for changes to the webhook configuration to take effect.
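A minimal sketch of that restart, for anyone who prefers not to look up the pod name. It assumes the controller runs as a Deployment named devworkspace-controller-manager (inferred from the pod name prefix, not confirmed here), and it uses kubectl rollout restart rather than deleting the pod directly; the effect should be the same, since a fresh controller pod is started. restart_controller_manager is a hypothetical helper name, not part of DWO.

```shell
# Sketch: restart the controller so it re-applies the webhook configuration
# from the global DWOC to the webhook deployment.
# Assumption: the controller Deployment is named devworkspace-controller-manager.
restart_controller_manager() {
  ns="$1"
  # Replace the running controller pod(s) with new ones.
  kubectl rollout restart deployment/devworkspace-controller-manager -n "$ns" &&
    # Block until the new controller pod is ready or the timeout expires.
    kubectl rollout status deployment/devworkspace-controller-manager -n "$ns" --timeout=120s
}

# Usage: restart_controller_manager "$NAMESPACE"
```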

What issues does this PR fix or reference?

Fixes https://github.com/devfile/devworkspace-operator/issues/1272

Is it tested? How?

I recommend following the testing steps below in order, as they were written with this assumption in mind.

To set up for testing, you'll need a multi-node cluster. IIRC, requesting a gcp cluster from cluster bot should provide a multi-node cluster (e.g. launch 4.16 gcp). Minikube can be configured to have multiple nodes with minikube start --nodes <node-count>, e.g. minikube start --nodes 4 && minikube addons enable ingress.

I've pushed a build of DWO with the changes from this PR to quay.io/aobuchow/devworkspace-controller:configurable-webhook for ease of testing.

Once you have your multi-node cluster running with DWO installed, retrieve the list of nodes on the cluster with kubectl get nodes:

NAME           STATUS   ROLES           AGE     VERSION  
minikube       Ready    control-plane   8m10s   v1.30.0  
minikube-m02   Ready    <none>          7m49s   v1.30.0  
minikube-m03   Ready    <none>          7m36s   v1.30.0  
minikube-m04   Ready    <none>          7m23s   v1.30.0

Verifying nodeSelector

  1. Verify which node the devworkspace-webhook-server is currently running on: run kubectl get pod -n $NAMESPACE to find the webhook pod names, then run kubectl get pod devworkspace-webhook-server... -n $NAMESPACE -o jsonpath='{.spec.nodeName}' for each webhook pod. In my case, the pods were scheduled onto nodes minikube-m03 and minikube-m04.

  2. Add a label to the node on which we want the webhook to be deployed: kubectl patch node <node-name> --type='merge' --patch '{"metadata": {"labels": {"my-label": "my-value"}}}'

  3. Modify the webhook configuration in the global DWOC to add a nodeSelector corresponding to the node label we just added: kubectl edit dwoc -n $NAMESPACE

     apiVersion: controller.devfile.io/v1alpha1
     config:
       routing:
         clusterHostSuffix: 192.168.49.2.nip.io
         defaultRoutingClass: basic
    +  webhook:
    +    nodeSelector:
    +      my-label: my-value
         replicas: 2
       workspace:
         imagePullPolicy: Always
     kind: DevWorkspaceOperatorConfig
  4. Terminate the devworkspace-controller-manager pod so that it modifies the webhook deployment based on the new webhook configuration in the DWOC: kubectl delete pod devworkspace-controller-manager-... -n $NAMESPACE

  5. Wait for the old webhook pods to terminate and for the new pods to start up successfully

  6. Verify that the new webhook pods were scheduled on the correct node which had your label applied: kubectl get pod devworkspace-webhook-server... -n $NAMESPACE -o jsonpath='{.spec.nodeName}' for each webhook pod.
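Steps 1 and 6 can be done in one pass instead of one kubectl call per pod. The sketch below assumes the webhook pod names start with devworkspace-webhook-server, as in the commands above; webhook_pod_nodes and webhook_pods_on_node are hypothetical helper names.

```shell
# Prints "<pod-name><TAB><node-name>" for every webhook server pod.
# Assumption: webhook pod names start with devworkspace-webhook-server.
webhook_pod_nodes() {
  ns="$1"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' |
    grep '^devworkspace-webhook-server'
}

# Succeeds only if at least one webhook pod exists and all of them
# are scheduled on the expected node.
webhook_pods_on_node() {
  ns="$1"; node="$2"
  webhook_pod_nodes "$ns" | awk -F '\t' -v want="$node" '
    { seen = 1; if ($2 != want) bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# Usage:
#   webhook_pod_nodes "$NAMESPACE"
#   webhook_pods_on_node "$NAMESPACE" minikube-m02 && echo "all on labelled node"
```

While the old pods are still terminating, the check can fail briefly, so run it after step 5 has completed.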

Verifying tolerations

  1. Taint the node that you applied a label to in the previous step: kubectl taint nodes <name-of-node-with-label> key1=value1:NoExecute. All pods running on the tainted node will be evicted since we applied the NoExecute taint.

The webhook deployment will create pods scheduled onto other available/non-tainted nodes to fulfill the desired number of webhook replicas. However, since we have a nodeSelector targeting the tainted node, an additional webhook-server pod will remain in a pending state as it cannot be scheduled onto the tainted node.

  2. Modify the DWOC to add a toleration that allows the webhook server to be scheduled on the tainted node, then kill the devworkspace-controller-manager pod so that the webhook deployment is updated:

     apiVersion: controller.devfile.io/v1alpha1
     config:
       routing:
         clusterHostSuffix: 192.168.49.2.nip.io
         defaultRoutingClass: basic
       webhook:
         nodeSelector:
           my-label: my-value
         replicas: 2
    +    tolerations:
    +    - effect: NoExecute
    +      key: key1
    +      operator: Equal
    +      value: value1
       workspace:
         imagePullPolicy: Always
     kind: DevWorkspaceOperatorConfig

You should see the webhook server pod that was previously in a pending state enter the running state. The 2 other webhook server replica pods will terminate, and one will be recreated so that both remaining pods are scheduled on the node matching the nodeSelector. Afterwards, only 2 webhook server pods should remain on the cluster, both running on the desired node.
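The pending pod and the toleration edit can also be handled without an editor. This is a sketch under assumptions: the global DWOC name is passed in as an argument (look it up with kubectl get dwoc -n $NAMESPACE), and pending_webhook_pods and add_webhook_toleration are hypothetical helper names. Note that a merge patch replaces the whole tolerations list rather than appending to it.

```shell
# Lists webhook server pods that are stuck in the Pending phase.
pending_webhook_pods() {
  ns="$1"
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' |
    grep '^devworkspace-webhook-server'
}

# Sets the toleration from step 2 on the DWOC with a merge patch.
# Assumption: dwoc_name is the name of the global DWOC in this namespace.
add_webhook_toleration() {
  ns="$1"; dwoc_name="$2"
  kubectl patch dwoc "$dwoc_name" -n "$ns" --type merge --patch \
    '{"config":{"webhook":{"tolerations":[{"effect":"NoExecute","key":"key1","operator":"Equal","value":"value1"}]}}}'
}

# Usage:
#   pending_webhook_pods "$NAMESPACE"
#   add_webhook_toleration "$NAMESPACE" <global-dwoc-name>
```

The devworkspace-controller-manager pod still has to be restarted afterwards for the patched configuration to reach the webhook deployment.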

Verifying replicas

  1. Modify the DWOC to increase the number of webhook server replicas:
     apiVersion: controller.devfile.io/v1alpha1
     config:
       routing:
         clusterHostSuffix: 192.168.49.2.nip.io
         defaultRoutingClass: basic
       webhook:
    +    replicas: 4
       workspace:
         imagePullPolicy: Always
     kind: DevWorkspaceOperatorConfig
  2. Kill the devworkspace-controller-manager pod to have the devworkspace webhook server deployment updated.
  3. Ensure the devworkspace webhook server deployment has the correct number of replicas: `kubectl get deployment devworkspace-webhook-server -n $NAMESPACE -o jsonpath='{.spec.replicas}'`
  4. Optional: try setting the number of webhook server replicas to 0 or a negative number. The CR validation should fail and prevent you from making the edit.
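The replica check in step 3 can be wrapped so that it returns a pass/fail status. This is only a sketch; webhook_replicas_match is a hypothetical helper name, and it compares the desired count (spec.replicas), not the number of pods that are already ready.

```shell
# Succeeds when the webhook deployment's desired replica count
# equals the expected value.
webhook_replicas_match() {
  ns="$1"; expected="$2"
  actual=$(kubectl get deployment devworkspace-webhook-server -n "$ns" \
    -o jsonpath='{.spec.replicas}')
  echo "spec.replicas=$actual (expected $expected)"
  [ "$actual" = "$expected" ]
}

# Usage: webhook_replicas_match "$NAMESPACE" 4
```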

Config logging

When the webhook configuration in the DWOC contains nodeSelector and tolerations entries, the logged config resembles the following:

Updated config to [routing.clusterHostSuffix=192.168.49.2.nip.io,webhook.nodeSelectors=[my-label=my-value, my-label2=my-value2],webhook.tolerations=[&Toleration{Key:key1,Operator:Equal,Value:value1,Effect:NoExecute,TolerationSeconds:nil,}, &Toleration{Key:key2,Operator:Equal,Value:value2,Effect:NoExecute,TolerationSeconds:nil,}],enableExperimentalFeatures=true]

The formatting for Tolerations is a bit awkward but using the Kubernetes implementation of String() seems sufficient, rather than re-implementing it.
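To find this line while testing, something like the following may help. It assumes the message is written by the devworkspace-controller-manager Deployment and contains the text "Updated config to", as in the sample above; latest_config_update is a hypothetical helper name. If the pod has more than one container, kubectl logs may need a -c <container> argument.

```shell
# Prints the most recent config update message from the controller logs.
# Assumptions: the log line contains "Updated config to" and is emitted by
# the devworkspace-controller-manager Deployment.
latest_config_update() {
  ns="$1"
  kubectl logs deployment/devworkspace-controller-manager -n "$ns" |
    grep 'Updated config to' | tail -n 1
}

# Usage: latest_config_update "$NAMESPACE"
```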

openshift-ci[bot] commented 4 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AObuchow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/devfile/devworkspace-operator/blob/main/OWNERS)~~ [AObuchow]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.