Kong / gateway-operator

Kubernetes Operator for Kong Gateways
Apache License 2.0

Strategic merge clobbers some fields #128

Open rainest opened 7 months ago

rainest commented 7 months ago

Problem

We've found that the PodTemplateSpec merge behavior clobbers generated fields it shouldn't (more detail in Background). We want to overwrite generated fields only if the user patch specifies a value for that particular field.

Possible approaches

Use strategicpatch differently

As far as we know, other upstream Kubernetes tools use the strategicpatch library we currently use without the problems we've observed. We could review their implementation to try and determine what we need to do differently.

Previous discussion with API Machinery suggests that using strategicpatch in our code may not be viable, though they did not go into detail as to why.

Use server-side apply

The API Machinery recommendation is to use server-side apply instead of patching client-side. This should avoid the clobbering issues, but presumably requires applying the base resource and the patch in separate calls to the Kubernetes API. That in turn presumably means we'll end up with extra Deployment generations and unnecessary temporary ReplicaSets/Pods. We do not know of a way to tell the upstream controller to wait for a patch before spawning a new-generation ReplicaSet.

https://kubernetes.io/blog/2021/08/06/server-side-apply-ga/#using-server-side-apply-in-a-controller discusses controller use of SSA further.
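
For illustration, a minimal sketch of SSA from a controller via controller-runtime; applyDeployment and the field owner name are invented for the example:

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyDeployment submits the Deployment with server-side apply so the
// API server, not our code, performs the merge. For a typed object the
// TypeMeta must be populated and managedFields left empty.
func applyDeployment(ctx context.Context, c client.Client, deployment *appsv1.Deployment) error {
	return c.Patch(ctx, deployment, client.Apply,
		client.FieldOwner("gateway-operator"), // field manager identity
		client.ForceOwnership,                 // take over conflicting fields
	)
}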

Use JSONPatch

JSONPatch is an alternative patch language. It defines explicit operations (e.g. add, replace) for patch components to avoid clobbering, but AFAIK it is limited in terms of overriding existing values for a specific key. It is also somewhat harder to write than strategic patches.
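
For illustration, a minimal sketch with the evanphx/json-patch library (the operations are made up); note the numeric list indices, which is exactly the limitation mentioned above:

import (
	jsonpatch "github.com/evanphx/json-patch/v5"
)

// applyJSONPatch applies explicit RFC 6902 operations to a serialized
// Deployment. Nothing is merged implicitly, but list entries can only
// be addressed by index, not by a merge key such as a container name.
func applyJSONPatch(deploymentJSON []byte) ([]byte, error) {
	ops := []byte(`[
		{"op": "add", "path": "/spec/template/spec/volumes/-", "value": {"name": "example-cm", "configMap": {"name": "example-cm"}}},
		{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kong/kong-gateway:3.7"}
	]`)
	patch, err := jsonpatch.DecodePatch(ops)
	if err != nil {
		return nil, err
	}
	return patch.Apply(deploymentJSON)
}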

Invoke Kustomize

Kustomize supports strategic patches and manages to apply them without clobbering client-side. We could construct the equivalent of a Kustomize directory containing the base (generated) Deployment and a patch config with the user patch, then invoke Kustomize on it.

Having to construct a fake directory in memory and build the supporting metadata is annoying, but appears to be necessary for this approach based on a cursory review of the Kustomize client code. There doesn't appear to be an immediately obvious way to just ask it to apply a single patch to a single resource.
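
A rough sketch of what that in-memory invocation could look like with the krusty API; the file names and kustomization content are invented for the example:

import (
	"sigs.k8s.io/kustomize/api/krusty"
	"sigs.k8s.io/kustomize/kyaml/filesys"
)

// kustomizeMerge builds a fake kustomization directory in memory and
// runs Kustomize over it, returning the patched Deployment as YAML.
func kustomizeMerge(generatedYAML, patchYAML []byte) ([]byte, error) {
	fSys := filesys.MakeFsInMemory()
	if err := fSys.WriteFile("deployment.yaml", generatedYAML); err != nil {
		return nil, err
	}
	if err := fSys.WriteFile("patch.yaml", patchYAML); err != nil {
		return nil, err
	}
	// The patch target is inferred from the GVK/name in patch.yaml.
	kustomization := []byte(`
resources:
- deployment.yaml
patches:
- path: patch.yaml
`)
	if err := fSys.WriteFile("kustomization.yaml", kustomization); err != nil {
		return nil, err
	}
	resMap, err := krusty.MakeKustomizer(krusty.MakeDefaultOptions()).Run(fSys, ".")
	if err != nil {
		return nil, err
	}
	// Kustomize applies the strategic patch with merge-key semantics.
	return resMap.AsYaml()
}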


Background

We use a wrapper around the API Machinery strategicpatch package to handle applying user patches for the ControlPlane Deployment and for the DataPlane Deployment. This feature takes a user-provided PodTemplateSpec and merges it with the controller-generated PodTemplateSpec, with the goal of letting users add extra fields or override specific generated ones, while leaving other generated fields as-is.
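
For reference, a minimal sketch of what such a client-side strategic merge looks like with the apimachinery library, assuming both specs are marshalled to JSON first; mergePodTemplateSpec is illustrative, not our actual wrapper:

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// mergePodTemplateSpec overlays a user-provided PodTemplateSpec onto
// the generated one using a client-side strategic merge patch.
func mergePodTemplateSpec(generated, userPatch *corev1.PodTemplateSpec) (*corev1.PodTemplateSpec, error) {
	generatedJSON, err := json.Marshal(generated)
	if err != nil {
		return nil, err
	}
	patchJSON, err := json.Marshal(userPatch)
	if err != nil {
		return nil, err
	}
	// The third argument supplies the patchStrategy/patchMergeKey struct
	// tags that are supposed to drive type-aware list merging (e.g.
	// containers merging on "name").
	mergedJSON, err := strategicpatch.StrategicMergePatch(generatedJSON, patchJSON, corev1.PodTemplateSpec{})
	if err != nil {
		return nil, err
	}
	merged := &corev1.PodTemplateSpec{}
	if err := json.Unmarshal(mergedJSON, merged); err != nil {
		return nil, err
	}
	return merged, nil
}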

We've found that these patches do not quite work the same way when applied in our code as they do when applied via kubectl. Merge will sometimes obliterate generated configuration that it should not have. Arrays in particular are apparently merged via index, whereas kubectl will match on a merge key (for example, the name field of an EnvVar) and only replace entries if that key matches.

Workarounds

On the UX end, including generated values in the patch provides a crude workaround for this, but it's not really suitable outside of tests. Users should not need to retrieve generated values and keep track of changes to them to build their patches.

In code, we can perform our own approximation of the type-aware merge by extracting generated fields before merge and backfilling them in if we don't find an override. This adds a fair degree of complexity and does not scale well, since we need to find every problem field and add a backfill handler for it.
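
For example, a hypothetical backfill handler for env vars, matching on Name the way a merge-key-aware merge would:

import corev1 "k8s.io/api/core/v1"

// backfillEnv re-adds generated env vars the merge dropped, unless the
// user explicitly overrode a variable of the same name. Every problem
// field needs its own handler like this, which is why it scales poorly.
func backfillEnv(generated, merged []corev1.EnvVar) []corev1.EnvVar {
	seen := make(map[string]struct{}, len(merged))
	for _, ev := range merged {
		seen[ev.Name] = struct{}{}
	}
	for _, ev := range generated {
		if _, ok := seen[ev.Name]; !ok {
			merged = append(merged, ev)
		}
	}
	return merged
}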

rainest commented 7 months ago

Kong/gateway-operator-archive#1489 has started the "Use strategicpatch differently" approach.

pmalek commented 7 months ago

Started a Slack thread in #sig-api-machinery once again (now with more context): https://kubernetes.slack.com/archives/C0EG7JC6T/p1709303582824119

rainest commented 7 months ago

Notably, you apparently can do SSA client-side, which is a bit of an oxymoron, but:

Alternatives: merging using SSA's logic client-side, that would require you to have the openapi for your schema, SSAing twice

Generate OpenAPI From CRD Definitions: The easiest way is to json.Unmarshal the OpenAPIV3Schema field of the CRD into a kube-openapi spec.Schema.

How could I do this SSA client-side:
1. Create an SSA type converter from the spec.Schema in the step above: https://github.com/kubernetes/kubernetes/blob/4f3859ce911bd308dbf6f617a8c9e2bdf3a1883b/staging/src/k8s.io/apimachinery/pkg/util/managedfields/typeconverter.go#L36-L47
2. Apply the patch using something like this: https://gist.github.com/alexzielenski/580d76934ca6e1e4b0d128f651d70b95
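
A rough sketch of step 1, assuming a recent apimachinery version; the exact NewTypeConverter signature and the definition-name key may differ across versions:

import (
	"encoding/json"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/util/managedfields"
	"k8s.io/kube-openapi/pkg/validation/spec"
)

// typeConverterFromCRD round-trips the CRD's OpenAPIV3Schema through
// JSON into a kube-openapi spec.Schema, then builds the SSA type
// converter from it. "example.def.name" stands in for the real OpenAPI
// definition name of the applied type.
func typeConverterFromCRD(crd *apiextensionsv1.CustomResourceDefinition) (managedfields.TypeConverter, error) {
	raw, err := json.Marshal(crd.Spec.Versions[0].Schema.OpenAPIV3Schema)
	if err != nil {
		return nil, err
	}
	var schema spec.Schema
	if err := json.Unmarshal(raw, &schema); err != nil {
		return nil, err
	}
	return managedfields.NewTypeConverter(map[string]*spec.Schema{"example.def.name": &schema}, false)
}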

programmer04 commented 1 month ago

It's happening for the volumes field too

pmalek commented 1 month ago

It's happening for the volumes field too

@programmer04 can you provide concrete examples where it does not work as expected?

programmer04 commented 1 month ago

Applying something like this

kind: GatewayConfiguration
apiVersion: gateway-operator.konghq.com/v1beta1
metadata:
  name: kong
  namespace: default
spec:
  dataPlaneOptions:
    deployment:
      replicas: 2
      podTemplateSpec:
        spec:
          containers:
            - name: proxy
              image: kong/kong-gateway:3.7
              readinessProbe:
                initialDelaySeconds: 1
                periodSeconds: 1
              volumeMounts:
                - name: example-cm
                  mountPath: /opt/kong/files
          volumes:
            - name: example-cm
              configMap:
                name: example-cm
  controlPlaneOptions:
    deployment:
      podTemplateSpec:
        spec:
          containers:
            - name: controller
              # renovate: datasource=docker versioning=docker
              image: kong/kubernetes-ingress-controller:3.2.3
              readinessProbe:
                initialDelaySeconds: 1
                periodSeconds: 1
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-cm
  namespace: default
data:
  test.txt: |
    Hello World!!!
---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: kong
spec:
  controllerName: konghq.com/gateway-operator
  parametersRef:
    group: gateway-operator.konghq.com
    kind: GatewayConfiguration
    name: kong
    namespace: default
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kong
  namespace: default
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80

results in

[manager] 2024-08-12T13:31:14.339Z - ERROR - Reconciler error - {"controller": "dataplane", "controllerGroup": "gateway-operator.konghq.com", "controllerKind": "DataPlane", "DataPlane": {"name":"kong-m22gb","namespace":"default"}, "namespace": "default", "name": "kong-m22gb", "reconcileID": "d6c0120f-0d82-43e5-b73a-c56083155eb9", "error": "could not build Deployment for DataPlane default/kong-m22gb: failed creating Deployment for DataPlane kong-m22gb: Deployment.apps \"dataplane-kong-m22gb-w87ps\" is invalid: [spec.template.spec.volumes[0].configMap: Forbidden: may not specify more than 1 volume type, spec.template.spec.containers[0].volumeMounts[0].name: Not found: \"example-cm\"]"}

because the Deployment that the operator tries to submit looks like this

{
  "metadata": {
    "generateName": "dataplane-kong-m22gb-",
    "namespace": "default",
    "creationTimestamp": null,
    "labels": {
      "app": "kong-m22gb",
      "gateway-operator.konghq.com/dataplane-deployment-state": "live",
      "gateway-operator.konghq.com/managed-by": "dataplane",
      "gateway-operator.konghq.com/selector": "02c83d63-658c-4929-808c-6d19a25fa4db",
      "konghq.com/gateway-operator": "dataplane"
    },
    "ownerReferences": [
      {
        "apiVersion": "gateway-operator.konghq.com/v1beta1",
        "kind": "DataPlane",
        "name": "kong-m22gb",
        "uid": "aaff6e89-e4ea-4380-8c58-cf64c57a5b99",
        "controller": true
      }
    ],
    "finalizers": [
      "gateway-operator.konghq.com/wait-for-owner"
    ]
  },
  "spec": {
    "replicas": 2,
    "selector": {
      "matchLabels": {
        "app": "kong-m22gb",
        "gateway-operator.konghq.com/selector": "02c83d63-658c-4929-808c-6d19a25fa4db"
      }
    },
    "template": {
      "metadata": {
        "creationTimestamp": null,
        "labels": {
          "app": "kong-m22gb",
          "gateway-operator.konghq.com/selector": "02c83d63-658c-4929-808c-6d19a25fa4db"
        }
      },
      "spec": {
        "volumes": [
          {
            "name": "example-cm",
            "secret": {
              "secretName": "dataplane-kong-m22gb-2sldc",
              "items": [
                {
                  "key": "tls.crt",
                  "path": "tls.crt"
                },
                {
                  "key": "tls.key",
                  "path": "tls.key"
                },
                {
                  "key": "ca.crt",
                  "path": "ca.crt"
                }
              ]
            },
            "configMap": {
              "name": "example-cm",
              "defaultMode": 420
            }
          },
          {
            "name": "cluster-certificate",
            "secret": {
              "secretName": "dataplane-kong-m22gb-2sldc",
              "items": [
                {
                  "key": "tls.crt",
                  "path": "tls.crt"
                },
                {
                  "key": "tls.key",
                  "path": "tls.key"
                },
                {
                  "key": "ca.crt",
                  "path": "ca.crt"
                }
              ]
            }
          }
        ],
        "containers": [
          {
            "name": "proxy",
            "image": "kong/kong-gateway:3.7",
            "ports": [
              {
                "name": "proxy",
                "containerPort": 8000,
                "protocol": "TCP"
              },
              {
                "name": "proxy-ssl",
                "containerPort": 8443,
                "protocol": "TCP"
              },
              {
                "name": "metrics",
                "containerPort": 8100,
                "protocol": "TCP"
              },
              {
                "name": "admin-ssl",
                "containerPort": 8444,
                "protocol": "TCP"
              }
            ],
            "env": [
              {
                "name": "KONG_ADMIN_ACCESS_LOG",
                "value": "/dev/stdout"
              },
              {
                "name": "KONG_ADMIN_ERROR_LOG",
                "value": "/dev/stderr"
              },
              {
                "name": "KONG_ADMIN_GUI_ACCESS_LOG",
                "value": "/dev/stdout"
              },
              {
                "name": "KONG_ADMIN_GUI_ERROR_LOG",
                "value": "/dev/stderr"
              },
              {
                "name": "KONG_ADMIN_LISTEN",
                "value": "0.0.0.0:8444 ssl reuseport backlog=16384"
              },
              {
                "name": "KONG_ADMIN_SSL_CERT",
                "value": "/var/cluster-certificate/tls.crt"
              },
              {
                "name": "KONG_ADMIN_SSL_CERT_KEY",
                "value": "/var/cluster-certificate/tls.key"
              },
              {
                "name": "KONG_CLUSTER_CERT",
                "value": "/var/cluster-certificate/tls.crt"
              },
              {
                "name": "KONG_CLUSTER_CERT_KEY",
                "value": "/var/cluster-certificate/tls.key"
              },
              {
                "name": "KONG_CLUSTER_LISTEN",
                "value": "off"
              },
              {
                "name": "KONG_DATABASE",
                "value": "off"
              },
              {
                "name": "KONG_NGINX_ADMIN_SSL_CLIENT_CERTIFICATE",
                "value": "/var/cluster-certificate/ca.crt"
              },
              {
                "name": "KONG_NGINX_ADMIN_SSL_VERIFY_CLIENT",
                "value": "on"
              },
              {
                "name": "KONG_NGINX_ADMIN_SSL_VERIFY_DEPTH",
                "value": "3"
              },
              {
                "name": "KONG_NGINX_WORKER_PROCESSES",
                "value": "2"
              },
              {
                "name": "KONG_PLUGINS",
                "value": "bundled"
              },
              {
                "name": "KONG_PORTAL_API_ACCESS_LOG",
                "value": "/dev/stdout"
              },
              {
                "name": "KONG_PORTAL_API_ERROR_LOG",
                "value": "/dev/stderr"
              },
              {
                "name": "KONG_PORT_MAPS",
                "value": "80:8000, 443:8443"
              },
              {
                "name": "KONG_PROXY_ACCESS_LOG",
                "value": "/dev/stdout"
              },
              {
                "name": "KONG_PROXY_ERROR_LOG",
                "value": "/dev/stderr"
              },
              {
                "name": "KONG_PROXY_LISTEN",
                "value": "0.0.0.0:8000 reuseport backlog=16384, 0.0.0.0:8443 http2 ssl reuseport backlog=16384"
              },
              {
                "name": "KONG_STATUS_LISTEN",
                "value": "0.0.0.0:8100"
              }
            ],
            "resources": {
              "limits": {
                "cpu": "1",
                "memory": "1000Mi"
              },
              "requests": {
                "cpu": "100m",
                "memory": "20Mi"
              }
            },
            "volumeMounts": [
              {
                "name": "example-cm",
                "readOnly": true,
                "mountPath": "/opt/kong/files"
              },
              {
                "name": "cluster-certificate",
                "readOnly": true,
                "mountPath": "/var/cluster-certificate"
              }
            ],
            "readinessProbe": {
              "httpGet": {
                "path": "/status/ready",
                "port": 8100,
                "scheme": "HTTP"
              },
              "initialDelaySeconds": 5,
              "timeoutSeconds": 1,
              "periodSeconds": 10,
              "successThreshold": 1,
              "failureThreshold": 3
            },
            "lifecycle": {
              "preStop": {
                "exec": {
                  "command": [
                    "/bin/sh",
                    "-c",
                    "kong quit"
                  ]
                }
              }
            },
            "terminationMessagePath": "/dev/termination-log",
            "terminationMessagePolicy": "File",
            "imagePullPolicy": "IfNotPresent"
          }
        ],
        "restartPolicy": "Always",
        "terminationGracePeriodSeconds": 30,
        "dnsPolicy": "ClusterFirst",
        "securityContext": {},
        "schedulerName": "default-scheduler"
      }
    },
    "strategy": {
      "type": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": 0,
        "maxSurge": 1
      }
    },
    "revisionHistoryLimit": 10,
    "progressDeadlineSeconds": 600
  },
  "status": {}
}

The clobbered merge (excerpt):

...
"spec": {
  "volumes": [
    {
      "name": "example-cm",
      "secret": {
        "secretName": "dataplane-kong-m22gb-2sldc",
        "items": [
          {
            "key": "tls.crt",
            "path": "tls.crt"
          },
          {
            "key": "tls.key",
            "path": "tls.key"
          },
          {
            "key": "ca.crt",
            "path": "ca.crt"
          }
        ]
      },
      "configMap": {
        "name": "example-cm",
        "defaultMode": 420
      }
    },
...

example-cm should get its own entry in the spec.volumes array instead of being merged by index into the generated certificate volume; see the related upstream issue.

pmalek commented 1 month ago

I see. I believe #255 could help here, but I'd have to double-check. That PR has been targeted for 1.4, but no one has worked on it since I proposed this (draft) change in May.

I'm happy to talk about it, though.

programmer04 commented 1 month ago

I skimmed through #255 and it looks reasonable. We should definitely fix this.