carvel-dev / kapp

kapp is a simple deployment tool focused on the concept of "Kubernetes application" — a set of resources with the same label
https://carvel.dev/kapp
Apache License 2.0

update-strategy fallback-on-replace creates orphaned PODs #152

Closed: erSitzt closed this issue 3 years ago

erSitzt commented 4 years ago

Hi,

we had a lot of deployments in Kubernetes with labels that needed a correction, so we started using the kapp.k14s.io/update-strategy annotation and the command-line argument --apply-default-update-strategy fallback-on-replace to avoid lots of those immutable-field errors.
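For reference, the per-resource opt-in looks roughly like this (a minimal sketch; the name, labels, and image are placeholders, while the annotation itself and the CLI flag above are the actual kapp API):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  annotations:
    # Tells kapp: try a normal update first; if the apiserver rejects it
    # (e.g. an immutable field changed), delete and recreate the resource.
    kapp.k14s.io/update-strategy: fallback-on-replace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: nginx:1.19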

But now I'm seeing a behavior where kapp does not create/update the deployment, but does create additional pods next to the deployment.

`fokusartikel-v0-9-2` is my updated deployment:

feedback-live                 1/1     1            1           113d
fokusartikel-v0-9-2           1/1     1            1           8m33s
fokusartikel-vslive           1/1     1            1           72d

And these pods are running after the update:

feedback-live-78f5c85c6b-v57lg                 2/2     Running   0          72d
fokusartikel-v0-9-2-757c9576fb-d5ff7           2/2     Running   0          8m39s
fokusartikel-v0-9-2-757c9576fb-dcp8w           2/2     Running   0          8m40s
fokusartikel-v0-9-2-757c9576fb-dz7rv           2/2     Running   0          8m38s
fokusartikel-v0-9-2-757c9576fb-fsdsj           2/2     Running   0          8m39s
fokusartikel-vslive-6f8f978bb4-d72jw           2/2     Running   0          72d

fokusartikel-v0-9-2-757c9576fb-dz7rv is the correct one belonging to the deployment.

The others stick around even if I remove the deployment or scale it.

This is the kapp log from my deployment:

$ kustomize build | kapp --color --logs --diff-changes --apply-default-update-strategy fallback-on-replace --wait-timeout 5m -y deploy -n kapp-apps -a ${NS}-${DEPLOYMENTNAME} -f -
Target cluster 'https://10.20.30.40:6443' (nodes: mydomain-k8s-master-001.mydomain-hosting.lan, 6+)
@@ update service/fokusartikel-v0-9-2 (v1) namespace: neo @@
  ...
 10, 10       kapp.k14s.io/association: v1.dda58e6625437dab865dd0cc0015a6ff
     11 +     kapp.k14s.io/update-strategy: fallback-on-replace
 11, 12     name: fokusartikel-v0-9-2
 12, 13     namespace: neo
  ...
 24, 25       app: fokusartikel
 25     -     app.kubernetes.io/instance: fokusartikel-v0-9-2-kubernetes1
 26     -     app.kubernetes.io/name: fokusartikel-v0-9-2-kubernetes1
     26 +     app.kubernetes.io/instance: fokusartikel-v0-9-2
     27 +     app.kubernetes.io/name: fokusartikel-v0-9-2
 27, 28       kapp.k14s.io/app: "1596462641761544424"
 28, 29     type: ClusterIP
@@ update deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo @@
  ...
  7,  7     labels:
  8     -     app.kubernetes.io/instance: fokusartikel-v0-9-2-kubernetes1
      8 +     app.kubernetes.io/instance: fokusartikel-v0-9-2
  9,  9       app.kubernetes.io/managed-by: Helm
 10     -     app.kubernetes.io/name: fokusartikel-v0-9-2-kubernetes1
     10 +     app.kubernetes.io/name: fokusartikel-v0-9-2
 11, 11       helm.sh/chart: fokusartikel-v0-9-2-0.1.0
 12, 12       kapp.k14s.io/app: "1596462641761544424"
  ...
 22, 22       matchLabels:
 23     -       app.kubernetes.io/instance: fokusartikel-v0-9-2-kubernetes1
 24     -       app.kubernetes.io/name: fokusartikel-v0-9-2-kubernetes1
     23 +       app.kubernetes.io/instance: fokusartikel-v0-9-2
     24 +       app.kubernetes.io/name: fokusartikel-v0-9-2
 25, 25         kapp.k14s.io/app: "1596462641761544424"
 26, 26     template:
  ...
 31, 31           kapp.k14s.io/deploy-logs-container-names: fokusartikel-v0-9-2
     32 +         kapp.k14s.io/update-strategy: fallback-on-replace
 32, 33         labels:
 33, 34           app: fokusartikel
 34     -         app.kubernetes.io/instance: fokusartikel-v0-9-2-kubernetes1
 35     -         app.kubernetes.io/name: fokusartikel-v0-9-2-kubernetes1
     35 +         app.kubernetes.io/instance: fokusartikel-v0-9-2
     36 +         app.kubernetes.io/name: fokusartikel-v0-9-2
 36, 37           kapp.k14s.io/app: "1596462641761544424"
 37, 38           kapp.k14s.io/association: v1.f0dd8a9286499852908f3e590e4aad00
  ...
 47, 48             value: "false"
 48     -         image: srv-nexus-docker-registry.mydomain.com/neo/fokusartikel-neo:v0.9.2-kubernetes1-26605
     49 +         image: srv-nexus-docker-registry.mydomain.com/neo/fokusartikel-neo:v0.9.2-kubernetes2-29418
 49, 50           imagePullPolicy: IfNotPresent
 50, 51           livenessProbe:
  ...
 62, 63               port: http
 63     -         resources: {}
 64, 64           volumeMounts:
 65, 65           - mountPath: /usr/src/app/config/env.js
@@ update virtualservice/fokusartikel-v0-9-2 (networking.istio.io/v1alpha3) namespace: neo @@
  ...
 19, 19     - fokusartikel-v0-9-2.neo.svc.mydomain-k8s.local
     20 +   - fokusartikel-v0-9-2-neo.prod.mydomain.com
 20, 21     - fokusartikel-neo.services.mydomain.com
 21, 22     http:
Changes
Namespace  Name                 Kind            Conds.  Age   Op      Op st.               Wait to    Rs  Ri  
neo        fokusartikel-v0-9-2  Deployment      2/2 t   72d   update  fallback on replace  reconcile  ok  -  
^          fokusartikel-v0-9-2  Service         -       132d  update  fallback on replace  reconcile  ok  -  
^          fokusartikel-v0-9-2  VirtualService  -       72d   update  fallback on replace  reconcile  ok  -  
Op:      0 create, 0 delete, 3 update, 0 noop
Wait to: 3 reconcile, 0 delete, 0 noop
12:32:35PM: ---- applying 3 changes [0/3 done] ----
12:32:35PM: update virtualservice/fokusartikel-v0-9-2 (networking.istio.io/v1alpha3) namespace: neo
12:32:35PM: update service/fokusartikel-v0-9-2 (v1) namespace: neo
12:32:35PM: update deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:32:35PM: ---- waiting on 3 changes [0/3 done] ----
12:32:35PM: ok: reconcile virtualservice/fokusartikel-v0-9-2 (networking.istio.io/v1alpha3) namespace: neo
12:32:35PM: ok: reconcile service/fokusartikel-v0-9-2 (v1) namespace: neo
logs | # waiting for 'fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2' logs to become available...
12:32:36PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:32:36PM:  ^ Waiting for generation 2 to be observed
12:32:36PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:32:36PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-68f57bd9db (apps/v1) namespace: neo
12:32:36PM:  L ok: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
12:32:36PM: ---- waiting on 1 changes [2/3 done] ----
logs | # waiting for 'fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2' logs to become available...
logs | # waiting for 'fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2' logs to become available...
12:32:37PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:32:37PM:  ^ Waiting for 1 unavailable replicas
12:32:37PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:32:37PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-68f57bd9db (apps/v1) namespace: neo
12:32:37PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:32:37PM:     ^ Pending: PodInitializing
12:32:37PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:32:37PM:     ^ Pending
12:32:37PM:  L ok: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
logs | # waiting for 'fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2' logs to become available...
12:32:38PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:32:38PM:  ^ Waiting for 1 unavailable replicas
12:32:38PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
12:32:38PM:     ^ Deleting
logs | # starting tailing 'fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2' logs
logs | fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2 | > fokusartikelservice@0.1.0 start /usr/src/app
logs | fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2 | > node --harmony app.js
logs | fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-fsdsj > fokusartikel-v0-9-2 | Server Port: 8888
12:32:54PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:32:54PM:  ^ Waiting for 1 unavailable replicas
12:32:54PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:32:54PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:32:54PM:     ^ Condition Ready is not True (False)
12:32:54PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:32:54PM:     ^ Pending: PodInitializing
12:32:54PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:32:54PM:     ^ Pending: PodInitializing
12:32:54PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:32:54PM:     ^ Pending: PodInitializing
12:32:54PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
12:32:54PM:     ^ Deleting
logs | # starting tailing 'fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2' logs
logs | # starting tailing 'fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2' logs
logs | fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2 | > fokusartikelservice@0.1.0 start /usr/src/app
logs | fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2 | > node --harmony app.js
logs | fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2 | > fokusartikelservice@0.1.0 start /usr/src/app
logs | fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2 | > node --harmony app.js
logs | fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2 | 
logs | # starting tailing 'fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2' logs
logs | fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2 | > fokusartikelservice@0.1.0 start /usr/src/app
logs | fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2 | > node --harmony app.js
logs | fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2 | 
logs | fokusartikel-v0-9-2-757c9576fb-dz7rv > fokusartikel-v0-9-2 | Server Port: 8888
logs | fokusartikel-v0-9-2-757c9576fb-dcp8w > fokusartikel-v0-9-2 | Server Port: 8888
logs | fokusartikel-v0-9-2-757c9576fb-d5ff7 > fokusartikel-v0-9-2 | Server Port: 8888
12:33:11PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:33:11PM:  ^ Waiting for 1 unavailable replicas
12:33:11PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:33:11PM:  L ok: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:33:11PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:33:11PM:     ^ Condition Ready is not True (False)
12:33:11PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:33:11PM:     ^ Condition Ready is not True (False)
12:33:11PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:33:11PM:     ^ Condition Ready is not True (False)
12:33:11PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
12:33:11PM:     ^ Deleting
12:33:14PM: ongoing: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:33:14PM:  ^ Waiting for 1 unavailable replicas
12:33:14PM:  L ok: waiting on replicaset/fokusartikel-v0-9-2-757c9576fb (apps/v1) namespace: neo
12:33:14PM:  L ok: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:33:14PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:33:14PM:     ^ Condition Ready is not True (False)
12:33:14PM:  L ok: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:33:14PM:  L ok: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:33:14PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-68f57bd9db-qrlj9 (v1) namespace: neo
12:33:14PM:     ^ Deleting
12:33:15PM: ok: reconcile deployment/fokusartikel-v0-9-2 (apps/v1) namespace: neo
12:33:15PM: ---- applying complete [3/3 done] ----
12:33:15PM: ---- waiting complete [3/3 done] ----
Succeeded

The target cluster is Kubernetes v1.16.8, but my colleagues had the same issue on v1.18.8.

cppforlife commented 4 years ago

fokusartikel-v0-9-2-757c9576fb-dz7rv is the correct one belonging to the deployment. The others stick around even if I remove the deployment or scale it.

kapp itself does not create pods directly. All it's doing is creating/updating/deleting the Deployment resource, expecting that Kubernetes's Deployment controller does the right thing.

12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM:  L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing

Given the above progress output and k8s Deployment naming conventions (they all share fokusartikel-v0-9-2-757c9576fb as a prefix, which means they are all part of the same ReplicaSet), it seems you have a deployment that says spec.replicas=4, which is indeed what the progress log above is showing.
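A quick way to confirm what the cluster actually holds (a sketch, using the names from this thread):

# Prints the replica count the live Deployment currently requests:
kubectl get deployment fokusartikel-v0-9-2 -n neo -o jsonpath='{.spec.replicas}'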

fokusartikel-v0-9-2-757c9576fb-dz7rv is the correct one belonging to the deployment

What is the criterion for being "correct" here? Are you saying that -fsdsj, -dcp8w, and -d5ff7 shouldn't exist? If so, you'll have to provide more YAML for us to look at here, since to me this all looks normal.

erSitzt commented 4 years ago

It's a deployment with spec.replicas=1, and this only happens when we update a deployment where the labels changed, so it needs to be recreated. This never once happened using kapp with the same workflow when labels stay the same. I don't want to blame this on kapp doing something wrong, but maybe it is a timing issue when deleting/recreating the deployment, which results in pods that are "orphaned"?
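For context on the recreation: spec.selector on an apps/v1 Deployment is immutable, so applying a manifest with changed selector labels is rejected by the apiserver, which is exactly what fallback-on-replace works around by deleting and recreating the resource. A sketch of triggering the rejection by hand (the file name is a placeholder):

# Fails with a "spec.selector ... field is immutable" error from the
# apiserver when the selector differs from the live object:
kubectl apply -f deployment-with-changed-selector.yaml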

I'll try to get the YAML of the individual pods next time this happens, to see if it differs from the pod shown in the deployment.

I think I can provide the YAML that was deployed in one of these cases, though.

erSitzt commented 4 years ago

So this is the deployment.yaml from the logs above...

---
# Source: fokusartikel-v0-9-2/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fokusartikel-v0-9-2
  namespace: neo
  labels:
    app.kubernetes.io/name: fokusartikel-v0-9-2
    helm.sh/chart: fokusartikel-v0-9-2-0.1.0
    app.kubernetes.io/instance: fokusartikel-v0-9-2
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fokusartikel-v0-9-2
      app.kubernetes.io/instance: fokusartikel-v0-9-2
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fokusartikel-v0-9-2
        app.kubernetes.io/instance: fokusartikel-v0-9-2
        app: fokusartikel
        version: v0.9.2
      annotations:
        field.cattle.io/workloadMetrics: '[{"path":"/metrics","port":8888,"schema":"HTTP"}]'  
        kapp.k14s.io/deploy-logs: ""
        kapp.k14s.io/deploy-logs-container-names: fokusartikel-v0-9-2
        kapp.k14s.io/update-strategy: "fallback-on-replace"
    spec:

      serviceAccountName: neo

      volumes:
        - name: vault-token
          emptyDir: 
            medium: Memory     

        - name: rendered-configs
          emptyDir: {}

        - name: vault-config
          configMap:
            name: fokusartikel-v0-9-2-consul-template-configs
            items:
              - key: vault.hcl
                path: vault.hcl

        - name: consul-templates
          configMap:
            name: fokusartikel-v0-9-2-configs

      initContainers:
        # Vault container
        - name: vault-agent-auth
          image: vault

          volumeMounts:
            - name: vault-config
              mountPath: /etc/vault
            - name: vault-token
              mountPath: /home/vault
            - name: consul-templates
              mountPath: /configs
            - name: rendered-configs
              mountPath: /rendered-configs
          env:
            - name: APP_VERSION
              value: v0.9.2
            - name: CLUSTER_ENV
              value: netde-prod
            - name: VAULT_ADDR
              value: https://vault.mydomain.com
            - name: HOME
              value: /home/vault
          args:
            [
              "agent",
              "-config=/etc/vault/vault.hcl",
              "-log-level=debug",
            ]

      containers:         
        - name: fokusartikel-v0-9-2
          image: "srv-nexus-docker-registry.mydomain.com/neo/fokusartikel-neo:v0.9.2-kubernetes2-29432"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8888
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http

          volumeMounts:
          - name: rendered-configs
            mountPath: /usr/src/app/config/env.js
            subPath: usr/src/app/config/env.js

          env:
            - name: APP_VERSION
              value: v0.9.2
            - name: CLUSTER_ENV
              value: netde-prod
            - name: METRIC_DEBUG
              value: "false"

      imagePullSecrets:
        - name: expertreg
cppforlife commented 4 years ago

I don't see anything unusual about your Deployment. Could you include the Deployment YAML output from the cluster (I see that this is output from helm template)? Maybe you have something else changing spec.replicas (like an HPA).
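One way to check for such an actor (a sketch; names are from this thread):

# Any HPA in the namespace, plus recent events touching the Deployment:
kubectl get hpa -n neo
kubectl get events -n neo --field-selector involvedObject.name=fokusartikel-v0-9-2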

This never once happened using kapp with the same workflow when labels stay the same. I don't want to blame this on kapp doing something wrong, but maybe it is a timing issue when deleting/recreating the deployment, which results in pods that are "orphaned"?

Does this happen consistently for you? It can't really be a timing issue, since k8s is by nature converging resources, so it would notice the disparity of pods not having an "owning" deployment.
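A direct way to spot such orphans is to list each pod alongside its owner; an empty OWNER column means no controller manages that pod (a sketch):

kubectl get pods -n neo -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'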

Just to try it out in my env, I've created Deployment [0] and updated its selector, which made kapp recreate it in the next deploy. I saw that the old pods were terminated (slightly after the new pods came up) by k8s, and in parallel the new pods got created.

[0]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: app
spec:
  selector:
    matchLabels:
      simple-app2: ""
  template:
    metadata:
      labels:
        simple-app2: ""
    spec:
      containers:
      - name: simple-app
        image: docker.io/dkalinin/k8s-simple-app@sha256:4c8b96d4fffdfae29258d94a22ae4ad1fe36139d47288b8960d9958d1e63a9d0
        env:
        - name: HELLO_MSG
          value: foo
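The selector update in the second deploy was along these lines (a sketch; the new label key is illustrative):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: app
spec:
  # spec.selector is immutable, so this change makes a plain update fail
  # and kapp (with fallback-on-replace) deletes and recreates the Deployment.
  selector:
    matchLabels:
      simple-app3: ""
  template:
    metadata:
      labels:
        simple-app3: ""
    spec:
      containers:
      - name: simple-app
        image: docker.io/dkalinin/k8s-simple-app@sha256:4c8b96d4fffdfae29258d94a22ae4ad1fe36139d47288b8960d9958d1e63a9d0
        env:
        - name: HELLO_MSG
          value: foo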
erSitzt commented 4 years ago

It started to happen quite often as we started changing our labels... but I'm not sure it happens every time. Some of my deployments already had the annotation for fallback, some didn't, so I added the default option for fallback-on-replace to the command. I think after that I saw it for the first time...

cppforlife commented 4 years ago

Btw, unrelated to the above issue: you have kapp.k14s.io/update-strategy: "fallback-on-replace" added to your pod template metadata instead of your Deployment metadata in your above YAML snippet.

cppforlife commented 4 years ago

It started to happen quite often as we started changing our labels... but I'm not sure it happens every time.

Hmm, yeah, if it doesn't happen consistently I'm not sure what our next steps would be, since this functionality is in k8s itself. I would be happy to take a look at your environment over Zoom if this issue occurs again. Alternatively, feel free to dump the deployment, replicaset, and pod YAMLs via kubectl get pod,rs,deploy -oyaml and we can take a look at that instead.

erSitzt commented 3 years ago

OK, here is the output of the latest occurrence:

deployment-session-v0-9-2.zip

I have included the YAML of one of the wrong pods... this is what happened in total:

❯ kubectl get deployments.apps -n neo session-v0-9-2
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
session-v0-9-2   1/1     1            1           117m
❯ kubectl get rs -n neo | grep session
session-v0-9-2-c8b4cbf9d               1         1         1       117m
❯ kubectl get pods -n neo | grep session
session-v0-9-2-c8b4cbf9d-6pb5z               2/2     Running   2          117m
session-v0-9-2-c8b4cbf9d-jdtjb               2/2     Running   2          117m
session-v0-9-2-c8b4cbf9d-jzrm4               2/2     Running   2          117m
session-v0-9-2-c8b4cbf9d-ngpfx               2/2     Running   1          117m
session-v0-9-2-c8b4cbf9d-rhtft               2/2     Running   1          117m
session-v0-9-2-c8b4cbf9d-t47d5               2/2     Running   1          117m
session-v0-9-2-c8b4cbf9d-xddb9               2/2     Running   2          117m
session-v0-9-2-c8b4cbf9d-zbx98               2/2     Running   1          117m
erSitzt commented 3 years ago

The "wrong" pods still have the old, longer version labels... app.kubernetes.io/instance: session-v0-9-2-kubernetes2

and the correct one has the corrected label app.kubernetes.io/instance: session-v0-9-2

Btw, this is the same change in all those deployments: we remove the -kubernetes2 suffix (or something similar) because my template was wrong before.

cppforlife commented 3 years ago

I copied this from wrong-pod-session-...

  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/name: {}
    manager: Go-http-client
    operation: Update
    time: "2020-10-19T10:34:45Z"

Note that you have some kind of k8s client ("Go-http-client"; maybe some kind of script or operator, not sure) in your environment messing around with your pod. In this case it modified metadata.labels. Interestingly enough, the ownerReferences field was also removed (I don't see it in the wrong pod's YAML, unlike in the correct pod's YAML). That's what throws off pod ownership.

(As a confirmation that it's not kapp: manager: kapp only shows up for the Deployment resource.)
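For anyone retracing this on their own cluster: the managedFields stanza comes from the live object and can be inspected with something like the following (the pod name is a placeholder; kubectl 1.21+ hides managedFields unless explicitly requested, older versions show it by default):

kubectl get pod <pod-name> -n neo -o yaml --show-managed-fields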

erSitzt commented 3 years ago

May this be related? https://github.com/kubernetes/kubernetes/issues/89080

erSitzt commented 3 years ago

My pipeline is just helm => kustomize => kapp.

cppforlife commented 3 years ago

May this be related? kubernetes/kubernetes#89080

That's not a problem, though. That issue just describes somebody being unhappy with the length of managedFields. managedFields tells us who was modifying your resources on the cluster.

My pipeline is just helm => kustomize => kapp.

That's just the CLI side of things, but you are running other software on the cluster that modifies things (like istio, calico, etc.). The cool thing about managedFields is that it acts as a record of such modifications, so we know for sure that something is modifying things (just not exactly who).

As another confirmation point, you can see that the istio webhook actually saw the original labels before they got changed by something else:

    - name: ISTIO_METAJSON_LABELS
      value: |
        {"app":"session","app.kubernetes.io/instance":"session-v0-9-2","app.kubernetes.io/name":"session-v0-9-2","kapp.k14s.io/app":"1602850253830284686","kapp.k14s.io/association":"v1.be85ce0b5dd112f8e421c1dff3eddedf","pod-template-hash":"c8b4cbf9d","version":"v0.9.2"}

I also noticed that in the correct pod you also have the Go-http-client manager (in this case a new label is being added). What is the software that is creating the workloadID_* labels? It's very likely the one that's messing up the other pod, since that's the only manager with a generic, non-unique name.

  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:workloadID_session-v0-9-2-metrics: {}
    manager: Go-http-client
    operation: Update
    time: "2020-10-19T10:34:45Z"
erSitzt commented 3 years ago

That could be Rancher, I guess... but I'm not using the Rancher UI to deploy anything, just the k8s cluster endpoint directly.

These are my deployment steps, basically:

- kustomize create --autodetect --recursive .
- kustomize build | kubeval --ignore-missing-schemas --skip-kinds VirtualService,DestinationRule --force-color
- kustomize build | kapp --color --logs --diff-changes --apply-default-update-strategy fallback-on-replace --wait-timeout 5m -y deploy -n kapp-apps -a ${NS}-${DEPLOYMENTNAME} -f -

Everything else should only be the apiserver?...

erSitzt commented 3 years ago

Yeah... looks like it's Rancher. I'm using an annotation to configure the custom metrics: field.cattle.io/workloadMetrics: '[{"path":"/metrics","port":{{ .Values.service.port }},"schema":"HTTP"}]'. I think that's what's creating f:workloadID_session-v0-9-2-metrics: {} and maybe messing up stuff?

erSitzt commented 3 years ago

As this is quite an edge case and it's not causing that much trouble, it might be good enough to leave it as it is, for anyone to find who might be using a similar setup with Rancher and kapp...

Thanks again @cppforlife for your time and effort!

cppforlife commented 3 years ago

who might be using a similar setup with rancher and kapp...

To be fully complete: this would happen with any Deployment (kapp here is really unrelated).