erSitzt closed this issue 3 years ago
fokusartikel-v0-9-2-757c9576fb-dz7rv is the correct one belonging to the deployment. The others stick around even if I remove the deployment or scale it.
kapp itself does not create pods directly. all it's doing is creating/updating/deleting the Deployment resource, and it expects Kubernetes's Deployment controller to do the right thing.
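as a quick sanity check (sketch command; the pod name is just the one from your output), a pod's ownerReferences show which controller it belongs to:

kubectl get pod fokusartikel-v0-9-2-757c9576fb-dz7rv -n neo \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{" "}{.metadata.ownerReferences[*].name}{"\n"}'

a pod created via a Deployment should point at a ReplicaSet here; a pod with no ownerReferences is not managed by any controller.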
12:32:38PM: L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-fsdsj (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM: L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dz7rv (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM: L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-dcp8w (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
12:32:38PM: L ongoing: waiting on pod/fokusartikel-v0-9-2-757c9576fb-d5ff7 (v1) namespace: neo
12:32:38PM:     ^ Pending: PodInitializing
given the above progress output and k8s Deployment naming conventions (they all share fokusartikel-v0-9-2-757c9576fb as a prefix, which means they're all part of the same replicaset), it seems you have a deployment that says spec.replicas=4, which is indeed what the progress log above is showing.
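for reference, a rough way to see which replicaset the pods hash to (sketch commands; assumes your pods carry the usual app.kubernetes.io/name label):

kubectl get rs -n neo -l app.kubernetes.io/name=fokusartikel-v0-9-2
kubectl get pods -n neo -l app.kubernetes.io/name=fokusartikel-v0-9-2 -L pod-template-hash

all pods sharing the 757c9576fb pod-template-hash belong to the same replicaset revision.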
fokusartikel-v0-9-2-757c9576fb-dz7rv is the correct one belonging to the deployment
what is the criterion for being "correct" here? are you saying that -fsdsj, -dcp8w, -d5ff7 shouldn't exist? if so you'll have to provide more YAML for us to look at here since to me this all looks normal.
It's a deployment with spec.replicas=1, and this only happens when we update a deployment where the labels changed, so it needs to be recreated. This never once happened using kapp with the same workflow when the labels stay the same. I don't want to blame kapp for doing something wrong, but maybe it is a timing issue when deleting/recreating the deployment, which results in pods that are "orphaned"?
I'll try to get the YAML of the individual pods next time this happens, to see if it differs from the pod shown in the deployment.
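(A quick way to confirm whether the Deployment object itself was deleted and recreated, rather than updated in place, would be to compare its uid before and after the kapp deploy; the command below is just a sketch:)

kubectl get deploy fokusartikel-v0-9-2 -n neo -o jsonpath='{.metadata.uid} {.metadata.creationTimestamp}{"\n"}'

(If the uid changes between deploys, the object really was replaced rather than updated.)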
I think I can provide the yaml that was deployed in one of these cases though.
So this is the deployment.yaml from the logs above...
---
# Source: fokusartikel-v0-9-2/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fokusartikel-v0-9-2
  namespace: neo
  labels:
    app.kubernetes.io/name: fokusartikel-v0-9-2
    helm.sh/chart: fokusartikel-v0-9-2-0.1.0
    app.kubernetes.io/instance: fokusartikel-v0-9-2
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fokusartikel-v0-9-2
      app.kubernetes.io/instance: fokusartikel-v0-9-2
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fokusartikel-v0-9-2
        app.kubernetes.io/instance: fokusartikel-v0-9-2
        app: fokusartikel
        version: v0.9.2
      annotations:
        field.cattle.io/workloadMetrics: '[{"path":"/metrics","port":8888,"schema":"HTTP"}]'
        kapp.k14s.io/deploy-logs: ""
        kapp.k14s.io/deploy-logs-container-names: fokusartikel-v0-9-2
        kapp.k14s.io/update-strategy: "fallback-on-replace"
    spec:
      serviceAccountName: neo
      volumes:
        - name: vault-token
          emptyDir:
            medium: Memory
        - name: rendered-configs
          emptyDir: {}
        - name: vault-config
          configMap:
            name: fokusartikel-v0-9-2-consul-template-configs
            items:
              - key: vault.hcl
                path: vault.hcl
        - name: consul-templates
          configMap:
            name: fokusartikel-v0-9-2-configs
      initContainers:
        # Vault container
        - name: vault-agent-auth
          image: vault
          volumeMounts:
            - name: vault-config
              mountPath: /etc/vault
            - name: vault-token
              mountPath: /home/vault
            - name: consul-templates
              mountPath: /configs
            - name: rendered-configs
              mountPath: /rendered-configs
          env:
            - name: APP_VERSION
              value: v0.9.2
            - name: CLUSTER_ENV
              value: netde-prod
            - name: VAULT_ADDR
              value: https://vault.mydomain.com
            - name: HOME
              value: /home/vault
          args:
            [
              "agent",
              "-config=/etc/vault/vault.hcl",
              "-log-level=debug",
            ]
      containers:
        - name: fokusartikel-v0-9-2
          image: "srv-nexus-docker-registry.mydomain.com/neo/fokusartikel-neo:v0.9.2-kubernetes2-29432"
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8888
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http
          volumeMounts:
            - name: rendered-configs
              mountPath: /usr/src/app/config/env.js
              subPath: usr/src/app/config/env.js
          env:
            - name: APP_VERSION
              value: v0.9.2
            - name: CLUSTER_ENV
              value: netde-prod
            - name: METRIC_DEBUG
              value: "false"
      imagePullSecrets:
        - name: expertreg
i don't see anything unusual about your Deployment. could you include the Deployment YAML output from the cluster (i see that this is output from helm template)? maybe you have something else changing spec.replicas to be 3 (like an HPA).
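e.g. something along these lines (sketch; names taken from your snippet):

kubectl get deploy fokusartikel-v0-9-2 -n neo -o yaml
kubectl get hpa -n neo

the first shows the Deployment exactly as the cluster sees it (including the effective spec.replicas), the second would reveal an HPA that could be scaling it.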
This never once happened using kapp with the same workflow when the labels stay the same. I don't want to blame kapp for doing something wrong, but maybe it is a timing issue when deleting/recreating the deployment, which results in pods that are "orphaned"?
does this happen consistently for you? it can't really be a timing issue since k8s by nature converges resources, so it would notice the disparity of pods not having an "owning" deployment.
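one way to spot pods that lost their owner (sketch; adjust the namespace as needed):

kubectl get pods -n neo -o custom-columns=NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[0].kind,OWNER:.metadata.ownerReferences[0].name

any pod that prints <none> in the owner columns is not managed by a replicaset/deployment and will not be converged or cleaned up by the Deployment controller.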
just to try it out in my env, i've created Deployment [0] and updated its selector, which made kapp recreate it in the next deploy. i saw that old pods were terminated (slightly after new pods came up) by k8s, and new pods got created in parallel.
[0]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: app
spec:
  selector:
    matchLabels:
      simple-app2: ""
  template:
    metadata:
      labels:
        simple-app2: ""
    spec:
      containers:
        - name: simple-app
          image: docker.io/dkalinin/k8s-simple-app@sha256:4c8b96d4fffdfae29258d94a22ae4ad1fe36139d47288b8960d9958d1e63a9d0
          env:
            - name: HELLO_MSG
              value: foo
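for completeness, the test above was driven roughly like this (sketch; the app and file names are just examples):

kapp deploy -a simple-app -f deployment.yml -y

i.e. deploy once, change the simple-app2 selector in the file, then run the same command again so kapp falls back to deleting and recreating the Deployment.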
it started to happen quite often as we started changing our labels... but I'm not sure it happens every time. Some of my deployments already had the annotation for fallback, some didn't, so I added the default option for fallback-on-replace to the cmd. I think after that I saw it for the first time...
btw, unrelated to the above issue: you have kapp.k14s.io/update-strategy: "fallback-on-replace" added to your pod template metadata instead of your Deployment metadata in your above YAML snippet.
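a quick way to check where the annotation actually ended up on the live object (sketch; note the escaped dots in the jsonpath keys):

kubectl get deploy fokusartikel-v0-9-2 -n neo -o jsonpath='{.metadata.annotations.kapp\.k14s\.io/update-strategy}{"\n"}'
kubectl get deploy fokusartikel-v0-9-2 -n neo -o jsonpath='{.spec.template.metadata.annotations.kapp\.k14s\.io/update-strategy}{"\n"}'

kapp reads its update-strategy from the first location (the Deployment's own metadata.annotations); in your snippet it is only set in the second one.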
it started to happen quite often as we started changing our labels... but I'm not sure it happens every time.
hmm, yeah, if it doesn't happen consistently i'm not sure what our next steps would be, since this functionality is in k8s itself. i would be happy to take a look at your environment over zoom if this issue occurs again. alternatively, feel free to dump the deployment, replicaset, and pod yamls via kubectl get pod,rs,deploy -oyaml
and we can take a look at that instead.
Ok, here is the output of the latest occurrence
I have included the YAML of one of the wrong pods... this is what happened in total:
❯ kubectl get deployments.apps -n neo session-v0-9-2
NAME READY UP-TO-DATE AVAILABLE AGE
session-v0-9-2 1/1 1 1 117m
❯ kubectl get rs -n neo | grep session
session-v0-9-2-c8b4cbf9d 1 1 1 117m
❯ kubectl get pods -n neo | grep session
session-v0-9-2-c8b4cbf9d-6pb5z 2/2 Running 2 117m
session-v0-9-2-c8b4cbf9d-jdtjb 2/2 Running 2 117m
session-v0-9-2-c8b4cbf9d-jzrm4 2/2 Running 2 117m
session-v0-9-2-c8b4cbf9d-ngpfx 2/2 Running 1 117m
session-v0-9-2-c8b4cbf9d-rhtft 2/2 Running 1 117m
session-v0-9-2-c8b4cbf9d-t47d5 2/2 Running 1 117m
session-v0-9-2-c8b4cbf9d-xddb9 2/2 Running 2 117m
session-v0-9-2-c8b4cbf9d-zbx98 2/2 Running 1 117m
The "wrong" pods still have the old, longer version labels...
app.kubernetes.io/instance: session-v0-9-2-kubernetes2
and the correct one has the correctly changed label
app.kubernetes.io/instance: session-v0-9-2
Btw. this is the same change in all those deployments: we remove the -kubernetes2 suffix, or something similar, because my template was wrong before.
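A quick way to see that label across all pods of the app in one view (sketch, assuming the labels above):

kubectl get pods -n neo -l app.kubernetes.io/name=session-v0-9-2 -L app.kubernetes.io/instance

The -L flag prints the instance label as an extra column, so the old -kubernetes2 value and the corrected one show up side by side.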
I copied this from wrong-pod-session-...
managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/name: {}
    manager: Go-http-client
    operation: Update
    time: "2020-10-19T10:34:45Z"
note that you have some kind of k8s client ("Go-http-client"; maybe some kind of script or operator, not sure) in your environment messing around with your pod. in this case it modified metadata.labels. interestingly enough, the ownerReferences field was also removed (i don't see it in the wrong pod YAML, unlike in the correct pod YAML). that's what throws off pod ownership.
(as a confirmation that it's not kapp, manager: kapp
only shows up for the Deployment resource.)
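if you want to list just the managers that touched a given pod, something like this works (sketch; the pod name is taken from your output above, and on newer kubectl versions you may need to add --show-managed-fields):

kubectl get pod session-v0-9-2-c8b4cbf9d-6pb5z -n neo -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'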
May this be related? https://github.com/kubernetes/kubernetes/issues/89080
my pipeline is just helm => kustomize => kapp
May this be related? kubernetes/kubernetes#89080
that's not a problem though. that issue just describes somebody being unhappy with the length of managedFields. managedFields tells us who was modifying your resources on the cluster.
my pipeline is just helm => kustomize => kapp
that's just the CLI side of things, but you are running other software on the cluster that modifies things (like istio, calico, etc.). the cool thing about managedFields is that it acts as a record of such modifications, so we know for sure that something is modifying things (just not exactly sure who).
as another confirmation point, you can see that the istio webhook actually saw the original labels before they got changed by something else:
- name: ISTIO_METAJSON_LABELS
  value: |
    {"app":"session","app.kubernetes.io/instance":"session-v0-9-2","app.kubernetes.io/name":"session-v0-9-2","kapp.k14s.io/app":"1602850253830284686","kapp.k14s.io/association":"v1.be85ce0b5dd112f8e421c1dff3eddedf","pod-template-hash":"c8b4cbf9d","version":"v0.9.2"}
i also noticed that in the correct pod you also have a Go-http-client manager (in this case a new label is being added). what is the software that is creating the workloadID_* labels? it's very likely the one that's messing up the other pod, since that's the only manager with a non-unique name.
managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:workloadID_session-v0-9-2-metrics: {}
    manager: Go-http-client
    operation: Update
    time: "2020-10-19T10:34:45Z"
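to see how widespread that label injection is, something like this would list all pods carrying those rancher-added labels (sketch):

kubectl get pods -n neo --show-labels | grep workloadID_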
That could be Rancher, I guess... but I'm not using the Rancher UI to deploy anything, just using the k8s cluster endpoint directly.
These are my deployment steps, basically:
- kustomize create --autodetect --recursive .
- kustomize build | kubeval --ignore-missing-schemas --skip-kinds VirtualService,DestinationRule --force-color
- kustomize build | kapp --color --logs --diff-changes --apply-default-update-strategy fallback-on-replace --wait-timeout 5m -y deploy -n kapp-apps -a ${NS}-${DEPLOYMENTNAME} -f -
Everything else should only be the apiserver?...
Yeah... looks like it's Rancher.
I'm using an annotation to configure the custom metrics:
field.cattle.io/workloadMetrics: '[{"path":"/metrics","port":{{ .Values.service.port }},"schema":"HTTP"}]'
I think that's what's creating f:workloadID_session-v0-9-2-metrics: {}
and maybe messing up stuff?
As this is quite the edge case and it's not causing that much trouble, it might be good enough to leave it as it is, for anyone to find who might be using a similar setup with Rancher and kapp...
Thanks again @cppforlife for your time and effort!
who might be using a similar setup with Rancher and kapp...
to be fully complete, this would happen with any Deployment (kapp is really unrelated here).
Hi,
we had a lot of deployments in kubernetes with labels that needed a correction, so we started to use the annotation "update-strategy" and the cmd arg
--apply-default-update-strategy fallback-on-replace
to avoid lots of those immutable errors. But now I'm seeing a behavior where kapp does not create/update the deployment, but does create additional pods next to the deployment.
`fokusartikel-v0-9-2` is my updated deployment.
And these pods are running after the update:
fokusartikel-v0-9-2-757c9576fb-dz7rv
is the correct one belonging to the deployment. The others stick around even if I remove the deployment or scale it.
This is the kapp-log from my deployment
Target cluster is Kubernetes v1.16.8, but my colleagues had the same issue in v1.18.8.