adfinis / openshift-etcd-backup

CronJob to perform an etcd backup inside an OpenShift cluster
GNU Affero General Public License v3.0

Pod Placement #51

Closed ggrames closed 9 months ago

ggrames commented 1 year ago

Hi,

does this solution also work for OpenShift 4.9 and higher? I have some problems concerning pod placement: x node(s) didn't match Pod's node affinity/selector. This results in a pending Job instance.

Thank you for the info

tongpu commented 1 year ago

I just checked one of our clusters and the labels and tolerations match the configuration in backup-cronjob.yaml.

Could you provide the output of oc get nodes --show-labels so that we can compare it?

ggrames commented 1 year ago

Sorry for the delay

ocp-compute-01.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   allow-kafka-broker=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-01.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-compute-02.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   allow-kafka-broker=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-02.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-compute-03.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   allow-kafka-broker=true,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-03.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-compute-04.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-04.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-compute-05.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-05.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-compute-06.my.domain.at   Ready    worker   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-compute-06.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
ocp-control-01.my.domain.at   Ready    master   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-control-01.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
ocp-control-02.my.domain.at   Ready    master   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-control-02.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
ocp-control-03.my.domain.at   Ready    master   2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-control-03.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
ocp-infra-01.my.domain.at     Ready    infra    2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-infra-01.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node.openshift.io/os_id=rhcos
ocp-infra-02.my.domain.at     Ready    infra    2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-infra-02.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node.openshift.io/os_id=rhcos
ocp-infra-03.my.domain.at     Ready    infra    2y298d   v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-infra-03.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node.openshift.io/os_id=rhcos
ocp-infra-04.my.domain.at     Ready    infra    8d       v1.22.8+f34b40c   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,etcdbackup=allowed,kubernetes.io/arch=amd64,kubernetes.io/hostname=ocp-infra-04.my.domain.at,kubernetes.io/os=linux,node-role.kubernetes.io/infra=,node.openshift.io/os_id=rhcos
tongpu commented 1 year ago

Looking at the output, the node-role.kubernetes.io/master= label is present on all of the ocp-control-* nodes, so that should be fine. What is the exact error message you see in the events?
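
In case it helps, a command along these lines should list only the scheduling-related events (the namespace is a placeholder for wherever the CronJob runs):

# List recent FailedScheduling events in the CronJob's namespace (adjust the namespace as needed)
oc get events -n <cronjob-namespace> --field-selector reason=FailedScheduling --sort-by=.lastTimestamp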

ggrames commented 1 year ago

24s Warning FailedScheduling pod/etcd-backup-manual-2023-03-02-10-10-47--1-6krt9 0/13 nodes are available: 13 node(s) didn't match Pod's node affinity/selector.

ggrames commented 1 year ago

Maybe there are more restrictions in the 4.9 version of the cluster than in 4.7.

tongpu commented 1 year ago

Can you paste the full YAML of the pod etcd-backup-manual-2023-03-02-10-10-47--1-6krt9? It should be possible with oc get pod/etcd-backup-manual-2023-03-02-10-10-47--1-6krt9 -o yaml.

ggrames commented 1 year ago

apiVersion: v1
kind: Pod
metadata:
  annotations:
    openshift.io/scc: privileged
  creationTimestamp: "2023-03-02T09:10:52Z"
  generateName: etcd-backup-manual-2023-03-02-10-10-47--1-
  labels:
    controller-uid: 5fa59092-3471-4594-9800-2367542578ab
    job-name: etcd-backup-manual-2023-03-02-10-10-47
  name: etcd-backup-manual-2023-03-02-10-10-47--1-6krt9
  namespace: infra-services
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: etcd-backup-manual-2023-03-02-10-10-47
    uid: 5fa59092-3471-4594-9800-2367542578ab
  resourceVersion: "1381746391"
  uid: b0aa7e9d-da92-4dd0-9169-841cfad575d3
spec:
  containers:
  - command:
    - /bin/sh
    - /usr/local/bin/backup.sh
    envFrom:
    - configMapRef:
        name: backup-config
    image: ghcr.io/adfinis/openshift-etcd-backup
    imagePullPolicy: Always
    name: backup-etcd
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: 500m
        memory: 128Mi
    securityContext:
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      name: host
    - mountPath: /backup
      name: volume-backup
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xsb2x
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  hostPID: true
  imagePullSecrets:
  - name: etcd-backup-dockercfg-62t5j
  nodeSelector:
    node-role.kubernetes.io/master: ""
    node-role.kubernetes.io/worker: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: etcd-backup
  serviceAccountName: etcd-backup
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: host
  - name: volume-backup
    persistentVolumeClaim:
      claimName: etcd-backup-pvc
  - name: kube-api-access-xsb2x
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-03-02T09:10:52Z"
    message: '0/13 nodes are available: 13 node(s) didn''t match Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
tongpu commented 1 year ago

Looking at the pod YAML, I see that you have two node selectors that contradict each other:

nodeSelector:
  node-role.kubernetes.io/master: ""
  node-role.kubernetes.io/worker: ""

I would assume that you're starting the CronJob in a namespace which has an openshift.io/node-selector annotation, which adds the node selector for the worker nodes:

apiVersion: v1
kind: Namespace
metadata:
  name: example
  annotations:
    openshift.io/node-selector: node-role.kubernetes.io/worker=""
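
If that's the case, one way to check and, if needed, override the project-wide selector would be something along these lines (the commands assume the infra-services namespace from the pod YAML above; please verify before applying):

# Show the annotations on the namespace, including openshift.io/node-selector
oc get namespace infra-services -o jsonpath='{.metadata.annotations}'

# Setting the annotation to an empty value disables the project node selector,
# so the CronJob's own nodeSelector (master nodes) can take effect
oc annotate namespace infra-services openshift.io/node-selector= --overwrite

Alternatively, the CronJob could be deployed in a namespace that doesn't set a worker node selector.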
ggrames commented 1 year ago

OK, thank you, I will give it a try. I will be able to test this on Monday.

tongpu commented 1 year ago

Any feedback on this? Was I able to guide you to a fix for your problem?

ggrames commented 1 year ago

Hi, at the moment it is still not working. But maybe it is a general problem in my cluster, because GitOps pods also have pod placement problems. I have a question open with Red Hat. I will keep you informed. Thank you for now.