Closed: ddelange closed this issue 1 year ago
Correction: the cluster boots successfully with my dual-architecture setup, but I think that's also a bug: with that setup, the `prmr` pods end up with only one node affinity, so there will still be no pods scheduled onto the arm nodes:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
```
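For comparison, with both architectures respected I'd expect the generated affinity to list both values in the single `In` match expression, roughly like this (a sketch of the expected output, not what the operator currently produces):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64
```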
When I switch the order in `spec.architecture`, we're back at the original issue, where the temporary pod doesn't schedule. It seems that only the first entry in `spec.architecture` is respected:

```
0/13 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 8 node(s) had taint {arch: arm64}, that the pod didn't tolerate.
```
FWIW, the original implementation raised the error "Only one architecture type is supported currently", but that check seems to have disappeared before/during/after that PR was merged.
So I now hacked our toleration

```yaml
- effect: NoSchedule
  key: arch
  operator: Equal
  value: arm64
```

into the spec of the temporary pod, and the cluster now boots successfully on arm.
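In context, the temporary pod's `tolerations` list then looks roughly like this (a sketch: the `arch` entry is the hand-added one; the `NoExecute` entries are what the operator already generates, as seen in the full pod dump below):

```yaml
spec:
  tolerations:
  # hand-added so the id pod can land on the tainted arm nodes
  - effect: NoSchedule
    key: arch
    operator: Equal
    value: arm64
  # generated by the operator (unchanged)
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 5
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 5
```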
opened https://github.com/arangodb/kube-arangodb/issues/1140
cc @jwierzbo from https://github.com/arangodb/arangodb-docker/issues/53#issuecomment-1256284798
I guess this issue should live in kube-arangodb actually 😅 should I close and re-open there?
Yes, I think as long as our Docker container doesn't do anything wrong on this topic, this issue doesn't belong here, and it should be discussed inside of the http://github.com/arangodb/kube-arangodb repo. The multiarch support of the container should be working properly as of ArangoDB 3.10, right?
Yes, all green so far! I'll re-open there.
closing in favor of https://github.com/arangodb/kube-arangodb/issues/1141
Hi! :wave:
I'm trying to deploy an ArangoDeployment to a mixed amd/arm cluster, where the arm nodes have a NoSchedule taint.
When I boot a fresh cluster with `spec.architecture: [arm64]`, a temporary `arangodb-cluster-id-xxx` pod is created, which has a hard node affinity for arm64 but no way to add tolerations or a nodeSelector. This means the pod won't be scheduled on any node, and the cluster boot sequence hangs.

When I change to `spec.architecture: [amd64, arm64]`, the temporary pod is allowed to schedule on an amd node in the cluster, ~and the rest of the pods will have the toleration for arm and so will schedule on the arm nodes~ (as we also have a PreferNoSchedule taint on the amd nodes). This is an acceptable workaround for now, but I'd rather specify `spec.architecture: [arm64]` and get a successful boot.

Pod spec of `arangodb-cluster-id-162f0e`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-09-23T07:48:45Z"
  labels:
    app: arangodb
    arango_deployment: arangodb-cluster
    deployment.arangodb.com/member: 162f0e
    role: id
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
          f:arango_deployment: {}
          f:deployment.arangodb.com/member: {}
          f:role: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"6054f633-c606-4f75-b7ee-a28430e1b952"}: {}
      f:spec:
        f:affinity:
          .: {}
          f:nodeAffinity:
            .: {}
            f:requiredDuringSchedulingIgnoredDuringExecution: {}
          f:podAntiAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"server"}:
            .: {}
            f:command: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":8529,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:name: {}
                f:protocol: {}
            f:resources: {}
            f:securityContext:
              .: {}
              f:capabilities:
                .: {}
                f:drop: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/data"}:
                .: {}
                f:mountPath: {}
                f:name: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:hostname: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:subdomain: {}
        f:terminationGracePeriodSeconds: {}
        f:tolerations: {}
        f:volumes:
          .: {}
          k:{"name":"arangod-data"}:
            .: {}
            f:emptyDir: {}
            f:name: {}
    manager: arangodb_operator
    operation: Update
    time: "2022-09-23T07:48:45Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .: {}
          k:{"type":"PodScheduled"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
    manager: kube-scheduler
    operation: Update
    subresource: status
    time: "2022-09-23T07:48:45Z"
  name: arangodb-cluster-id-162f0e
  namespace: aa-data-api
  ownerReferences:
  - apiVersion: database.arangodb.com/v1
    controller: true
    kind: ArangoDeployment
    name: arangodb-cluster
    uid: 6054f633-c606-4f75-b7ee-a28430e1b952
  resourceVersion: "109818868"
  uid: cf969e3d-6bc1-4395-a774-070b6b629c1e
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app: arangodb
              arango_deployment: arangodb-cluster
              role: id
          topologyKey: kubernetes.io/hostname
        weight: 1
  containers:
  - command:
    - /usr/sbin/arangod
    - --server.authentication=false
    - --server.endpoint=tcp://[::]:8529
    - --database.directory=/data
    - --log.output=+
    image: arangodb/arangodb-preview:3.10.0-beta.1
    imagePullPolicy: IfNotPresent
    name: server
    ports:
    - containerPort: 8529
      name: server
      protocol: TCP
    resources: {}
    securityContext:
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /data
      name: arangod-data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-mzlkl
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: arangodb-cluster-id-162f0e
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: arangodb-cluster-int
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 5
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 5
  - effect: NoExecute
    key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 5
  volumes:
  - emptyDir: {}
    name: arangod-data
  - name: kube-api-access-mzlkl
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-09-23T07:48:45Z"
    message: '0/22 nodes are available: 17 node(s) had taint {arch: arm64}, that the pod didn''t tolerate, 2 node(s) didn''t match Pod''s node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort
```
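To make the scheduling failure concrete, here is a minimal Python sketch (not the real kube-scheduler, just the two rules at play here) of the feasibility check: a node is feasible only if the pod tolerates all of the node's taints *and* the node matches the pod's required node affinity. The node counts mirror the `0/22 nodes are available` event above; the node/pod dicts are simplified stand-ins, not Kubernetes API objects.

```python
def tolerates(pod_tolerations, taint):
    """True if any toleration matches the taint (simplified Equal/Exists semantics)."""
    for t in pod_tolerations:
        if t.get("key") == taint["key"] and (
            t.get("operator") == "Exists" or t.get("value") == taint.get("value")
        ):
            return True
    return False

def feasible(pod, node):
    # every taint on the node must be tolerated by the pod
    if any(not tolerates(pod["tolerations"], taint) for taint in node["taints"]):
        return False
    # the required nodeAffinity on kubernetes.io/arch must match
    return node["labels"]["kubernetes.io/arch"] in pod["required_arch"]

# pod generated with spec.architecture: [arm64]: hard arm64 affinity, no arch toleration
pod = {"required_arch": ["arm64"], "tolerations": []}

nodes = (
    [{"labels": {"kubernetes.io/arch": "arm64"},
      "taints": [{"key": "arch", "value": "arm64"}]}] * 17          # tainted arm nodes
    + [{"labels": {"kubernetes.io/arch": "amd64"}, "taints": []}] * 2  # plain amd nodes
    + [{"labels": {"kubernetes.io/arch": "amd64"},
       "taints": [{"key": "node-role.kubernetes.io/master"}]}] * 3     # master nodes
)

print(sum(feasible(pod, n) for n in nodes), "of", len(nodes), "nodes feasible")
# → 0 of 22 nodes feasible

# adding the arch toleration (the hack above) makes the 17 arm nodes feasible:
pod["tolerations"].append({"key": "arch", "operator": "Equal", "value": "arm64"})
print(sum(feasible(pod, n) for n in nodes), "of", len(nodes), "nodes feasible")
# → 17 of 22 nodes feasible
```

This is why either the toleration hack or a second architecture entry unblocks the boot: each one makes at least one node pass both checks.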
Originally posted by @ddelange in https://github.com/arangodb/arangodb-docker/issues/53#issuecomment-1255914415