argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Fail to prioritize some nodepool using preferredDuringSchedulingIgnoredDuringExecution #13924

Open BastienMac opened 6 hours ago

BastienMac commented 6 hours ago


What happened? What did you expect to happen?

Hello,

What I want to do with Argo Workflows: I want my pods to be scheduled preferentially on nodes from specific nodepools. To do that, I am trying to use affinity > nodeAffinity > preferredDuringSchedulingIgnoredDuringExecution.

Below is an extract of my workflow. The goal is to have the autoscaled nodes deployed in my preferred nodepool order: k8s-asp-dev-pool-var-b2-15 > k8s-asp-dev-pool-var-b2-30 > k8s-asp-dev-pool-fix-r2-120 > k8s-asp-dev-pool-var-r2-120 > ... > k8s-asp-dev-pool-var-b2-120 > k8s-asp-dev-pool-var-c2-120

spec:
  templates:
    - name: start
      dag:
        tasks:
          - name: sleep-r2-120-1
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node1"}, {"id" : "node2"}, {"id" : "node3"}, {"id" : "node4"}, {"id" : "node5"}, {"id" : "node6"}, {"id" : "node7"}, {"id" : "node8"}, {"id" : "node9"}, {"id" : "node10"}, {"id" : "node11"}, {"id" : "node12"}, {"id" : "node13"}, {"id" :"node14"}, {"id" : "node15"}, {"id" : "node16"}]
          - name: sleep-r2-120-2
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node17"}, {"id" : "node18"}, {"id" : "node19"}, {"id" :"node20"}, {"id" : "node21"}, {"id" : "node22"}, {"id" : "node23"}, {"id" : "node24"}, {"id" : "node25"}, {"id" : "node26"}, {"id" : "node27"}, {"id" : "node28"}, {"id" :"node29"}, {"id" : "node30"}, {"id" : "node31"}, {"id" : "node32"}, {"id" : "node33"}]

          - name: sleep-tempo-8mn-1
            template: t-sleep
            arguments:
              parameters:
                - name: id
                  value: 'tempo-8mn'
                - name: time-second
                  value: '480'
          - name: sleep-r2-240
            template: t-sleep-affinity
            dependencies:
              - sleep-tempo-8mn-1
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node49"}, {"id" : "node50"}, {"id" : "node51"}, {"id" :"node52"}, {"id" : "node53"}, {"id" : "node54"}, {"id" : "node55"},{"id" : "node56"}, {"id" : "node57"}, {"id" : "node58"}, {"id" :"node59"}, {"id" : "node60"}, {"id" : "node61"}, {"id" :"node62"}, {"id" : "node63"}, {"id" : "node64"}, {"id" :"node65"}, {"id" : "node66"}, {"id" : "node67"}]
....
    - name: t-sleep-affinity
      inputs:
        parameters:
          - name: id
          - name: time-second
      outputs: {}
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 99
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var-b2-15'
            - weight: 89
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var-b2-30'
            - weight: 79
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-fix-r2-120'
            - weight: 69
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var-r2-120'
            - weight: 59
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-fix'
            - weight: 49
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var'
            - weight: 39
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var-b2-120'
            - weight: 9
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - 'k8s-asp-dev-pool-var-c2-120'
      script:
        name: ''
        image: >-
          <image name>
        command:
          - python
        resources:
          limits:
            cpu: '5'
            memory: 50Gi
          requests:
            cpu: '5'
            memory: 40Gi
        imagePullPolicy: Always
        source: |
          import time
          time.sleep(int("{{inputs.parameters.time-second}}"))
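
To make the intent of these weights explicit, here is a minimal Python sketch of how I understand the preferred node affinity score: for each existing node, the weights of all matching terms are summed, and a higher sum means a more preferred node. The helper names `affinity_score` and `rank_nodes` and the example node names are hypothetical. The real kube-scheduler combines this score with other scoring plugins and only ranks nodes that already exist, so this illustrates the expected ordering and nothing more.

```python
# Hypothetical helpers, not part of Kubernetes or Argo Workflows.
# Each entry mirrors one preferred term of t-sleep-affinity:
# (weight, values accepted by the `In` operator on the `nodepool` label).
PREFERRED_TERMS = [
    (99, ["k8s-asp-dev-pool-var-b2-15"]),
    (89, ["k8s-asp-dev-pool-var-b2-30"]),
    (79, ["k8s-asp-dev-pool-fix-r2-120"]),
    (69, ["k8s-asp-dev-pool-var-r2-120"]),
    (59, ["k8s-asp-dev-pool-fix"]),
    (49, ["k8s-asp-dev-pool-var"]),
    (39, ["k8s-asp-dev-pool-var-b2-120"]),
    (9, ["k8s-asp-dev-pool-var-c2-120"]),
]


def affinity_score(node_labels, terms=PREFERRED_TERMS):
    """Sum the weights of every term whose value list contains the node's pool.

    `In` is an exact string comparison, so a value with a stray leading
    space, or a pool missing from the list, contributes nothing.
    """
    pool = node_labels.get("nodepool")
    return sum(weight for weight, values in terms if pool in values)


def rank_nodes(nodes):
    """Return node names ordered from most to least preferred."""
    return sorted(nodes, key=lambda name: affinity_score(nodes[name]), reverse=True)


# Assumed example nodes; the names are made up for the illustration.
example_nodes = {
    "node-a": {"nodepool": "k8s-asp-dev-pool-var-c2-60"},
    "node-b": {"nodepool": "k8s-asp-dev-pool-var-r2-120"},
    "node-c": {"nodepool": "k8s-asp-dev-pool-var"},
}

for name in rank_nodes(example_nodes):
    print(name, affinity_score(example_nodes[name]))
```

In this sketch a node from k8s-asp-dev-pool-var-c2-60 scores 0 because that pool is not listed in any term, which is why I did not expect nodes from it.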

What I got: When I launch my workflow, the nodes added during scale-up do not come from the expected nodepools. As you can see in the attached picture, the nodes are deployed from the nodepools k8s-asp-dev-pool-var and k8s-asp-dev-pool-var-c2-60 (which is not even in the list of desired nodepools). The expected nodepool was k8s-asp-dev-pool-var-r2-120, which is the one with the biggest weight, or at least k8s-asp-dev-pool-var-c2-120, which is the one with the smallest weight.

Can you see what I did wrong?

I ran some tests with requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution and it worked without any problem, but that is not what I want to implement.
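
The difference between the two modes can be sketched as follows, under the assumption that a required rule acts as a hard filter on candidate nodes while a preferred rule only orders them. The function names, node names, and the choice of allowed pool are hypothetical and only serve the illustration.

```python
# Hypothetical illustration of hard (required) versus soft (preferred) rules.

def filter_required(nodes, allowed_pools):
    """Hard rule: keep only nodes whose `nodepool` label is in allowed_pools."""
    return {
        name: labels
        for name, labels in nodes.items()
        if labels.get("nodepool") in allowed_pools
    }


def order_preferred(nodes, pool_weights):
    """Soft rule: keep every node, listing higher-weighted pools first."""
    return sorted(
        nodes,
        key=lambda name: pool_weights.get(nodes[name].get("nodepool"), 0),
        reverse=True,
    )


# Assumed example nodes; the names are made up.
nodes = {
    "node-a": {"nodepool": "k8s-asp-dev-pool-var-c2-60"},
    "node-b": {"nodepool": "k8s-asp-dev-pool-var-r2-120"},
}

# Required: node-a is excluded outright.
print(filter_required(nodes, {"k8s-asp-dev-pool-var-r2-120"}))

# Preferred: node-a stays a valid candidate, it is only ranked last.
print(order_preferred(nodes, {"k8s-asp-dev-pool-var-r2-120": 69}))
```

With the required form a pod can never land on an unlisted pool, whereas with the preferred form an unlisted pool remains a valid fallback.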

About my configuration, for kubectl:

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:22:29Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.16", GitCommit:"c5f43560a4f98f2af3743a59299fb79f07924373", GitTreeState:"clean", BuildDate:"2023-11-15T22:28:05Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

For Argo Workflows, we are on version 3.4.4.

Thanks for your help

Version(s)

v3.4.4

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

metadata:
  name: mbo-bma-workflow-test-nodepull-affinity-test
  namespace: dev
  uid: 914fd484-2136-47a9-9c5b-e7820664ab60
  resourceVersion: '17902321915'
  generation: 1
  creationTimestamp: '2024-11-21T14:23:40Z'
  labels:
    workflows.argoproj.io/creator: system-serviceaccount-dev-argo-workflows-argo-server
  managedFields:
    - manager: argo
      operation: Update
      apiVersion: argoproj.io/v1alpha1
      time: '2024-11-21T14:23:40Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:workflows.argoproj.io/creator: {}
        f:spec: {}
spec:
  templates:
    - name: debut
      inputs: {}
      outputs: {}
      metadata: {}
      dag:
        tasks:
          - name: sleep-r2-120-1
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node1"}, {"id" : "node2"}, {"id" : "node3"}, {"id" :
              "node4"}, {"id" : "node5"}, {"id" : "node6"}, {"id" : "node7"},
              {"id" : "node8"}, {"id" : "node9"}, {"id" : "node10"}, {"id" :
              "node11"}, {"id" : "node12"}, {"id" : "node13"}, {"id" :"node14"},
              {"id" : "node15"}, {"id" : "node16"}]
          - name: sleep-r2-120-2
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node17"}, {"id" : "node18"}, {"id" : "node19"}, {"id"
              :"node20"}, {"id" : "node21"}, {"id" : "node22"}, {"id" :
              "node23"}, {"id" : "node24"}, {"id" : "node25"}, {"id" :
              "node26"}, {"id" : "node27"}, {"id" : "node28"}, {"id" :"node29"},
              {"id" : "node30"}, {"id" : "node31"}, {"id" : "node32"}, {"id" :
              "node33"}]
          - name: sleep-r2-120-3
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node33"}, {"id" : "node34"}, {"id" : "node35"}, {"id"
              :"node36"}, {"id" : "node37"}, {"id" : "node38"}, {"id"
              :"node39"}, {"id" : "node40"}, {"id" : "node41"}, {"id"
              :"node45"}, {"id" : "node46"}, {"id" : "node47"}, {"id" :
              "node48"}]
          - name: sleep-b2-15
            template: t-sleep-affinity-small
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            withParam: >-
              [{"id" : "node76"}, {"id" : "node77"}, {"id" : "node78"}, {"id"
              :"node79"}, {"id" : "node80"}, {"id" : "node81"}, {"id" :
              "node82"},{"id" : "node83"}, {"id" : "node84"}, {"id" : "node85"},
              {"id" :"node86"}]
          - name: sleep-tempo-8mn-1
            template: t-sleep
            arguments:
              parameters:
                - name: id
                  value: tempo-8mn
                - name: time-second
                  value: '480'
          - name: sleep-b2-30
            template: t-sleep-affinity-small
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            dependencies:
              - sleep-tempo-8mn-1
            withParam: >-
              [{"id" : "node87"}, {"id" : "node88"}, {"id" : "node89"}, {"id"
              :"node90"}, {"id" : "node91"}, {"id" : "node92"}, {"id" :
              "node93"},{"id" : "node94"}, {"id" : "node95"}, {"id" : "node96"},
              {"id" :"node97"}]
          - name: sleep-r2-240
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            dependencies:
              - sleep-tempo-8mn-1
            withParam: >-
              [{"id" : "node49"}, {"id" : "node50"}, {"id" : "node51"}, {"id"
              :"node52"}, {"id" : "node53"}, {"id" : "node54"}, {"id" :
              "node55"},{"id" : "node56"}, {"id" : "node57"}, {"id" : "node58"},
              {"id" :"node59"}, {"id" : "node60"}, {"id" : "node61"}, {"id"
              :"node62"}, {"id" : "node63"}, {"id" : "node64"}, {"id"
              :"node65"}, {"id" : "node66"}, {"id" : "node67"}]
          - name: sleep-tempo-8mn-2
            template: t-sleep
            arguments:
              parameters:
                - name: id
                  value: tempo-8mn
                - name: time-second
                  value: '480'
            dependencies:
              - sleep-tempo-8mn-1
          - name: sleep-b2-120
            template: t-sleep-affinity
            arguments:
              parameters:
                - name: id
                  value: '{{item.id}}'
                - name: time-second
                  value: '{{workflow.parameters.wait-time-second}}'
            dependencies:
              - sleep-tempo-8mn-2
            withParam: >-
              [{"id" : "node68"}, {"id" : "node69"}, {"id" : "node70"}, {"id"
              :"node71"}, {"id" : "node72"}, {"id" : "node73"}, {"id" :
              "node74"},{"id" : "node75"}, {"id" : "node75"}]
    - name: t-sleep
      inputs:
        parameters:
          - name: id
          - name: time-second
      outputs: {}
      metadata: {}
      script:
        name: ''
        image: >-
          ljyq7dvr.gra7.container-registry.ovh.net/staging/gen_perimetre_acquisition:2.1.324447
        command:
          - python
        resources:
          limits:
            cpu: '1'
            memory: 2Gi
          requests:
            cpu: '1'
            memory: 1Gi
        imagePullPolicy: Always
        source: |
          import time
          time.sleep(int("{{inputs.parameters.time-second}}"))
    - name: t-sleep-affinity
      inputs:
        parameters:
          - name: id
          - name: time-second
      outputs: {}
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 99
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-15
            - weight: 89
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-30
            - weight: 79
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-fix-r2-120
            - weight: 69
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-r2-120
            - weight: 59
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-fix
            - weight: 49
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var
            - weight: 39
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-120
            - weight: 9
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-c2-120
      metadata: {}
      script:
        name: ''
        image: >-
          ljyq7dvr.gra7.container-registry.ovh.net/staging/gen_perimetre_acquisition:2.1.324447
        command:
          - python
        resources:
          limits:
            cpu: '5'
            memory: 50Gi
          requests:
            cpu: '5'
            memory: 40Gi
        imagePullPolicy: Always
        source: |
          import time
          time.sleep(int("{{inputs.parameters.time-second}}"))
    - name: t-sleep-affinity-small
      inputs:
        parameters:
          - name: id
          - name: time-second
      outputs: {}
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 99
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-15
            - weight: 89
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-30
            - weight: 79
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-fix-r2-120
            - weight: 69
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-r2-120
            - weight: 59
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-fix
            - weight: 49
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var
            - weight: 39
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-b2-120
            - weight: 9
              preference:
                matchExpressions:
                  - key: nodepool
                    operator: In
                    values:
                      - k8s-asp-dev-pool-var-c2-120
      metadata: {}
      script:
        name: ''
        image: >-
          ljyq7dvr.gra7.container-registry.ovh.net/staging/gen_perimetre_acquisition:2.1.324447
        command:
          - python
        resources:
          limits:
            cpu: '2'
            memory: 8Gi
          requests:
            cpu: '2'
            memory: 6Gi
        imagePullPolicy: Always
        source: |
          import time
          time.sleep(int("{{inputs.parameters.time-second}}"))
  entrypoint: debut
  arguments:
    parameters:
      - name: wait-time-second
        value: '1500'
  volumes:
    - name: v-output-0
      emptyDir: {}
    - name: v-output-1
      emptyDir: {}
    - name: v-output-2
      emptyDir: {}
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

I'll update this later

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

I'll update this later