flant / addon-operator

A system to manage additional components for Kubernetes cluster in a simple, consistent and automated way.
https://flant.github.io/addon-operator/
Apache License 2.0

Execute ModuleRun tasks of the same weight in parallel #504

Closed by miklezzzz 2 weeks ago

miklezzzz commented 1 month ago

Overview

ModuleRun tasks and the corresponding ModuleHookRun tasks for modules of the same order (weight) are executed concurrently in dedicated parallel queues. By default, the operator's queue set contains 10 parallel queues.
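
To illustrate how modules of equal weight could end up in a single task, here is a minimal sketch. It is not the operator's code: the `module` type, `groupByWeight`, `taskLabel`, and the weight values are all hypothetical and exist only for this example. It assumes the module list is already sorted by weight.

```go
package main

import (
	"fmt"
	"strings"
)

// module is a hypothetical, simplified stand-in for an operator module.
type module struct {
	name   string
	weight int // illustrative values only, not the real module weights
}

// groupByWeight splits a weight-ordered list into runs of equal weight.
func groupByWeight(mods []module) [][]string {
	var groups [][]string
	prev := 0
	for i, m := range mods {
		if i > 0 && m.weight == prev {
			last := len(groups) - 1
			groups[last] = append(groups[last], m.name)
		} else {
			groups = append(groups, []string{m.name})
		}
		prev = m.weight
	}
	return groups
}

// taskLabel mimics the queue listing: a single module stays a ModuleRun,
// several modules become one ParallelModuleRun.
func taskLabel(names []string) string {
	if len(names) == 1 {
		return "ModuleRun:main:" + names[0]
	}
	return "ParallelModuleRun:main:Parallel run for " + strings.Join(names, ", ")
}

func main() {
	mods := []module{
		{"namespace-configurator", 600},
		{"secret-copier", 600},
		{"deckhouse-tools", 650},
		{"documentation", 810},
		{"echo", 900},
		{"mcplay", 900},
	}
	for _, g := range groupByWeight(mods) {
		fmt.Println(taskLabel(g))
	}
}
```

Modules that share a weight with a neighbour collapse into one parallel task, while a module with a unique weight remains an ordinary ModuleRun in the main queue.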

What this PR does / why we need it

This PR adds a new task type, ParallelModuleRun. A task of this type represents a group of smaller ModuleRun and ModuleHookRun tasks that share the same order/weight. These subordinate tasks are executed in parallel in pre-created queues named parallel_queue_x, and all results and errors are propagated back to the corresponding ParallelModuleRun task, which updates its status accordingly.
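
The fan-out and error propagation described above can be sketched roughly as follows. This is an approximation under stated assumptions, not the PR's implementation: goroutines stand in for the parallel_queue_x queues, `runParallel` and `errHelm` are hypothetical names, and the retry-with-delay behaviour of real queues is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errHelm is a placeholder error used only by this example.
var errHelm = errors.New("helm upgrade failed")

// runParallel executes run for every module on a fixed number of workers
// (stand-ins for parallel queues) and returns the failures keyed by module
// name, so that a parent task can update its own status from them.
func runParallel(modules []string, queues int, run func(string) error) map[string]error {
	if queues < 1 {
		queues = 1
	}
	jobs := make(chan string)
	failed := make(map[string]error)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < queues; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				if err := run(name); err != nil {
					mu.Lock()
					failed[name] = err // propagate the error to the parent
					mu.Unlock()
				}
			}
		}()
	}
	for _, name := range modules {
		jobs <- name
	}
	close(jobs)
	wg.Wait()
	return failed
}

func main() {
	failed := runParallel([]string{"echo", "mcplay"}, 10, func(name string) error {
		if name == "mcplay" {
			return errHelm // simulate one failing module
		}
		return nil
	})
	fmt.Printf("failures %d\n", len(failed))
	for name, err := range failed {
		fmt.Printf("- %s: %v\n", name, err)
	}
}
```

Collecting failures in one map keyed by module name mirrors the per-module "Errors:" list that the parent task shows in the queue listing.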

Special notes for your reviewer

miklezzzz commented 1 month ago

An example of a grouped (parallel) run:

Queue 'main': length 31, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for cloud-data-crd, metallb-crd, operator-prometheus-crd, prometheus-crd, snapshot-controller-crd, user-authn-crd, vertical-pod-autoscaler-crd:OperatorStartup
 2. ModuleRun:main:flow-schema:doStartup:OperatorStartup
 3. ModuleRun:main:admission-policy-engine:doStartup:OperatorStartup
 4. ModuleRun:main:cloud-provider-openstack:doStartup:OperatorStartup
 5. ModuleRun:main:local-path-provisioner:doStartup:OperatorStartup
 6. ModuleRun:main:cni-flannel:doStartup:OperatorStartup
 7. ModuleRun:main:kube-proxy:doStartup:OperatorStartup
 8. ModuleRun:main:registry-packages-proxy:doStartup:OperatorStartup
 9. GroupedModuleRun:main:Grouped run for control-plane-manager, node-manager, terraform-manager:OperatorStartup
10. ModuleRun:main:kube-dns:doStartup:OperatorStartup
11. ModuleRun:main:snapshot-controller:doStartup:OperatorStartup
12. ModuleRun:main:cert-manager:doStartup:OperatorStartup
13. ModuleRun:main:user-authz:doStartup:OperatorStartup
14. ModuleRun:main:user-authn:doStartup:OperatorStartup
15. ModuleRun:main:operator-prometheus:doStartup:OperatorStartup
16. ModuleRun:main:prometheus:doStartup:OperatorStartup
17. ModuleRun:main:prometheus-metrics-adapter:doStartup:OperatorStartup
18. ModuleRun:main:vertical-pod-autoscaler:doStartup:OperatorStartup
19. GroupedModuleRun:main:Grouped run for extended-monitoring, monitoring-applications, monitoring-custom, monitoring-deckhouse, monitoring-kubernetes, monitoring-kubernetes-control-plane, monitoring-ping:OperatorStartup
20. ModuleRun:main:node-local-dns:doStartup:OperatorStartup
21. ModuleRun:main:ingress-nginx:doStartup:OperatorStartup
22. ModuleRun:main:log-shipper:doStartup:OperatorStartup
23. ModuleRun:main:pod-reloader:doStartup:OperatorStartup
24. ModuleRun:main:chrony:doStartup:OperatorStartup
25. GroupedModuleRun:main:Grouped run for dashboard, operator-trivy, upmeter:OperatorStartup
26. GroupedModuleRun:main:Grouped run for namespace-configurator, secret-copier:OperatorStartup
27. ModuleRun:main:deckhouse-tools:doStartup:OperatorStartup
28. ModuleRun:main:documentation:doStartup:OperatorStartup
29. GroupedModuleRun:main:Grouped run for echo, mcplay:OperatorStartup
30. ConvergeModules:main:::Operator-Startup
31. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes

Queue 'group_queue_0': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_0:cloud-data-crd:doStartup:OperatorStartup

Queue 'group_queue_1': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_1:metallb-crd:doStartup:OperatorStartup

Queue 'group_queue_2': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_2:operator-prometheus-crd:doStartup:OperatorStartup

Queue 'group_queue_3': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_3:prometheus-crd:doStartup:OperatorStartup

Queue 'group_queue_4': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_4:snapshot-controller-crd:doStartup:OperatorStartup

Queue 'group_queue_5': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_5:user-authn-crd:doStartup:OperatorStartup

Queue 'group_queue_6': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_6:vertical-pod-autoscaler-crd:doStartup:OperatorStartup

Summary:
- 'main' queue: 31 tasks.
- 14 other queues (7 active, 7 empty): 7 tasks.
- total 38 tasks to handle.

miklezzzz commented 1 month ago

A failed task in a grouped run:

Queue 'main': length 8, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for mcplay:OperatorStartup:failures 1:
    Errors:
    - mcplay: helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

 2. ConvergeModules:main:::Operator-Startup
 3. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes
 4. ModuleRun:main:node-manager:Kubernetes-Change-ModuleValues
 5. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 6. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 7. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 8. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes

Queue 'group_queue_1': length 1, status: 'run first task'

 1. ModuleRun:group_queue_1:mcplay:doStartup:OperatorStartup:failures 1:helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

Summary:
- 'main' queue: 8 tasks.
- 99 other queues (1 active, 98 empty): 1 task.
- total 9 tasks to handle.

miklezzzz commented 1 month ago

Yet another example:

Queue 'main': length 12, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for echo, mcplay:OperatorStartup:failures 11:
    Errors:
    - echo: helm upgrade failed: cannot patch "echo-server" with kind Deployment: Deployment.apps "echo-server" is invalid: spec.template.spec.containers: Required value
    - mcplay: helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

 2. ConvergeModules:main:::Operator-Startup
 3. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes
 4. ModuleRun:main:node-manager:Kubernetes-Change-ModuleValues
 5. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 6. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 7. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 8. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 9. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
10. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
11. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
12. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes

Queue 'group_queue_0': length 1, status: 'sleep after fail for 21.4s (1s left of 21s delay)'

 1. ModuleRun:group_queue_0:echo:doStartup:OperatorStartup:failures 6:helm upgrade failed: cannot patch "echo-server" with kind Deployment: Deployment.apps "echo-server" is invalid: spec.template.spec.containers: Required value

Queue 'group_queue_1': length 1, status: 'sleep after fail for 13.3s (3s left of 13s delay)'

 1. ModuleRun:group_queue_1:mcplay:doStartup:OperatorStartup:failures 5:helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

Summary:
- 'main' queue: 12 tasks.
- 99 other queues (2 active, 97 empty): 2 tasks.
- total 14 tasks to handle.

diafour commented 2 weeks ago

"Group" makes it feel like a logical group of modules, e.g. "group of monitoring modules", "group of cni modules". Why not name it according to the PR description: ParallelModuleRun?

Also, there is a group parameter in kubernetes subscriptions.

miklezzzz commented 2 weeks ago

makes sense

miklezzzz commented 2 weeks ago

[deckhouse] deckhouse@dev-master-0 /deckhouse $ deckhouse-controller queue list
Queue 'main': length 33, status: 'run first task'

 1. ParallelModuleRun:main:Parallel run for cloud-data-crd, metallb-crd, operator-prometheus-crd, prometheus-crd, snapshot-controller-crd, user-authn-crd, vertical-pod-autoscaler-crd:OperatorStartup
 2. ModuleRun:main:flow-schema:doStartup:OperatorStartup
 3. ModuleRun:main:admission-policy-engine:doStartup:OperatorStartup
 4. ModuleRun:main:cloud-provider-openstack:doStartup:OperatorStartup
 5. ModuleRun:main:local-path-provisioner:doStartup:OperatorStartup
 6. ModuleRun:main:cni-flannel:doStartup:OperatorStartup
 7. ModuleRun:main:kube-proxy:doStartup:OperatorStartup
 8. ModuleRun:main:registry-packages-proxy:doStartup:OperatorStartup
 9. ParallelModuleRun:main:Parallel run for control-plane-manager, node-manager, terraform-manager:OperatorStartup
10. ModuleRun:main:kube-dns:doStartup:OperatorStartup
11. ModuleRun:main:snapshot-controller:doStartup:OperatorStartup
12. ModuleRun:main:cert-manager:doStartup:OperatorStartup
13. ModuleRun:main:user-authz:doStartup:OperatorStartup
14. ModuleRun:main:user-authn:doStartup:OperatorStartup
15. ModuleRun:main:operator-prometheus:doStartup:OperatorStartup
16. ModuleRun:main:prometheus:doStartup:OperatorStartup
17. ModuleRun:main:prometheus-metrics-adapter:doStartup:OperatorStartup
18. ModuleRun:main:vertical-pod-autoscaler:doStartup:OperatorStartup
19. ParallelModuleRun:main:Parallel run for extended-monitoring, monitoring-applications, monitoring-custom, monitoring-deckhouse, monitoring-kubernetes, monitoring-kubernetes-control-plane, monitoring-ping:OperatorStartup
20. ModuleRun:main:node-local-dns:doStartup:OperatorStartup
21. ModuleRun:main:metallb:doStartup:OperatorStartup
22. ModuleRun:main:l2-load-balancer:doStartup:OperatorStartup
23. ModuleRun:main:ingress-nginx:doStartup:OperatorStartup
24. ModuleRun:main:log-shipper:doStartup:OperatorStartup
25. ModuleRun:main:pod-reloader:doStartup:OperatorStartup
26. ModuleRun:main:chrony:doStartup:OperatorStartup
27. ParallelModuleRun:main:Parallel run for dashboard, operator-trivy, upmeter:OperatorStartup
28. ParallelModuleRun:main:Parallel run for namespace-configurator, secret-copier:OperatorStartup
29. ModuleRun:main:deckhouse-tools:doStartup:OperatorStartup
30. ModuleRun:main:documentation:doStartup:OperatorStartup
31. ParallelModuleRun:main:Parallel run for echo, mcplay:OperatorStartup
32. ConvergeModules:main:::Operator-Startup
33. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes

Queue 'parallel_queue_0': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_0:snapshot-controller-crd:doStartup:OperatorStartup

Queue 'parallel_queue_1': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_1:user-authn-crd:doStartup:OperatorStartup

Queue 'parallel_queue_2': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_2:vertical-pod-autoscaler-crd:doStartup:OperatorStartup

Queue 'parallel_queue_3': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_3:cloud-data-crd:doStartup:OperatorStartup

Queue 'parallel_queue_4': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_4:metallb-crd:doStartup:OperatorStartup

Queue 'parallel_queue_5': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_5:operator-prometheus-crd:doStartup:OperatorStartup

Queue 'parallel_queue_6': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_6:prometheus-crd:doStartup:OperatorStartup

Summary:
- 'main' queue: 33 tasks.
- 14 other queues (7 active, 7 empty): 7 tasks.
- total 40 tasks to handle.