argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Daemonset support #1457

Open batthebee opened 3 years ago

batthebee commented 3 years ago

Summary

Support for canary updates etc. for DaemonSets

Use Cases

For tools like KubeEdge it makes sense to address workloads via DaemonSets. For this, it would be great to be able to use Argo Rollouts to roll out the workload. For example, a canary update should first target a few nodes and roll out to the rest after success.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

mumoshu commented 3 years ago

I'm interested in this feature, too 👍

Would this be in line with the direction maintainers want to go? If so, I'm eager to contribute anything I can.

FWIW, I have seen this happen in a similar (in my impression) project called Flagger. It resulted in a huge refactor, extracting a Controller interface to abstract away Deployment, DaemonSet, and Service.

Controller interface: https://github.com/fluxcd/flagger/pull/378/files#diff-0918f1aa419d96e810030b3a2374d96c3b9261d5aef9a50e2ead602ab0e3f24dR5-R19
DaemonSet support: https://github.com/fluxcd/flagger/pull/455
Service support: https://github.com/fluxcd/flagger/pull/372

I believe that a comparable amount of refactoring would be needed to do this in Argo Rollouts, too.

I would start by extracting a new interface out of rolloutContext, perhaps named workloadContext, deployer, or something else, so that a new kind like DaemonSet can be supported by implementing that interface.

https://sourcegraph.com/github.com/argoproj/argo-rollouts/-/blob/rollout/context.go?L14:6#tab=references

API-wise, probably we'd need to update the usage of Rollout resource's WorkloadRef? AFAIK, Argo Rollouts uses WorkloadRefs only for extracting pod templates today.

If we were to add support for DaemonSet, it needs to be polymorphic depending on the kind of WorkloadRef, instantiating an appropriate workloadContext implementation according to the kind.

@jessesuen Hi! Do you have any thoughts about this?

jessesuen commented 3 years ago

I'm interested in this feature, too 👍 Would this be in line with the direction maintainers want to go? If so, I'm eager to contribute anything I can.

@mumoshu I do think the community would be very interested in this. And hats off to you for this ambitious effort! Support for DaemonSets would be great to have, though a lot of things in the rollout spec will not be applicable, so we should do what we can to eliminate confusion.

You are right that the codebase is heavily geared towards Deployments/ReplicaSets because, well, we actually started off using the Kubernetes Deployment controller codebase. There was a lot of code reused from there.

API-wise, probably we'd need to update the usage of Rollout resource's WorkloadRef? AFAIK, Argo Rollouts uses WorkloadRefs only for extracting pod templates today. If we were to add support for DaemonSet, it needs to be polymorphic depending on the kind of WorkloadRef, instantiating an appropriate workloadContext implementation according to the kind.

I have some questions about what it means to have DaemonSet canaries. When a Rollout references a DaemonSet where there are 10 nodes, and an update happens with a 10% canary, is the expectation that 1 out of 10 nodes will run the new version and 9/10 continue to run the old? How would this be accomplished?

Would there be support for blue-green DaemonSets? e.g. During an update, deploy two versions V1 and V2 on each node. But because they run with different labels (pod-template-hashes), we could modify an active service to shift traffic to DaemonSet v2 atomically.
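
To make that concrete, here is a rough sketch of the active Service flip I have in mind, assuming the canary DaemonSet pods would carry a rollouts-pod-template-hash label the same way ReplicaSet pods do today (names and the hash below are made up):

    # Hypothetical: an active Service pinned to the pod-template-hash of the
    # "green" DaemonSet; updating this one selector would shift traffic to v2
    # atomically.
    apiVersion: v1
    kind: Service
    metadata:
      name: node-agent-active
    spec:
      selector:
        app: node-agent
        rollouts-pod-template-hash: 6f79b8c7d4   # made-up hash of the v2 pod template
      ports:
      - port: 8080
        targetPort: 8080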

Another thing we need to be conscious of is RBAC. DaemonSets are often restricted by cluster admins as not something end users get to deploy. The guarantee that we would need to have is: just because a user can create a Rollout should not necessarily mean they can create pods on every node, e.g.:

spec:
  template:
    spec: ....
  daemonSet: true

I think if we stick to only allowing DaemonSets to work using the workloadRef feature, then we sidestep the issue because if they can create a DaemonSet in the namespace, then it is safe to assume that we can create copies of it for the purposes of canarying.
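
In other words, something along these lines, where the Rollout only references an existing DaemonSet instead of embedding a pod template (kind: DaemonSet in workloadRef is hypothetical today, and the names are made up):

    # Hypothetical sketch of a workloadRef-only Rollout for a DaemonSet.
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: node-agent
    spec:
      workloadRef:
        apiVersion: apps/v1
        kind: DaemonSet        # not supported today; illustration only
        name: node-agent
      strategy:
        canary:
          steps:
          - setWeight: 10
          - pause: {}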

mumoshu commented 3 years ago

@jessesuen Hey!! Thank you so much for your detailed comment.

we actually started off using the Kubernetes Deployment controller codebase. There was a lot of code reused from there.

Ahh that makes sense!

When a Rollout references a DaemonSet where there are 10 nodes, and an update happens with a 10% canary, is the expectation that 1 out of 10 nodes will run the new version and 9/10 continue to run the old? How would this be accomplished?

I was thinking of using a node selector similar to what Flagger has used in https://github.com/fluxcd/flagger/pull/455/files#diff-5b82d873db5e83d2c5e55ac78a8db04acb28005e570741c6924edc78669ad6a2R19

For example, you label the source DaemonSet that is referenced from workloadRef with a node selector like rollouts.argoproj.io/scale-to-0: "true". There will be no nodes labeled with scale-to-0, so having that node selector results in no DaemonSet pods being created.

Rollouts creates the canary DaemonSet from the source DaemonSet. At its creation time, the canary DaemonSet has the scale-to-0 node selector too, so its pods aren't scheduled anywhere.
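
Roughly like this (the label key is just an example, not an existing Argo Rollouts convention):

    # Sketch: the canary DaemonSet is created with a node selector that no node
    # satisfies, so it starts "parked" with zero pods.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-agent-canary
    spec:
      selector:
        matchLabels:
          app: node-agent
      template:
        metadata:
          labels:
            app: node-agent
        spec:
          nodeSelector:
            rollouts.argoproj.io/scale-to-0: "true"   # no node carries this label
          containers:
          - name: agent
            image: example/agent:v2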

Would there be support for blue-green DaemonSets?

Yes I believe so.

In a blue-green deployment, Rollouts would update the canary DaemonSet's node selector to e.g. scale-to-100. All nodes would be labeled with scale-to-100 so that the canary DaemonSet's pods are scheduled onto all nodes. The stable DaemonSet should always omit the scale-to-* node selector so that its pods are scheduled onto all the nodes.

I have some questions about what it means to have DaemonSet canaries...

In a canary deployment, Rollouts would update the canary DaemonSet's node selector to e.g. scale-to-10 so that its pods are scheduled only onto the nodes that have the scale-to-10 label. I'm not sure if the user should be required to label all the nodes outside of Rollouts, or if Rollouts should dynamically (re)label nodes. Maybe the latter is more user-friendly. The latter would require a lot of node labels to be added and manipulated, so there might be a permission issue (should Rollouts be allowed to label nodes? maybe yes, but not everyone will be happy about that?), but I need a bit more research for a more concrete proposal.
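
For a 10% step, the moving part would be a node label along the lines of the sketch below, with the canary DaemonSet's node selector switched from scale-to-0 to scale-to-10 at the same time (again, the label key is illustrative only):

    # Sketch: roughly 1 in 10 nodes gets the step label, and only those nodes
    # can schedule the canary DaemonSet's pods once its node selector is
    # switched to rollouts.argoproj.io/scale-to-10.
    apiVersion: v1
    kind: Node
    metadata:
      name: node-3
      labels:
        rollouts.argoproj.io/scale-to-10: "true"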

I think if we stick to only allowing DaemonSets to work using the workloadRef feature, then we sidestep the issue because if they can create a DeamonSet in the namespace, then it is safe to assume that we can create copies of it for the purposes of canarying.

Yes, I was thinking about using workloadRef only.

I'm not yet sure how I could model it nicely with an embedded workload definition.

Do you think that making this DaemonSet support workloadRef-only is a valid approach?

Or must we add support for the non-ref case, e.g. for consistency reasons? 🤔

jessesuen commented 3 years ago

Thank you for the detailed answers. Things are much clearer now. I now understand how it's possible to limit DaemonSets by labeling nodes.

I'm not sure if the user should be required to label all the nodes outside of Rollouts, or if Rollouts should dynamically (re)label nodes. Maybe the latter is more user-friendly. The latter would require a lot of node labels to be added and manipulated, so there might be a permission issue (should Rollouts be allowed to label nodes? maybe yes, but not everyone will be happy about that?), but I need a bit more research for a more concrete proposal.

I think we both agree that it would be unreasonable to ask the user to manage node labels themselves, especially with things like cluster autoscaler where users will not have an opportunity to do this. Plus I imagine it would be very error prone. So I do believe our only option would be to manage the node labels for canary purposes.

Do you think that making this DaemonSet support workloadRef-only is a valid approach? Or must we add support for the non-ref case, e.g. for consistency reasons? 🤔

I think from the perspective of RBAC and security, we should go with the workloadRef approach. A cluster admin would not be happy if their users could circumvent restrictions on creating DaemonSet-like objects by deploying a Rollout object in some DaemonSet mode.

If we really wanted to, we could introduce a different RolloutDaemonSet for the purposes of finer-grained RBAC control, but I think people would be satisfied with fewer CRDs and just the one way to do it (workloadRef to DaemonSet).

mumoshu commented 3 years ago

@jessesuen Thanks for your confirmation!

Plus I imagine it would be very error prone. So I do believe our only option would be to manage the node labels for canary purposes.

I fully agree with you.

A cluster admin would not be happy if their users could circumvent restrictions on creating DaemonSet-like objects by deploying a Rollout object in some DaemonSet mode.

Good point!

If we really wanted to, we could introduce a different RolloutDaemonSet for the purposes of finer-grained RBAC control, but I think people would be satisfied with fewer CRDs and just the one way to do it (workloadRef to DaemonSet).

Makes sense. Then I'll begin with the workloadRef approach for the DaemonSet support after I manage to finish my refactoring of rolloutContext 👍

mumoshu commented 3 years ago

@jessesuen Would you mind if I submitted a huge pull request to refactor it, extracting out a Deployer interface to abstract away most of the ReplicaSet-related code?

Here's the whole diff of my branch:

https://github.com/mumoshu/argo-rollouts/compare/c72a354...refactor-replicaset-deployer

And here's the state of the Deployer interface:

https://github.com/mumoshu/argo-rollouts/blob/d2d064330aa7f7d5957530eebb0d18551f1b4116/rollout/replicaset.go#L41-L55

I've already removed most of the ReplicaSet smell from rolloutContext and the new interface. I think it's mostly done once I manage to remove or replace the following functions in it:

    ReconcileNewReplicaSet() (bool, error)
    ScaleReplicaSetAndRecordEvent(rs *appsv1.ReplicaSet, newScale int32) (bool, *appsv1.ReplicaSet, error)
    GetAllReplicaSetsAndSyncRevision(createIfNotExisted bool) (*appsv1.ReplicaSet, error)

ReconcileNewReplicaSet is currently for creating a new ReplicaSet, but a similar process is required for the DaemonSet support, so I'd just rename it to ReconcileNew or ReconcileNewWorkload, or something like that.

ScaleReplicaSetAndRecordEvent might be an implementation detail specific to the replicaset deployer so I'll move it out of the interface and try to introduce another generic function that makes sense for DaemonSet too.

GetAllReplicaSetsAndSyncRevision is for managing old replicasets according to the rev history limit and something similar might be needed for DaemonSet, too. Perhaps I'll just rename it to GetAll(Workloads)AndSyncRevision.

I can explain it even more carefully if you ask. But the question here is: can I really submit this as a single huge pull request? 🤔

I'm afraid it's too hard to review. If you're happy with it, I'll just go ahead and submit a pull request shortly.

Otherwise, I'd appreciate your guidance. For example, if you want me to write a proposal, what would you like it to cover? For which commit(s) in my branch would you want a separate pull request?

xavipanda commented 3 years ago

IMHO, this is a gigantic refactor with a big risk of introducing behaviours that may eventually affect the project. But with enough tests, and a fleet big enough to exercise both Deployment and DaemonSet rollouts to produce a solid and documented analysis, it can be a good feature and a path for extending Rollouts to other types.

But I actually have some questions about how the rollout approach would work within your idea, mostly based on the frequent scenarios for DaemonSets that we consume (agents/etc.):

I think that scope and uniformity are the predominant factors in the majority of DaemonSet workloads, and ingress-type workloads run as DaemonSets are a very small % of how DaemonSets are used across the industry today.

I think we both agree that it would be unreasonable to ask the user to manage node labels themselves, especially with things like cluster autoscaler where users will not have an opportunity to do this. Plus I imagine it would be very error prone. So I do believe our only option would be to manage the node labels for canary purposes.

I wonder what the effect on the weight calculation would be when the autoscaler adds or removes a noticeable number of nodes, a scenario that takes place constantly with AI/ML types of workloads. And since this spans all nodes, scale-up/scale-down during a canary could often face a situation where the weights of stable and canary are altered substantially. In this respect Argo interacts with the HPA and is a bit more in control of the scale/spread; DaemonSets may not be that simple to tame.

Regarding the possibility of running A+B: based on the common use case of DaemonSets, they are typically used to spread one pod per node, and that's not a coincidence but a need.

Honestly I feel like StatefulSets would have a much broader impact as a feature and could see larger adoption by users, but it's just my impression; you can get STSs to behave similarly to how DSs work.

Just out of curiosity, can you describe a bit the workloads that you run on DaemonSets so I can understand your motivation a bit better, please?

mumoshu commented 3 years ago

IMHO, this is a gigantic refactor with a big risk of introducing behaviours that may eventually affect the project. But with enough tests, and a fleet big enough to exercise both Deployment and DaemonSet rollouts to produce a solid and documented analysis, it can be a good feature and a path for extending Rollouts to other types.

I hear you. But if you read the diff on my branch, you'd probably notice that all the unit tests are passing and I'm mostly moving functions from one place to another and renaming them. There's no logic change.

In terms of "risk" that eventually affects the project, I'd rather say it's in the current state of the code.

Currently, pkg/rollout is a big monolith. Inside it, the controller, syncer, blue-green, and canary-related code have cross-dependencies on each other. The analyzer and experiment-related code have tight coupling to ReplicaSet-related code. So, if you were to add some feature to Argo Rollouts, you may be required to update ReplicaSet-related code scattered across many source files.

Once someone manages to refactor it, you'll only be required to touch one source file when changing some ReplicaSet-related thing. Adding support for DaemonSet shouldn't affect other sources.

I'm not saying a huge refactor has no risk. I'm just saying that it has the benefit of reducing other risks, too :)

I think that scope and uniformity are the predominant factors in the majority of DaemonSet workloads, and ingress-type workloads run as DaemonSets are a very small % of how DaemonSets are used across the industry today.

I generally agree. It isn't perfect. But my opinion is that DaemonSet support is still worth the effort. And my target users are ones who run ingress DaemonSets with NodePort and all sorts of node agents (metrics, logs, and trace forwarders, or something like that).

The rule of thumb here is that there will technically be no way to provide zero-downtime blue/green or canary deployments for a DaemonSet whose pods use hostPort without SO_REUSEPORT, or privileged containers binding the host port directly.

NodePort should be fine. But sometimes you'd better use externalTrafficPolicy: Local and the service topology feature to restrict each NodePort to routing traffic to node-local pods. Actually, that's how I suppose DaemonSets for cluster-internal ingress will be supported.
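
For example, a Service along these lines should keep each node's NodePort serving only the pods on that node (names and ports are made up):

    # Sketch: with externalTrafficPolicy: Local, traffic hitting a node's
    # NodePort is only routed to pods running on that same node.
    apiVersion: v1
    kind: Service
    metadata:
      name: ingress-daemonset
    spec:
      type: NodePort
      externalTrafficPolicy: Local
      selector:
        app: ingress
      ports:
      - port: 80
        targetPort: 8080
        nodePort: 30080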

I wonder what the effect on the weight calculation would be when the autoscaler adds or removes a noticeable number of nodes, a scenario that takes place constantly with AI/ML types of workloads. And since this spans all nodes, scale-up/scale-down during a canary could often face a situation where the weights of stable and canary are altered substantially

I think you've nailed a good point here! My theory is that the imaginary DaemonSet support would work great if it temporarily paused the canary rollout when it detected new or deleted nodes, then recalculated or "resharded" the scale-to-x node-label-to-node mapping, and restarted the canary rollout.

Honestly I feel like StatefulSets would have a much broader impact as a feature

Yeah, I can imagine that :)

To be clear, I won't try to add STS support myself anytime soon as I don't have my own use case for it "yet". But if anyone like me manages to finish the big refactor to abstract the ReplicaSet-specific logic away, it should be relatively easy to add support for STS, too.

Does this answer your question?

mumoshu commented 3 years ago

Oh, note that I imagine there will be two flavours of STS support :)

One flavour would deploy two or more statefulsets and do blue-green or canary releases across those.

Another flavour would somehow modify a single StatefulSet to gradually replace canary pods with stable pods. You would be required to reimplement the K8s StatefulSet controller inside Rollouts, or hack around it with an admission webhook server that mutates pods managed by the StatefulSet.

The former might be relatively easy to implement but wouldn't provide 100% coverage on various use-cases of StatefulSet. The latter is relatively hard to implement but would provide more coverage.

Anyway, something like the Deployer interface I've extracted in my refactoring branch should provide a solid foundation for these two flavors as well, if done correctly.

syndicut commented 2 years ago

Just for reference, there is also https://github.com/DataDog/extendeddaemonset for DaemonSets with canaries, but I like argo-rollouts' support for extended canary analysis =)

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

chanhz commented 1 year ago

You can use OpenKruise Advanced DaemonSet.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.