Closed: hexsel closed this issue 3 years ago
Currently, according to https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/, Istio canary deployments require separate services for stable and canary. This is undesirable, as I understand in-mesh communications won't hit the virtualservice directly (there's no DNS entry added by default for those in the Kubernetes cluster). This forces microservice-to-microservice communication to choose whether to hit the stable service or the canary (hardcoded), or go to the gateway, or DNS entries to be added for the virtualservices, which is cumbersome and poorly documented.

All Istio canary examples I could find use a single service, but have a destinationrule that splits them into 2 groups based on label selectors (for canary vs stable, or per version), then the virtualservice links this single service and splits the traffic accordingly.

Maybe a canary rollout should follow this principle? Or I may be misunderstanding this problem and this can be fixed with some more documentation?
@hexsel I'd like to understand your proposal a little better. Could you provide the alternative resulting DestinationRule / VirtualService that achieves the canary using that means?
OK, my memory is a bit hazy (and I also might be wrong about this), but I think one of the motivations for current design was that two separate services were able to produce istio prometheus metrics that were not available with a single DestinationRule and VirtualService. In other words, it would be possible to graph and compare the metrics of a canary vs. stable service, but not two DestinationRule subsets.
That said, I think it is possible to enhance the rollout istio canary implementation to reference a DestinationRule which would be updated by the controller with the canary/stable rollouts-pod-template-hash values.
Let me try to give an example:
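A minimal sketch of what that DestinationRule could look like (the resource name and hash values below are purely illustrative; the idea is that the controller would rewrite each subset's label as new ReplicaSets are created):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-destrule        # illustrative name
spec:
  host: reviews                 # the single Service shared by stable and canary
  subsets:
  - name: stable
    labels:
      rollouts-pod-template-hash: 6f69cc8f45   # kept in sync by the controller
  - name: canary
    labels:
      rollouts-pod-template-hash: 7d8b4f5c6d   # kept in sync by the controller
```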
I believe this is because there is no DNS entry for virtualservices, only for services. They should, according to my limited understanding, still route through the mesh, but now you have to forcefully pick the canary service or the production service.
But again, maybe I am misunderstanding some detail here?
The proposal would be for there to be only one service, and the virtualservice would configure which replicaset gets hit, like in this old Istio example: https://istio.io/latest/blog/2017/0.1-canary/
Argo Rollouts would then be responsible only for adjusting the weights in that VirtualService CRD.
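As a rough illustration (the subset names assume a DestinationRule like the sketch above, and the weights are just one possible mid-rollout split):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc            # illustrative name
spec:
  hosts:
  - reviews                     # the single Service
  http:
  - route:
    - destination:
        host: reviews
        subset: stable          # subsets defined in the DestinationRule
      weight: 90
    - destination:
        host: reviews
        subset: canary
      weight: 10                # the controller would shift this weight at each step
```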
Regarding the metrics issue you mentioned, I admit I do not know if it is possible, but I found this: https://istio.io/latest/docs/tasks/observability/metrics/querying-metrics/
Which seems to indicate you could query metrics for different labels on the same virtual service.
Thanks for providing the use case. I believe what you are saying checks out.
> Which seems to indicate you could query metrics for different labels on the same virtual service.
Ah. My information was based on what I saw in the default dashboard. But looks like what you are saying is possible.
I think I agree Rollouts should provide multiple options for performing canary:
@jessesuen we have the 503 issue when using the canary/blue-green strategy with Istio, as Istio doesn't expect two services to point to the same pod. When can we expect the DestinationRule support in argo-rollouts? Thank you
Does it make sense then to eliminate the first option and drop support for canary/stable SVCs and just use subsets for Istio traffic routing? Or is it actually possible to have a single Virtual Service reference multiple SVCs (was not able to get this working myself either)?
Using a single Service along with a Virtual Service and Destination Rule subsets also opens the door to easily add header based routing.
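For example (purely illustrative), a header match could steer opted-in requests to the canary subset while everything else stays on stable:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        x-canary:               # hypothetical opt-in header
          exact: "true"
    route:
    - destination:
        host: reviews
        subset: canary
  - route:                      # default: everyone else hits stable
    - destination:
        host: reviews
        subset: stable
```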
I've been struggling with this for some time and I think I found a solution.
> Currently, according to https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/, Istio canary deployments require separate services for stable and canary. This is undesirable, as I understand in-mesh communications won't hit the virtualservice directly (there's no DNS entry added by default for those in the Kubernetes cluster).
To have in-mesh communications hit the virtual service, you need to include the reserved word `mesh` in `spec.gateways`, which is there by default if you didn't set any other gateway. It is true that a DNS entry is not added by default for a virtual service, but it looks like we can work around this by creating (yet another) service which is included in the hosts of the virtual service, so that in-mesh communications go to this service endpoint.
> This forces microservice-to-microservice communication to choose whether to hit the stable service or the canary (hardcoded), or go to the gateway, or DNS entries to be added for the virtualservices, which is cumbersome and poorly documented.
With the above workaround, when other microservices try to connect to the service, traffic is controlled by the virtual service, hence the canary works internally as well.

I think it would be clearer with an example. Considering the above scenario:
> Let me try to give an example:
>
> - microservice A is a public endpoint and has a gateway entry
> - microservice B is internal only, called by microservice A
> - microservice B should be upgraded via a canary method
> - because Argo Rollouts requires 2 separate services (stable and canary), and because microservice B isn't exposed through the gateway/virtualservice apparatus, microservice A has to hit the stable service for microservice B directly. This causes microservice B to basically never get any canary traffic.
For microservice B, we need to create:

- `canary-service-b`, `stable-service-b` and `service-b` services
- a virtual service whose `spec.gateways` includes `mesh` if there are other gateways set (e.g. microservice B also has public endpoints), otherwise leave `spec.gateways` empty, and whose `spec.hosts` includes `service-b`. I believe the name of this virtual service does not matter, but I tested by setting it as `service-b`.
- a rollout with `canaryService: canary-service-b` and `stableService: stable-service-b`

We can now hardcode the address of microservice B as `service-b` in microservice A. When it tries to connect, traffic flows through the virtual service (not directly over the k8s service), which is controlled by Argo Rollouts. A rough sketch of these manifests is below.
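A sketch of what this workaround might look like in manifests (illustrative only: the three Services are assumed to already exist and select microservice B's pods, the route name `primary` is an arbitrary choice, and the weights show the pre-canary state):

```yaml
# VirtualService that in-mesh callers reach via the extra "service-b" Service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  gateways:
  - mesh                         # reserved word: applies to traffic inside the mesh
  hosts:
  - service-b                    # the extra Service provides the in-cluster DNS name
  http:
  - name: primary
    route:
    - destination:
        host: stable-service-b
      weight: 100
    - destination:
        host: canary-service-b
      weight: 0                  # shifted by Argo Rollouts during the canary
---
# Relevant fragment of the Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service-b
spec:
  strategy:
    canary:
      canaryService: canary-service-b
      stableService: stable-service-b
      trafficRouting:
        istio:
          virtualService:
            name: service-b
            routes:
            - primary            # the named route whose weights are managed
```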
Still, I think this could just be a workaround for the current situation and canaries with multiple destination rule subsets should be the way to go (a.k.a new approach).
Thanks for the workaround @turkenh !
Regarding the previous comment about using the following query to get custom metrics on canary vs. stable:
So `version` is a special reserved word that you must apply as a label to your pod spec in order to get these values populated in metrics. From the docs (https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements):
So the query `istio_requests_total{destination_service="reviews.default.svc.cluster.local", destination_version="v3"}` would only work if somehow you could have different values for `version` on every update. It's not clear how this could easily be done (unless we were to build a feature for it in rollouts).
In other words, I still think that if you are only using a single service + 2 x destination rule subsets, and you are not using or changing the value of the `version` label on every update, then you won't be able to glean specific metrics on the canary vs. stable workloads.
It's probably doable by customizing metrics, but even after doing so, none of the default dashboards would support it out of the box. So I do think there is value in having multiple services, if you have the requirement to get canary-specific metrics.
I now wonder if a new rollout feature to inject `version: <rollout revision number>` on every update would help with this.
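For illustration only (this is not an existing rollouts feature): the idea amounts to stamping the pod template labels on each update, so Istio's standard metrics carry a distinct destination_version, e.g.:

```yaml
# Hypothetical fragment of a Rollout's pod template, with the version label
# bumped on every update (today this would have to be done manually or by CI).
spec:
  template:
    metadata:
      labels:
        app: reviews
        version: "3"             # e.g. the rollout revision number
```

With that in place, a query like the `istio_requests_total` example above could isolate the new revision's traffic without needing a second service.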