Closed: hexsel closed this issue 3 years ago
Currently, according to https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/, Istio canary deployments require separate services for stable and canary. This is undesirable, as I understand in-mesh communications won't hit the virtualservice directly (there's no DNS entry added by default for those in the Kubernetes cluster). This forces microservice-to-microservice communication to choose whether to hit the stable service or the canary (hardcoded), or go to the gateway, or DNS entries to be added for the virtualservices, which is cumbersome and poorly documented.

All Istio canary examples I could find use a single service, but have a destinationrule that splits them into 2 groups based on label selectors (for canary vs stable, or per version), then the virtualservice links this single service and splits the traffic accordingly.

Maybe a canary rollout should follow this principle? Or I may be misunderstanding this problem and this can be fixed with some more documentation?
@hexsel I'd like to understand your proposal a little better. Could you provide the alternative resulting DestinationRule / VirtualService that achieves the canary using that means?
OK, my memory is a bit hazy (and I also might be wrong about this), but I think one of the motivations for current design was that two separate services were able to produce istio prometheus metrics that were not available with a single DestinationRule and VirtualService. In other words, it would be possible to graph and compare the metrics of a canary vs. stable service, but not two DestinationRule subsets.
That said, I think it is possible to enhance the rollout istio canary implementation to reference a DestinationRule which would be updated by the controller with the canary/stable rollouts-pod-template-hash values.
Let me try to give an example:
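A minimal sketch of what that DestinationRule could look like (the resource name and hash values below are purely illustrative; the idea is that the controller would rewrite each subset's label as new ReplicaSets are created):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-destrule        # illustrative name
spec:
  host: reviews                 # the single Service shared by stable and canary
  subsets:
  - name: stable
    labels:
      rollouts-pod-template-hash: 6f69cc8f45   # kept in sync by the controller
  - name: canary
    labels:
      rollouts-pod-template-hash: 7d8b4f5c6d   # kept in sync by the controller
```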
I believe this is because there is no DNS entry for virtualservices, only for services. They should, according to my limited understanding, still route through the mesh, but now you have to forcefully pick the canary service or the production service.
But again, maybe I am misunderstanding some detail here?
The proposal would be for there to be only one service, and the virtualservice would configure which replicaset gets hit, like in this old Istio example: https://istio.io/latest/blog/2017/0.1-canary/
Argo Rollouts would then be responsible only for adjusting the weights in that VirtualService CRD.
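As a rough illustration (the subset names assume a DestinationRule like the sketch above, and the weights are just one possible mid-rollout split):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc            # illustrative name
spec:
  hosts:
  - reviews                     # the single Service
  http:
  - route:
    - destination:
        host: reviews
        subset: stable          # subsets defined in the DestinationRule
      weight: 90
    - destination:
        host: reviews
        subset: canary
      weight: 10                # the controller would shift this weight at each step
```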
Regarding the metrics issue you mentioned, I admit I do not know if it is possible, but I found this: https://istio.io/latest/docs/tasks/observability/metrics/querying-metrics/
Which seems to indicate you could query metrics for different labels on the same virtual service.
Thanks for providing the use case. I believe what you are saying checks out.
> Which seems to indicate you could query metrics for different labels on the same virtual service.
Ah. My information was based on what I saw in the default dashboard. But looks like what you are saying is possible.
I think I agree Rollouts should provide multiple options for performing canary:
@jessesuen we have the 503 issue when using the canary/blue-green strategy with Istio, as Istio doesn't expect two services to point to the same pod. When can we expect the DestinationRule support in argo-rollouts? Thank you
Does it make sense then to eliminate the first option and drop support for canary/stable SVCs and just use subsets for Istio traffic routing? Or is it actually possible to have a single Virtual Service reference multiple SVCs (was not able to get this working myself either)?
Using a single Service along with a Virtual Service and Destination Rule subsets also opens the door to easily add header based routing.
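For example (purely illustrative), a header match could steer opted-in requests to the canary subset while everything else stays on stable:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        x-canary:               # hypothetical opt-in header
          exact: "true"
    route:
    - destination:
        host: reviews
        subset: canary
  - route:                      # default: everyone else hits stable
    - destination:
        host: reviews
        subset: stable
```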
I've been struggling with this for some time and I think I found a solution.
> Currently, according to https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/, Istio canary deployments require separate services for stable and canary. This is undesirable, as I understand in-mesh communications won't hit the virtualservice directly (there's no DNS entry added by default for those in the Kubernetes cluster).
To have in-mesh communications hit the virtual service, you need to include the reserved word `mesh` in `spec.gateways`, which is there by default if you didn't set any other gateway. It is true that a DNS entry is not added by default for a virtual service, but it looks like we can work around this by creating (yet another) service which is included in the hosts of the virtual service, so that in-mesh communications go to this service endpoint.
> This forces microservice-to-microservice communication to choose whether to hit the stable service or the canary (hardcoded), or go to the gateway, or DNS entries to be added for the virtualservices, which is cumbersome and poorly documented.
With the above workaround, when other microservices try to connect to the service, traffic is controlled by the virtual service, hence the canary works internally as well.

I think it would be clearer with an example. Considering the above scenario:
> Let me try to give an example:
>
> - microservice A is a public endpoint and has a gateway entry
> - microservice B is internal only, called by microservice A
> - microservice B should be upgraded via a canary method
> - because Argo Rollouts requires 2 separate services (stable and canary), and because microservice B isn't exposed through the gateway/virtualservice apparatus, microservice A has to hit the stable service for microservice B directly. This causes microservice B to basically never get any canary traffic.
For microservice B, we need to create:

- `canary-service-b`, `stable-service-b` and `service-b` services
- a virtual service whose `spec.gateways` includes `mesh` if there are other gateways set (e.g. microservice B also has public endpoints), otherwise leave `spec.gateways` empty, and whose `spec.hosts` includes `service-b`. I believe the name of this virtual service does not matter, but I tested by setting it as `service-b`.
- a rollout with `canaryService: canary-service-b` and `stableService: stable-service-b`

We can now hardcode the address of microservice B as `service-b` in microservice A. When it tries to connect, traffic flows through the virtual service (not directly over the k8s service), which is controlled by Argo Rollouts. A rough sketch of these manifests is below.
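A sketch of what this workaround might look like in manifests (illustrative only: the three Services are assumed to already exist and select microservice B's pods, the route name `primary` is an arbitrary choice, and the weights show the pre-canary state):

```yaml
# VirtualService that in-mesh callers reach via the extra "service-b" Service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  gateways:
  - mesh                         # reserved word: applies to traffic inside the mesh
  hosts:
  - service-b                    # the extra Service provides the in-cluster DNS name
  http:
  - name: primary
    route:
    - destination:
        host: stable-service-b
      weight: 100
    - destination:
        host: canary-service-b
      weight: 0                  # shifted by Argo Rollouts during the canary
---
# Relevant fragment of the Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: service-b
spec:
  strategy:
    canary:
      canaryService: canary-service-b
      stableService: stable-service-b
      trafficRouting:
        istio:
          virtualService:
            name: service-b
            routes:
            - primary            # the named route whose weights are managed
```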
Still, I think this could just be a workaround for the current situation and canaries with multiple destination rule subsets should be the way to go (a.k.a new approach).
Thanks for the workaround @turkenh !
Regarding the previous comment about using the following query to get custom metrics on canary vs. stable:
So `version` is a special reserved word that you must apply as a label to your pod spec in order to get these values populated in metrics. From the docs (https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements):
So the query `istio_requests_total{destination_service="reviews.default.svc.cluster.local", destination_version="v3"}` would only work if somehow you could have different values for `version` on every update. It's not clear how this could easily be done (unless we were to build a feature for it in rollouts).
In other words, I still think that if you are only using a single service + 2 x destination rule subsets, and you are not using or changing the value of the `version` label on every update, then you won't be able to glean specific metrics on the canary vs. stable workloads.
It's probably doable by customizing metrics, but even after doing so, none of the default dashboards would support it out of the box. So I do think there is value in having multiple services, if you have the requirement to get canary-specific metrics.
I now wonder if a new rollout feature to inject `version: <rollout revision number>` on every update would help with this.
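For illustration only (this is not an existing rollouts feature): the idea amounts to stamping the pod template labels on each update, so Istio's standard metrics carry a distinct destination_version, e.g.:

```yaml
# Hypothetical fragment of a Rollout's pod template, with the version label
# bumped on every update (today this would have to be done manually or by CI).
spec:
  template:
    metadata:
      labels:
        app: reviews
        version: "3"             # e.g. the rollout revision number
```

With that in place, a query like the `istio_requests_total` example above could isolate the new revision's traffic without needing a second service.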