Canary upgrade without Ingress Controller/Service Mesh support

Describe the feature

At my company we have been evaluating several cloud-native "progressive delivery" technologies, and Flagger came out as the winner. There are many good things in Flagger, thanks to its clear architecture, which ensures it does not collide with GitOps reconciliation when CD is implemented with GitOps.

The only thing we are "missing" (compared to other solutions we evaluated, e.g. Argo Rollouts) is support for canary upgrades without any ingress controller or service mesh to steer "production" traffic between primary and canary pods in a controlled, weighted way.

We have some canary-capable microservices that are not exposed outside the k8s cluster (they support other internal microservices) and that are not part of any mesh, so it is not possible to run a canary upgrade on them with the Flagger controller.

In Argo Rollouts it is possible to handle a canary upgrade on these workloads by playing with the number of primary and canary pods: a single ClusterIP k8s Service selects all primary and canary pods, so traffic is distributed in a coarse-grained way according to the number of pods of each kind.

I wonder whether this use case has been discussed in the community (I did not find any existing GitHub issue on it), that is, some way to handle canary upgrades on workloads that are outside the service mesh and not externally exposed.

Proposed solution

#1157 clarifies that GitOps compliance requires an HPA in a Flagger-driven canary upgrade, which means Flagger should not touch the min/max fields of the primary HPA object, but it could touch those of the canary one, as it is hidden from the GitOps agent(s). A ClusterIP Service selecting both the canary and primary pods, plus the Flagger controller setting the number of replicas in the canary Deployment (possible because the canary objects are shadow objects hidden from the GitOps controllers), could be used to control how many pods there are of each kind. That way the Flagger controller would set, in a coarse-grained way, what percentage of production traffic is handled by primary and canary pods.
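To make the replica arithmetic concrete, here is a minimal sketch of how a desired canary weight could be turned into a canary replica count. It assumes the ClusterIP Service spreads requests roughly evenly over all ready pods, so the canary share is about c / (p + c) for p primary and c canary pods. The function names are hypothetical and are not part of Flagger's API.

```python
def canary_replicas_for_weight(primary_replicas: int, weight_percent: int) -> int:
    """Canary replica count that best approximates the requested traffic share.

    Assumption: traffic is spread evenly across pods, so
    c / (p + c) = w / 100, which gives c = p * w / (100 - w).
    """
    if primary_replicas < 1:
        raise ValueError("primary_replicas must be at least 1")
    if not 0 <= weight_percent < 100:
        raise ValueError("weight_percent must be in [0, 100)")
    if weight_percent == 0:
        return 0
    rest = 100 - weight_percent
    # Integer round-half-up of p * w / (100 - w), avoiding float rounding surprises.
    rounded = (2 * primary_replicas * weight_percent + rest) // (2 * rest)
    # Any non-zero weight needs at least one canary pod to receive traffic.
    return max(1, rounded)


def effective_weight(primary_replicas: int, canary_replicas: int) -> float:
    """Traffic share (in percent) the canary actually gets under the even-spread assumption."""
    total = primary_replicas + canary_replicas
    return 100.0 * canary_replicas / total if total else 0.0
```

For example, with 9 primary pods a 10% step needs 1 canary pod, while with only 3 primary pods the smallest possible step is already 25% (1 canary pod out of 4), which is what "coarse grained" means here.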

The Flagger controller would also need to constantly monitor the /scale subresource of the primary Deployment object: the primary HPA object would not be under its control, so the controller would have to keep watching the primary replica count and adjust the number of replicas in the canary Deployment to get the desired percentage of canary traffic in each canary-analysis step.
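The monitoring idea could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, reading the /scale subresource is replaced by a plain list of observed primary replica counts, and the even-spread assumption for the ClusterIP Service still applies.

```python
from typing import List, Optional


class CanaryScaler:
    """Keeps the canary replica count in line with the primary's observed scale."""

    def __init__(self, weight_percent: int) -> None:
        if not 0 <= weight_percent < 100:
            raise ValueError("weight_percent must be in [0, 100)")
        self.weight_percent = weight_percent  # target share for the current analysis step
        self.canary_replicas = 0              # last value written to the canary Deployment

    def desired(self, primary_replicas: int) -> int:
        """Replica count that approximates the target share: c = p * w / (100 - w)."""
        if self.weight_percent == 0:
            return 0
        rest = 100 - self.weight_percent
        rounded = (2 * primary_replicas * self.weight_percent + rest) // (2 * rest)
        return max(1, rounded)

    def reconcile(self, primary_replicas: int) -> Optional[int]:
        """Return the new canary replica count, or None if no change is needed."""
        want = self.desired(primary_replicas)
        if want == self.canary_replicas:
            return None
        self.canary_replicas = want
        return want


# Simulated observations of the primary replica count while the HPA scales it.
scaler = CanaryScaler(weight_percent=20)
observed: List[int] = [4, 4, 8, 12, 8]
patches = [scaler.reconcile(p) for p in observed]
print(patches)  # [1, None, 2, 3, 2]
```

Returning None when nothing changes keeps the controller from writing to the canary Deployment on every poll; only a change in the primary's scale (or a new analysis step with a different weight) would trigger an update.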

Could this be a feasible solution for this (possible) new feature in Flagger? Are there any other proposals or better solutions for it?

fluxcd / flagger

Canary upgrade without Ingress Controller/Service Mesh support #1226
