cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes

CF does not rollout sidecar updates to internal workloads #663

Closed braunsonm closed 3 years ago

braunsonm commented 3 years ago

Describe the bug

See this architecture decision for context: https://github.com/cloudfoundry/cf-k8s-networking/blob/35f3ed39e26293aef16a5afae36d7a5271c13c77/doc/architecture-decisions/0016-job-for-upgrading-istio-sidecars-on-workloads.md

Currently, when a new Istio version is released, the control plane is updated first and then all cf-workloads pods are restarted with the new sidecar.

This does not apply to cf-system and cf-blobstore workloads. Istio only guarantees that sidecars are forward compatible with the control plane for one minor revision. If a component is not restarted across CF updates, it can end up running a sidecar older than what the control plane supports.

This has recently happened to all operators in the move from v3.0.0 -> v4.0.0 -> v4.1.0.

So far this hasn't caused any issues, but it could in the future, and running outdated sidecars is not considered good practice.

You can also see in the Istio docs that the last step of the in-place upgrade process instructs you to restart all data plane workloads: https://istio.io/latest/docs/setup/upgrade/in-place/
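
For operators hitting this today, a minimal sketch of what a manual re-roll of the remaining data-plane workloads could look like (namespaces taken from a default cf-for-k8s install; the exact set of resources worth restarting is an assumption):

```sh
# Hypothetical manual re-roll after an Istio control plane upgrade.
# cf-workloads is already handled by the existing upgrade job, so only the
# system namespaces are restarted here.
for ns in cf-system cf-blobstore; do
  kubectl rollout restart deployment --namespace "$ns"
  kubectl rollout restart statefulset --namespace "$ns"
done

# Block until the restarted deployments are healthy again.
for ns in cf-system cf-blobstore; do
  for deploy in $(kubectl get deployments --namespace "$ns" -o name); do
    kubectl rollout status "$deploy" --namespace "$ns" --timeout=15m
  done
done
```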

To Reproduce

Steps to reproduce the behavior:

  1. Deploy v3
  2. Upgrade to v4.0.0
  3. Upgrade to v4.1.0
  4. Notice that cf-blobstore is now running a sidecar (v1.7.3) that is no longer supported by the Istio control plane (a quick way to check is sketched below).
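
For reference, one hedged way to confirm the version drift (assuming `istioctl` is available and matches the installed control plane):

```sh
# Compare each sidecar's proxy version against the control plane version.
istioctl proxy-status

# Or inspect the istio-proxy image tag on the cf-blobstore pods directly.
kubectl get pods --namespace cf-blobstore \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}'
```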

Expected behavior

All CF components should be restarted after an Istio control plane update to maintain compatibility.

The job for this should be tweaked to roll out the update to more than just cf-workloads and fluentd: https://github.com/cloudfoundry/cf-k8s-networking/blob/35f3ed39e26293aef16a5afae36d7a5271c13c77/ci/dockerfiles/upgrade/roll.sh#L15
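
For illustration only, a sketch of what widening that re-roll could look like, assuming the script keeps using `kubectl rollout restart` (the namespace list and resource kinds here are assumptions, not the actual contents of roll.sh):

```sh
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical extension: restart every sidecar-injected CF namespace instead
# of only cf-workloads and the fluentd daemonset.
namespaces=(cf-workloads cf-system cf-blobstore)

for ns in "${namespaces[@]}"; do
  for kind in deployment statefulset daemonset; do
    # Tolerate namespaces that have none of a given kind.
    kubectl rollout restart "$kind" --namespace "$ns" || true
  done
done
```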

I would be curious to hear from the networking team about the rationale for only restarting fluentd.

Additional context

cf-for-k8s SHA

v4.1.0

Cluster information

AKS

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177985728

The labels on this github issue will be updated when the story is started.

acosta11 commented 3 years ago

Hi @braunsonm ,

Thanks for the issue with the relevant networking repo context. For visibility, we have a few stories in the backlog to investigate solutions to potentially undesirable upgrade deployment behavior. See https://www.pivotaltracker.com/story/show/177879772 and https://www.pivotaltracker.com/story/show/177914742 .

Regarding the current fluentd focus, see the Istio 1.8 upgrade story for further details (https://www.pivotaltracker.com/story/show/176916441), but the tl;dr is that fluentd did in fact break during the upgrade, prompting another look at the more general system control plane upgrade strategy. Given the significant deployment-time cost of re-rolling every container on upgrade, we opted to limit the re-roll to fluentd for now.

Given the open questions around this new problem, a more manual mitigation may be useful in the near term. Do you think it would be sufficient to update the upgrade documentation to include a manual system control plane re-roll, in line with the Istio documentation?

Thanks, Andrew and @matt-royal

braunsonm commented 3 years ago

Thanks for the context @acosta11

Documenting sounds like a good short-term solution; that is what I have done internally as well. In addition to documenting that cf-system should get its sidecars upgraded, it might be a good idea to document how to adjust the kapp timeouts for those not familiar with them. The default is 15 minutes, and in our ~10-node cluster that is not enough time to fully drain and restart all of the ingress gateways we run.
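
As a concrete example of the timeout adjustment (the app name and rendered manifest below are placeholders; use whatever your existing deploy pipeline already passes to kapp):

```sh
# kapp's default wait timeout is 15 minutes; raise it so slow ingress gateway
# drains on larger clusters don't fail the deploy.
kapp deploy -a cf \
  -f cf-for-k8s-rendered.yml \
  --wait-timeout 30m
```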

I think longer term it would be good for CF to have a discussion about compartmentalizing the cf-for-k8s deployment to optionally allow a "bring your own" networking stack. We already customize the CF-provided networking stack quite a bit: to enable container-to-container networking, to add security features around AuthorizationPolicy resources (automated OAuth flows for traffic going to CF workloads), and to extend our cluster with better observability. Additional context on that is here: https://github.com/cloudfoundry/cf-for-k8s/issues/623

bkrannich commented 3 years ago

@braunsonm:

I think longer term it would be good for CF to have a discussion about compartmentalizing the CF-for-k8s deployment to optionally allow a "bring your own" networking stack.

I think that's very much in line with what we have been distributing via the CF mailing list recently in terms of a vision for CF: https://docs.google.com/document/d/1rG814raI5UfGUsF_Ycrr8hKQMo1RH9TRMxuvkgHSdLg/edit

davewalter commented 3 years ago

Please see #665 for our proposed short-term, Istio-centric solution.

Birdrock commented 3 years ago

This should be fixed with the v4.2.0 release. Please re-open if it is still an issue.