Closed: braunsonm closed this issue 3 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/177985728
The labels on this github issue will be updated when the story is started.
Hi @braunsonm ,
Thanks for the issue with the relevant networking repo context. For visibility, we have a few stories in the backlog to investigate solutions to potentially undesirable upgrade deployment behavior. See https://www.pivotaltracker.com/story/show/177879772 and https://www.pivotaltracker.com/story/show/177914742 .
Regarding the current fluentd focus, see the Istio 1.8 upgrade story for further details https://www.pivotaltracker.com/story/show/176916441, but the TL;DR is that fluentd did in fact break during the upgrade, prompting another look at the more general system control plane upgrade strategy. Given the significant deployment-time cost of re-rolling every container on upgrade, however, we opted to limit the re-roll to fluentd for now.
Given the open questions around this new problem, a more manual mitigation may be useful in the near term. Do you think it would be sufficient to update the upgrade documentation to include a manual system control plane re-roll, in line with the Istio documentation?
Thanks, Andrew and @matt-royal
Thanks for the context @acosta11
Documenting sounds like a good short term solution. That is what I have done internally as well. In addition to documenting that cf-system should get the sidecars upgraded, it might be a good idea to document how to adjust the kapp timeouts for those not familiar. The default is 15 minutes. In our ~10 node cluster that is not enough time to fully drain and restart all the ingress gateways that run.
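For readers unfamiliar with adjusting the timeout mentioned above, a minimal sketch follows. It assumes you deploy with `kapp deploy` directly, as in the cf-for-k8s install docs; the app name `cf` and the ytt file paths are taken from those docs and may differ in your setup. `--wait-timeout` is kapp's flag for the overall wait (default 15m):

```shell
# Sketch: raise kapp's wait timeout (default 15m) so a full ingress-gateway
# drain/restart on a larger cluster can complete before kapp gives up.
# App name "cf" and file paths follow the cf-for-k8s install docs; adjust
# them for your environment.
kapp deploy -a cf \
  -f <(ytt -f config/ -f /tmp/cf-values.yml) \
  --wait-timeout 30m
```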
I think longer term it would be good for CF to have a discussion about compartmentalizing the cf-for-k8s deployment to optionally allow a "bring your own" networking stack. We already customize the CF-provided networking stack quite a bit: container-to-container networking, additional security features around `AuthorizationPolicy`s (automated OAuth flows for traffic going to CF workloads), and extensions for better cluster observability. Additional context on that here: https://github.com/cloudfoundry/cf-for-k8s/issues/623
@braunsonm:
I think longer term it would be good for CF to have a discussion about compartmentalizing the CF-for-k8s deployment to optionally allow a "bring your own" networking stack.
I think that's very much in line with what we have been distributing via the CF mailing list recently in terms of a vision for CF: https://docs.google.com/document/d/1rG814raI5UfGUsF_Ycrr8hKQMo1RH9TRMxuvkgHSdLg/edit
Please see #665 for our proposed short-term, Istio-centric solution.
Should be fixed with v4.2.0 release. Please re-open if it is still an issue.
Describe the bug
See this architecture decision for context: https://github.com/cloudfoundry/cf-k8s-networking/blob/35f3ed39e26293aef16a5afae36d7a5271c13c77/doc/architecture-decisions/0016-job-for-upgrading-istio-sidecars-on-workloads.md
Currently, when a new Istio version is released, the control plane is updated first, then all `cf-workloads` pods are restarted with the new sidecar. This does not apply to `cf-system` and `cf-blobstore` workloads. Istio only guarantees that sidecars will be forward compatible with one minor revision of the control plane. If a component's sidecar is not updated between CF releases, it could end up running a version older than the control plane supports. This recently happened to all operators in the move from `v3.0.0` -> `v4.0.0` -> `v4.1.0`. So far this hasn't caused any issues, but it could in the future, and running outdated sidecars is not considered good practice.
You can also verify in the istio docs their last step of the upgrade process instructs you to restart all data plane workloads: https://istio.io/latest/docs/setup/upgrade/in-place/
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All CF components should be restarted after a control plane update to istio for compatibility.
The job for this should be tweaked to roll out the update to more than just cf-workloads and fluentd: https://github.com/cloudfoundry/cf-k8s-networking/blob/35f3ed39e26293aef16a5afae36d7a5271c13c77/ci/dockerfiles/upgrade/roll.sh#L15
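A manual re-roll along the lines the Istio in-place upgrade docs describe could look like the following sketch. The namespace names are the ones discussed in this issue; `kubectl rollout restart` with no resource name restarts every resource of that kind in the namespace:

```shell
#!/bin/bash
# Sketch: manually re-roll system workloads after an Istio control plane
# upgrade so their sidecars match the new control plane version.
# Namespaces are the ones named in this issue; adjust for your install.
set -euo pipefail

for ns in cf-system cf-blobstore; do
  kubectl -n "$ns" rollout restart deployment
  kubectl -n "$ns" rollout restart statefulset
  # Wait for each deployment to settle before moving on, mirroring the
  # kapp timeout discussion above.
  for deploy in $(kubectl -n "$ns" get deployments -o name); do
    kubectl -n "$ns" rollout status "$deploy" --timeout=15m
  done
done
```

Restarting namespace by namespace (rather than everything at once) keeps the drain load bounded, which matters on larger clusters where the full re-roll already pushes against deploy timeouts.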
I would be curious to hear from the networking team about the rationale for only restarting fluentd.
Additional context
cf-for-k8s SHA: v4.1.0
Cluster information: AKS