Suggestions to improve the `restart-workloads-for-istio` Job during upgrades

tcdowney commented 3 years ago

Is your feature request related to a problem? Please describe.

In our cf-for-k8s environment we use kbld to point image references at an internal Harbor container registry. Since we use kapp to deploy CF, during upgrades it sees these image swaps as a change to the restart-workloads-for-istio Job and this causes issues. Since the Job is supposed to be immutable, we end up seeing errors like:

kapp: Error: Applying update job/restart-workloads-for-istio1-7-3 (batch/v1) namespace: cf-workloads:
  Updating resource job/restart-workloads-for-istio1-7-3 (batch/v1) namespace: cf-workloads:
    Job.batch "restart-workloads-for-istio1-7-3" is invalid: 

  - spec.selector: Required value

  - spec.template.metadata.labels: Invalid value: map[string]string{"kapp.k14s.io/app":"1611270931769226278", "kapp.k14s.io/association":"v1.e5b5a9e57bd2af1378eb74fe2a883535"}: `selector` does not match template `labels`

  - spec.selector: Invalid value: "null": field is immutable

  ...

 (reason: Invalid)

Describe the solution you'd like

I believe the Job is idempotent and that it won't roll workloads that are already using the correct version of the Istio sidecar proxy, but we should double check this (if it's not, it should be made to be).

Assuming that's the case, we could potentially use the same update strategy as the ccdb-migrate job (or maybe even always-replace) and essentially have it re-run on every deploy (see these kapp docs for more details).

I think this could be useful since it will ensure that pods eventually have the correct sidecar version if something goes wrong the first time the Job runs (some error or potential race condition for apps pushed during the upgrade) and would work around this issue with kapp.

We could also set the ttlSecondsAfterFinished property on the Job so that it removes itself sometime afterwards for cleanliness.

Describe alternatives you've considered

For this specific issue you can also work around it by applying the kapp.k14s.io/update-strategy: "skip" annotation to the Job so that kapp won't try to update it. That's what we did in our environment with a ytt overlay, but I think there are benefits to the other strategies.

Additional context

The implementation of this Job lives here: https://github.com/cloudfoundry/cf-k8s-networking/blob/develop/ci/dockerfiles/upgrade/roll.sh#L5

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176772648

The labels on this github issue will be updated when the story is started.

heycait commented 3 years ago

Closing this as it was completed in the commit linked above

cloudfoundry / cf-k8s-networking