Azure / msi-acrpull

Kubernetes controller that allows using Managed Service Identities (MSI) to pull images from ACR.
MIT License
24 stars 20 forks source link

controller: ensure pull secret names are valid #68

Closed stevekuznetsov closed 1 week ago

stevekuznetsov commented 2 months ago

controller: ensure pull secret names are valid

The previous logic for pull secret names was a simple string concatenation with the binding name. In the abstract, this may fail to generate a valid Kubernetes identifier when the binding's name is long. Since the binding's name is entirely user input, we need a different approach to ensure the controller is resilient. Simply taking a prefix will not suffice as it opens us up to collisions as well, so we hash the input. Taking base32(sha224(input))[:10], factoring in the 25% inefficiency with the encoding, we get around 50 bits of the hash, so our collision chance is around 1e-12 with O(100) bindings in one namespace. This gives us an excellent collision resistance given the known production usage patterns we have for this component.

While we're changing the name of the pull secret, we may as well make other breaking changes. Namely, we label the secrets with the binding that led to their creation so we can filter our informers and reduce the memory footprint of the controller.

Given that we're making a backwards incompatible change, we must now run a second controller on the upgrade path to smooth over the change. This controller will remove previous pull credentials when we've generated new ones for a binding, and exit the process when it's done. On startup, if we can detect that no legacy secrets exist, we filter our informers and don't start the second controller.

As these changes were fairly invasive, I also took the time to refactor how the main controller was structured to allow us to test its behavior. Previously, there were no tests for the reconcile function. In this iteration, I factored lower-case reconcile() to be a pure-ish function of input cluster state to output cluster state, which allows for simple unit tests, which now take O(2ms) as opposed to the previous O(20s).

I also enforced that the reconciliation loop in the controller only take one mutating action against the Kubernetes API server per pass. In general, every reconciliation loop must be written in such a way that it is robust to crashes and service disruptions between any two such calls, so factoring the loop to only ever take one is a simple way to ensure we are robust in the face of failures. Notably, this transformation found a couple of edge cases for error paths where in-memory data was necessary to correctly recover from a crash in the middle of a multi-call reconcilation, like when updating the service account failed after the secret was generated but before the binding was updated to reflect it.

We also will now clean up previously-created pull credential secrets during the finalizer phase.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


internal/controller: clean up service account bindings

It doesn't seem like we handled this case in the past, but if a user changes which service account the pull credential should be bound to, in addition to updating the new service account to reference the image pull credential, we need to make sure any previous service accounts no longer have the reference.

Signed-off-by: Steve Kuznetsov skuznets@redhat.com


codecov-commenter commented 2 months ago

Codecov Report

Attention: Patch coverage is 34.98452% with 210 lines in your changes missing coverage. Please review.

Project coverage is 44.35%. Comparing base (b04b1dc) to head (01e507b). Report is 25 commits behind head on main.

Files Patch % Lines
internal/controller/acrpullbinding_controller.go 50.76% 90 Missing and 7 partials :warning:
...rnal/controller/legacy_token_cleanup_controller.go 13.04% 79 Missing and 1 partial :warning:
cmd/main.go 0.00% 32 Missing :warning:
pkg/authorizer/types/accesstoken.go 50.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #68 +/- ## ========================================== - Coverage 51.86% 44.35% -7.51% ========================================== Files 7 12 +5 Lines 349 656 +307 ========================================== + Hits 181 291 +110 - Misses 138 332 +194 - Partials 30 33 +3 ``` | [Flag](https://app.codecov.io/gh/Azure/msi-acrpull/pull/68/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Azure) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/Azure/msi-acrpull/pull/68/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Azure) | `44.35% <34.98%> (-7.51%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Azure#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

stevekuznetsov commented 2 months ago

Fixes https://github.com/Azure/msi-acrpull/issues/20

stevekuznetsov commented 3 weeks ago

Will fix lame linter nits tomorrow.