IBM / portieris

A Kubernetes Admission Controller for verifying image trust.
Apache License 2.0

Portieris fails because it does not reload the new certificate rotated by cert-manager #463

Open pre opened 3 months ago

pre commented 3 months ago

When cert-manager rotates the certificate, the new certificate is not loaded by Portieris.

As a result, Portieris keeps using the old certificate and eventually fails with "remote error: tls: bad certificate".

Portieris v0.13.12 is installed via Helm chart with UseCertManager: true in values.yaml.

Logs

To debug the issue, I switched the mutation webhook to failurePolicy: Ignore and tried recreating the Pods. The logs below are about that:

  1. When two Portieris replicas are recreated, they work. If one of them becomes the new leader, Portieris successfully admits the image requests.
  2. After switching back to failurePolicy: Fail and terminating these two functional Pods, the old Pod becomes the leader.
  3. Once the old Pod is the leader, it again fails with "remote error: tls: bad certificate".

So far, the only way to recover has been to temporarily disable the admission webhook and then recreate the Portieris Pods.

cert-manager

❯ kl -n cert-manager -l app=cert-manager
Defaulted container "cert-manager-controller" out of: cert-manager-controller, install-oneagent (init)
Defaulted container "cert-manager-controller" out of: cert-manager-controller, install-oneagent (init)
I0802 03:32:24.637692       1 reflector.go:351] Caches populated for *v1.ClusterIssuer from k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
I0802 03:32:24.757350       1 reflector.go:351] Caches populated for *v1.Challenge from k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
I0802 05:09:51.001795       1 trigger_controller.go:215] "Certificate must be re-issued" logger="cert-manager.certificates-trigger" key="portieris/portieris-certs" reason="Renewing" message="Renewing certificate as renewal was scheduled at 2024-08-02 05:09:51 +0000 UTC"
I0802 05:09:51.001822       1 conditions.go:203] Setting lastTransitionTime for Certificate "portieris-certs" condition "Issuing" to 2024-08-02 05:09:51.001816753 +0000 UTC m=+11379.283695354
I0802 05:09:51.437471       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:51.762935       1 conditions.go:263] Setting lastTransitionTime for CertificateRequest "portieris-certs-3" condition "Approved" to 2024-08-02 05:09:51.762923817 +0000 UTC m=+11380.044802426
I0802 05:09:52.007428       1 conditions.go:263] Setting lastTransitionTime for CertificateRequest "portieris-certs-3" condition "Ready" to 2024-08-02 05:09:52.007415836 +0000 UTC m=+11380.289294441
I0802 05:09:52.313919       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-readiness" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:52.416001       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"
I0802 05:09:52.430845       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-readiness" key="portieris/portieris-certs" error="Operation cannot be fulfilled on certificates.cert-manager.io \"portieris-certs\": the object has been modified; please apply your changes to the latest version and try again"

portieris

❯ kg pod
NAME                         READY   STATUS    RESTARTS   AGE
portieris-86cf58bdbb-8gh2l   1/1     Running   0          10h
portieris-86cf58bdbb-pnw46   1/1     Running   0          2d22h
portieris-86cf58bdbb-sjpqh   1/1     Running   0          10h
❯ kl portieris-86cf58bdbb-sjpqh
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 02:00:55.313589       1 main.go:66] Starting portieris v0.13.12
I0802 02:00:55.313915       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 02:00:55.314302       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 02:00:55.328370       1 webhook.go:129] Starting policy Webhook on port 8000...
2024/08/02 02:01:50 http: TLS handshake error from 100.96.2.3:47582: read tcp 100.96.1.150:8000->100.96.2.3:47582: read: connection reset by peer
I0802 02:03:48.716874       1 controller.go:64] Processing admission request for CREATE on
I0802 02:03:52.929302       1 controller.go:64] Processing admission request for UPDATE on drain-nodes-28709400-jlkb2
I0802 02:04:27.797460       1 controller.go:64] Processing admission request for CREATE on
I0802 02:04:27.897402       1 controller.go:64] Processing admission request for CREATE on
2024/08/02 02:04:27 http: TLS handshake error from 100.96.2.3:34258: EOF
I0802 02:04:28.155166       1 controller.go:64] Processing admission request for CREATE on
I0802 02:09:43.652576       1 controller.go:64] Processing admission request for UPDATE on frontend-web-68f7685479
I0802 03:00:32.335708       1 controller.go:64] Processing admission request for CREATE on
I0802 04:03:14.259740       1 controller.go:64] Processing admission request for UPDATE on frontend-web
I0802 04:03:14.260296       1 controller.go:176] Getting policy for container image: ourregistry.example.com/our-frontend:git-559036cdfbf19e50f4fd0a6aa5d0ec792c51af70   namespace: frontend-pr-1310
E0802 04:03:14.464155       1 secret.go:68] Error: secrets "default-registry-credentials" not found
E0802 04:03:14.464820       1 controller.go:253] secrets "default-registry-credentials" not found
I0802 04:03:14.464837       1 controller.go:145] Allow for images:  [ourregistry.example.com/our-frontend:git-559036cdfbf19e50f4fd0a6aa5d0ec792c51af70]
I0802 04:12:18.072189       1 controller.go:64] Processing admission request for UPDATE on frontend-web
I0802 04:12:18.074693       1 controller.go:176] Getting policy for container image: ourregistry.example.com/our-frontend:git-b9b4fc632a100839c1f038bba85c859ff6441940   namespace: frontend
I0802 04:12:18.281484       1 controller.go:261] ImagePullSecret frontend/default-registry-credentials found
I0802 04:12:18.281618       1 controller.go:145] Allow for images:  [ourregistry.example.com/our-frontend:git-b9b4fc632a100839c1f038bba85c859ff6441940]
I0802 04:12:18.322135       1 controller.go:64] Processing admission request for CREATE on frontend-web-757968dc7f
I0802 04:12:18.828037       1 controller.go:64] Processing admission request for CREATE on
I0802 04:12:42.601730       1 controller.go:64] Processing admission request for UPDATE on frontend-web-6d6c57748f
2024/08/02 06:34:02 http: TLS handshake error from 100.96.2.3:40672: remote error: tls: bad certificate

❯ kl portieris-86cf58bdbb-8gh2l
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 02:00:17.457744       1 main.go:66] Starting portieris v0.13.12
I0802 02:00:17.458043       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 02:00:17.458598       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 02:00:17.474531       1 webhook.go:129] Starting policy Webhook on port 8000...
2024/08/02 06:40:34 http: TLS handshake error from 100.96.2.3:32860: remote error: tls: bad certificate
2024/08/02 06:41:48 http: TLS handshake error from 100.96.2.3:55028: remote error: tls: bad certificate
[..]
2024/08/02 12:07:56 http: TLS handshake error from 100.96.2.3:49854: remote error: tls: bad certificate
2024/08/02 12:08:41 http: TLS handshake error from 100.96.2.3:53730: remote error: tls: bad certificate
2024/08/02 12:08:42 http: TLS handshake error from 100.96.2.3:53742: remote error: tls: bad certificate

[.. At 12:08:44 another Pod, portieris-86cf58bdbb-q8n5z, was recreated. The entry below is the last failure
    before requests switched to the newly created portieris-86cf58bdbb-q8n5z, which processed them successfully]

2024/08/02 12:09:51 http: TLS handshake error from 100.96.2.3:48708: remote error: tls: bad certificate

❯ kl portieris-86cf58bdbb-q8n5z
Defaulted container "portieris" out of: portieris, install-oneagent (init)
I0802 12:08:44.864114       1 main.go:66] Starting portieris v0.13.12
I0802 12:08:44.864384       1 kube.go:57] No --kubeconfig flag found and KUBECONFIG env variable is NOT set, defaulting to in-cluster kube client config
I0802 12:08:44.865084       1 main.go:76] CA not provided at /etc/certs/ca.pem, will use default system pool
I0802 12:08:44.879743       1 webhook.go:129] Starting policy Webhook on port 8000...
I0802 12:09:19.017647       1 controller.go:64] Processing admission request for UPDATE on portieris-86cf58bdbb-s2r5n

[.. the request failed 5 seconds earlier on the old Pod portieris-86cf58bdbb-8gh2l but succeeds now]

I0802 12:09:56.402993       1 controller.go:64] Processing admission request for UPDATE on redis-master-79c8964f6c-nx4j4
I0802 12:09:56.555898       1 controller.go:64] Processing admission request for CREATE on

❯ kg pod
NAME                         READY   STATUS    RESTARTS   AGE
portieris-86cf58bdbb-8gh2l   1/1     Running   0          10h
portieris-86cf58bdbb-flnhz   1/1     Running   0          6m10s
portieris-86cf58bdbb-q8n5z   1/1     Running   0          6m10s

❯ k delete pod portieris-86cf58bdbb-q8n5z &
> k delete pod portieris-86cf58bdbb-flnhz &

Deleting the two recently created, functional Pods causes new image admission requests to go to the old Pod portieris-86cf58bdbb-8gh2l.

The old Pod still fails with "remote error: tls: bad certificate".

Certificates

❯ kg certificate
NAME              READY   SECRET            AGE
portieris-certs   True    portieris-certs   120d

❯ kg secret
NAME              TYPE                DATA   AGE
portieris-certs   kubernetes.io/tls   3      120d

Portieris' deployment has:

    volumeMounts:
    - mountPath: /etc/certs
      name: portieris-certs
      readOnly: true

Error

 failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": failed to call webhook: Post "https://portieris.portieris.svc:443/admit?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
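
The "certificate signed by unknown authority" text is the message of Go's `x509.UnknownAuthorityError`, returned when a presented certificate does not chain to any trusted root. A minimal sketch that reproduces it with two throwaway self-signed certificates, one standing in for the pre-rotation certificate and one for the post-rotation trust anchor. The helper names (`newSelfSigned`, `verifyAgainst`, `isUnknownAuthority`) are illustrative, and which certificate and CA bundle were actually involved in the cluster is an assumption that the logs above do not confirm.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newSelfSigned returns a throwaway self-signed certificate. Each call uses a
// fresh key, so two results never validate against each other.
func newSelfSigned(commonName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, fmt.Errorf("generating key: %w", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		IsCA:                  true,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainst checks cert against a pool that trusts only the given certificate.
func verifyAgainst(cert, trusted *x509.Certificate) error {
	roots := x509.NewCertPool()
	roots.AddCert(trusted)
	_, err := cert.Verify(x509.VerifyOptions{Roots: roots})
	return err
}

// isUnknownAuthority reports whether err is x509's unknown-authority error.
func isUnknownAuthority(err error) bool {
	var unknown x509.UnknownAuthorityError
	return errors.As(err, &unknown)
}

func main() {
	oldCert, err := newSelfSigned("portieris.portieris.svc")
	if err != nil {
		panic(err)
	}
	newCert, err := newSelfSigned("portieris.portieris.svc")
	if err != nil {
		panic(err)
	}
	// The caller trusts only the new certificate, as after a rotation.
	fmt.Println("new cert, new trust:", verifyAgainst(newCert, newCert))
	fmt.Println("old cert, new trust:", verifyAgainst(oldCert, newCert))
}
```

The second check fails with an unknown-authority error even though both certificates carry the same subject name, which is consistent with an old Pod still presenting a certificate that the caller no longer trusts.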

pre commented 1 month ago

A workaround with Portieris:

I am uncomfortable with the added complexity: both stakater/reloader and Portieris have to be operational to keep the cluster from locking down, all because of a bug in Portieris that does not seem to be getting fixed.

Possible alternatives for Portieris