alexandrevilain / temporal-operator

Temporal Kubernetes Operator
https://temporal-operator.pages.dev/
Apache License 2.0

Rotated certificates cause all services to crashloop / fail to refresh #708

Open ElanHasson opened 7 months ago

ElanHasson commented 7 months ago

I'm seeing an issue where, once the certs are rotated, everything basically crashloops or fails requests due to expired certs. I have to delete the cert secrets and also the deployments to let the operator reconcile things properly.

I saw the same in #509 for CA certs, but it happens for the other certs too.

alexandrevilain commented 7 months ago

Hi @ElanHasson !

Thanks for reporting this issue. Do you have any steps to reproduce it? Manifests, operator version, etc.

ElanHasson commented 7 months ago

Here you go @alexandrevilain

I have the durations set very high.

Set them low, then watch them expire and you'll see the frontend fail, along with the other services.

Also, while they are still valid and short, increase the duration to something high. The expectation is that new certificates would be created with the new duration and rotated in by the reconciler immediately.

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal
spec:
  admintools:
    enabled: true
    image: temporalio/admin-tools
    resources: {}
  authorization:
    # Uncomment once claims are setup in keycloak
    # authorizer: default
    # claimMapper: default
    jwtKeyProvider:
      keySourceURIs:
        - CHANGE_ME
      refreshInterval: 30m0s
    permissionsClaimName: permissions
  image: temporalio/server
  jobResources: {}
  jobTtlSecondsAfterFinished: 300
  log:
    development: false
    format: json
    level: info
    outputFile: ""
    stdout: true
  mTLS:
    certificatesDuration:
      clientCertificates: 17520h0m0s
      frontendCertificate: 17520h0m0s
      intermediateCAsCertificates: 17520h0m0s
      internodeCertificate: 17520h0m0s
      rootCACertificate: 27520h0m0s
    frontend:
      enabled: true
    internode:
      enabled: true
    provider: cert-manager
    refreshInterval: 72h0m0s # I do not think these are being considered properly.
    renewBefore: 1h10m0s # This too
  metrics:
    enabled: true
    prometheus:
      listenAddress: ""
      listenPort: 9090
      scrapeConfig:
        annotations: false
        serviceMonitor:
          enabled: true
  numHistoryShards: 1
  persistence:
    defaultStore:
      name: default
      passwordSecretRef:
        key: password
        name: temporal-postgres-app
      skipCreate: false
      sql:
        connectAddr: temporal-postgres-rw:5432
        connectProtocol: tcp
        databaseName: temporal
        maxConnLifetime: 0s
        maxConns: 0
        maxIdleConns: 0
        pluginName: postgres
        taskScanPartitions: 0
        user: temporal
    visibilityStore:
      name: visibility
      passwordSecretRef:
        key: password
        name: temporal-postgres-app
      skipCreate: false
      sql:
        connectAddr: temporal-postgres-rw:5432
        connectProtocol: tcp
        databaseName: temporal_visibility
        maxConnLifetime: 0s
        maxConns: 0
        maxIdleConns: 0
        pluginName: postgres
        taskScanPartitions: 0
        user: temporal
  services:
    frontend:
      overrides:
        deployment:
          spec:
            template:
              spec:
                containers:
                  - name: service
                    args: ["--allow-no-auth"]
      httpPort: 7243
      membershipPort: 6933
      port: 7233
      replicas: 1
      resources: {}
    history:
      httpPort: 0
      membershipPort: 6934
      port: 7234
      replicas: 1
      resources: {}
    matching:
      httpPort: 0
      membershipPort: 6935
      port: 7235
      replicas: 1
      resources: {}
    worker:
      httpPort: 0
      membershipPort: 6939
      port: 7239
      replicas: 1
      resources: {}
  ui:
    enabled: true
    image: temporalio/ui
    ingress:
      annotations:
        external-dns.alpha.kubernetes.io/cloudflare-proxied: "false" # this will need to be true when we are out of the basement
        cert-manager.io/issuer: letsencrypt-prod
        nginx.ingress.kubernetes.io/proxy-buffer-size: 128k
      hosts:
        - CHANGEME
      tls:
        - hosts:
            - CHANGEME
          secretName: temporal-ingress-cert
    overrides:
      deployment:
        spec:
          template:
            spec:
              containers:
                - env:
                    - name: TEMPORAL_AUTH_ENABLED
                      value: "true"
                    - name: TEMPORAL_AUTH_SCOPES
                      value: openid email profile
                  name: ui
                  resources: {}
                  envFrom:
                    - secretRef:
                        name: temporal-oidc-client
    replicas: 1
    resources:
      limits:
        cpu: "1"
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 20Mi
    version: 2.26.1
  version: 1.22.7
alexandrevilain commented 7 months ago

Thanks for this!

I'll try to figure out what's happening as soon as possible!

Looking at your manifest, feel free to open a separate issue about improving the UI configuration to simplify the auth setup; it looks like it could be improved!

bmorton commented 1 week ago

I am running into this issue too, but I think I have a reproduction. This is my current config:

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal
  namespace: temporal-repro
spec:
  version: 1.23.0
  numHistoryShards: 8
  persistence:
    defaultStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
    visibilityStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal_visibility
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
  ui:
    enabled: true
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      clientCertificates: 1h0m0s
      frontendCertificate: 1h0m0s
      intermediateCAsCertificates: 1h30m0s
      internodeCertificate: 1h0m0s
      rootCACertificate: 2h0m0s
    refreshInterval: 5m0s

Looking at the docs for cert-manager's Certificate resource, it supports renewBefore but not refreshInterval. If I use refreshInterval as above, the certificate's renewal time appears to fall back to a default. If I instead set renewBefore to 55m0s (the inverse of the 5m0s), I can see it reflected on the Certificate, and the renewal time is 5m after issuance. Refreshing the Certificate object just now, I can see that it renewed. I'm able to reload the UI, but the shortest certificate duration cert-manager supports is 1h, so I've got another ~50m before expiration to see if this is all working now.
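
For reference, the mapping I'm describing would look roughly like this on the cert-manager side (just a sketch; the resource name, secretName, issuerRef, and dnsNames are placeholders, not necessarily what the operator generates):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-frontend-certificate # placeholder name
spec:
  secretName: temporal-frontend-certificate # placeholder
  duration: 1h0m0s    # total certificate lifetime
  renewBefore: 55m0s  # renew 55m before expiry, i.e. ~5m after issuance
  issuerRef:
    name: temporal-intermediate-ca-issuer # placeholder issuer
    kind: Issuer
  dnsNames:
    - temporal-frontend # placeholder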

I'm still learning the internals of the operator, but digging into where I think the Certificate spec is being generated, I don't see how refreshInterval would be accounted for: it references renewBefore directly, and Certificate does not natively support a refreshInterval.

Given this, I wonder if we should remove refreshInterval from the TemporalCluster spec and only support renewBefore to better align with cert-manager. Alternatively, we could compute renewBefore as duration - refreshInterval if we wanted to keep supporting it, but that seems unnecessary.

bmorton commented 1 week ago

Well, I can see that the Certificate objects are rotating, but I am still running into issues once certificates expire. I went to the UI and tried to create a Schedule for a test action and it returns:

last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"

If I go back and check the logs of one of the deployments, in this case the Temporal worker, I start to see these first:

2024-11-22 06:53:37.277Z {"level":"warn","ts":"2024-11-22T06:53:37.277Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temporal-worker-7c4d8564db-6w4s5@","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-11-22T06:53:34Z is after 2024-11-22T06:50:32Z\"","logging-call-at":"internal_worker_base.go:333"}

There are many of those logs, but they eventually turn into:

2024-11-22 07:52:55.968Z {"level":"warn","ts":"2024-11-22T07:52:55.967Z","msg":"Failed to poll for task.","service":"worker","Namespace":"default","TaskQueue":"temporal-sys-per-ns-tq","WorkerID":"server-worker@1@temporal-worker-7c4d8564db-6w4s5@default","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \\\"crypto/rsa: verification error\\\" while trying to verify candidate authority certificate \\\"Root CA certificate\\\")\"","logging-call-at":"internal_worker_base.go:333"}

I'm still digging into if I can stabilize this somehow and why this is happening, but wanted to report these findings first.

bmorton commented 1 week ago

I restarted all the deployments as well and it doesn't seem to stabilize. I'm digging around looking at the Certificate manifests and such, but I'm not sure what to look at next yet.

bmorton commented 1 week ago

Certificate management is not my strong suit, but I was thinking about this a bit more and I think these durations are overly aggressive. If the root and intermediate CAs expire fairly often, do we end up with windows of time where we need to trust multiple root CAs, since certs could have been issued under any of them?

I'm going to try this config with much wider duration windows, but I don't know if I'm just delaying the problem. I think as long as the root CA lifetime is sufficiently long, things will be fine. However, once the root CA does expire, there will be a period where certificates issued under it need to be reissued.

spec:
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      # End-entity certificates
      clientCertificates: 24h
      frontendCertificate: 24h
      internodeCertificate: 24h

      # Intermediate and root certificates
      intermediateCAsCertificates: 168h  # 1 week
      rootCACertificate: 720h    # 30 days
    renewBefore: 8h

It seems like we need to handle multiple root CAs somehow, though I'm not sure I entirely understand this yet. Does that make sense?
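
To illustrate what I mean by trusting multiple roots (purely a sketch, not something the operator produces today): during a rotation window, the CA bundle the services verify peers against would need to contain both the outgoing and the incoming root, e.g. something shaped like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: temporal-ca-bundle # hypothetical name
data:
  ca.crt: |
    # outgoing root CA -- certs issued before the rotation still chain to this
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
    # incoming root CA -- certs issued after the rotation chain to this
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----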

bmorton commented 1 week ago

Alright -- I have this working now, and I don't think the root CA concern above is actually a problem.

In my last test, I realized I had some Certificate/Secret objects that had stuck around after I cleaned things up in ArgoCD. With renewBefore set and no leftover Certificate or Secret objects from a previous cluster attempt, renewal with the tighter durations has worked for the past ~10 hours.

This feels like maybe we can fix it with some documentation?

bmorton commented 1 week ago

Hmm, spoke too soon. With the short config above, I still get an error when trying to take actions in the UI:

last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"

Will continue to investigate here.

alexandrevilain commented 1 week ago

Thanks for your work @bmorton! Appreciate it. Feel free to reach out to me if you need help digging into this issue.