alexandrevilain / temporal-operator

Temporal Kubernetes Operator
https://temporal-operator.pages.dev/
Apache License 2.0

Failed to update cluster to enable mTLS with cert-manager #472

Open yujunz opened 1 year ago

yujunz commented 1 year ago

Modified the TemporalCluster resource to enable mTLS.
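
A minimal sketch of that kind of change, assuming the cert-manager provider (the exact spec isn't included here; field names follow the example later in this thread):

spec:
  mTLS:
    # cert-manager issues the internode and frontend certificates
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true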

Pods failed to start because their certificate secrets were not found:

FailedMount
MountVolume.SetUp failed for volume "worker-mtls-certificate" : secret "temporal-tok-worker-mtls-certificate" not found
FailedMount
MountVolume.SetUp failed for volume "internode-certificate" : secret "temporal-tok-internode-certificate" not found
FailedMount
MountVolume.SetUp failed for volume "frontend-certificate" : secret "temporal-tok-frontend-certificate" not found
FailedMount
MountVolume.SetUp failed for volume "frontend-intermediate-ca-certificate" : secret "temporal-tok-frontend-intermediate-ca-certificate" not found

This happened even though the secrets had already been generated by cert-manager:

kubectl get secrets
NAME                                                 TYPE                             DATA   AGE
temporal-tok-admintools-mtls-certificate             kubernetes.io/tls                3      18m
temporal-tok-aws-dev1-db-password                    Opaque                           1      6d18h
temporal-tok-aws-dev1-ro-db-password                 Opaque                           1      6d18h
temporal-tok-aws-dev1-rw-db-password                 Opaque                           1      6d18h
temporal-tok-frontend-certificate                    kubernetes.io/tls                3      18m
temporal-tok-frontend-intermediate-ca-certificate    kubernetes.io/tls                3      18m
temporal-tok-internode-certificate                   kubernetes.io/tls                3      18m
temporal-tok-internode-intermediate-ca-certificate   kubernetes.io/tls                3      18m
temporal-tok-root-ca-certificate                     kubernetes.io/tls                3      18m
temporal-tok-ui-mtls-certificate                     kubernetes.io/tls                3      18m
temporal-tok-worker-mtls-certificate                 kubernetes.io/tls                3      18m
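
(For reference, the cert-manager Certificate objects behind these secrets can be checked for readiness as well; the certificate name below is just one of those listed above:)

kubectl get certificates
kubectl describe certificate temporal-tok-frontend-certificate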

Deleting the cluster and recreating it works, though.
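
The workaround, roughly (the cluster name temporal-tok is inferred from the secret names above, and the manifest file name is illustrative):

kubectl delete temporalcluster temporal-tok
kubectl apply -f temporal-cluster.yaml  # same manifest, with mTLS still enabled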

alexandrevilain commented 1 year ago

Hi @yujunz !

Could you please give me some reproduction steps? I tried on my side and it worked: I created a new cluster using this example with the spec.mTLS section removed. Once the cluster was created, I restored the mTLS section and it worked well.

bmorton commented 3 days ago

I am not able to reproduce this either. I provisioned the cluster with this:

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal
  namespace: temporal-mtls-repro
spec:
  version: 1.23.0
  numHistoryShards: 8
  persistence:
    defaultStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
    visibilityStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal_visibility
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
  ui:
    enabled: true

I waited for the cluster to become healthy in ArgoCD. I checked the UI and everything looked healthy as well. I pushed another commit to add this section and re-synced in ArgoCD:

mTLS:
  provider: cert-manager
  internode:
    enabled: true
  frontend:
    enabled: true
  certificatesDuration:
    clientCertificates: 1h0m0s
    frontendCertificate: 1h0m0s
    intermediateCAsCertificates: 1h30m0s
    internodeCertificate: 1h0m0s
    rootCACertificate: 2h0m0s
  renewBefore: 55m0s

The Certificate objects took ~20s or so to become ready on my homelab cluster, and all the Temporal deployments rolled out as soon as the Certificates were marked valid. I didn't have any clients connected, but it seemed like a pretty seamless change. I vote to close this issue and re-open it if we get a reproduction.
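
A rough way to watch the same sequence (the deployment name is assumed to follow the operator's <cluster-name>-<service> naming, using the names from the example above):

kubectl -n temporal-mtls-repro get certificates
kubectl -n temporal-mtls-repro rollout status deployment/temporal-frontend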