ElanHasson opened 7 months ago
Hi @ElanHasson !
Thanks for reporting this issue. Do you have any steps to reproduce it? Manifests, operator version, etc. would help.
Here you go @alexandrevilain
I have the durations set very high. Set them low, then watch them expire: you'll see the frontend fail, along with other services.

Also, when the certificates are valid but short-lived, try increasing the duration to something high. The expectation is that new certificates would be created with the new duration and rotated by the reconciler immediately.
```yaml
apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal
spec:
  admintools:
    enabled: true
    image: temporalio/admin-tools
    resources: {}
  authorization:
    # Uncomment once claims are setup in keycloak
    # authorizer: default
    # claimMapper: default
    jwtKeyProvider:
      keySourceURIs:
        - CHANGE_ME
      refreshInterval: 30m0s
    permissionsClaimName: permissions
  image: temporalio/server
  jobResources: {}
  jobTtlSecondsAfterFinished: 300
  log:
    development: false
    format: json
    level: info
    outputFile: ""
    stdout: true
  mTLS:
    certificatesDuration:
      clientCertificates: 17520h0m0s
      frontendCertificate: 17520h0m0s
      intermediateCAsCertificates: 17520h0m0s
      internodeCertificate: 17520h0m0s
      rootCACertificate: 27520h0m0s
    frontend:
      enabled: true
    internode:
      enabled: true
    provider: cert-manager
    refreshInterval: 72h0m0s # I do not think these are being considered properly.
    renewBefore: 1h10m0s # This too
  metrics:
    enabled: true
    prometheus:
      listenAddress: ""
      listenPort: 9090
      scrapeConfig:
        annotations: false
        serviceMonitor:
          enabled: true
  numHistoryShards: 1
  persistence:
    defaultStore:
      name: default
      passwordSecretRef:
        key: password
        name: temporal-postgres-app
      skipCreate: false
      sql:
        connectAddr: temporal-postgres-rw:5432
        connectProtocol: tcp
        databaseName: temporal
        maxConnLifetime: 0s
        maxConns: 0
        maxIdleConns: 0
        pluginName: postgres
        taskScanPartitions: 0
        user: temporal
    visibilityStore:
      name: visibility
      passwordSecretRef:
        key: password
        name: temporal-postgres-app
      skipCreate: false
      sql:
        connectAddr: temporal-postgres-rw:5432
        connectProtocol: tcp
        databaseName: temporal_visibility
        maxConnLifetime: 0s
        maxConns: 0
        maxIdleConns: 0
        pluginName: postgres
        taskScanPartitions: 0
        user: temporal
  services:
    frontend:
      overrides:
        deployment:
          spec:
            template:
              spec:
                containers:
                  - name: service
                    args: ["--allow-no-auth"]
      httpPort: 7243
      membershipPort: 6933
      port: 7233
      replicas: 1
      resources: {}
    history:
      httpPort: 0
      membershipPort: 6934
      port: 7234
      replicas: 1
      resources: {}
    matching:
      httpPort: 0
      membershipPort: 6935
      port: 7235
      replicas: 1
      resources: {}
    worker:
      httpPort: 0
      membershipPort: 6939
      port: 7239
      replicas: 1
      resources: {}
  ui:
    enabled: true
    image: temporalio/ui
    ingress:
      annotations:
        external-dns.alpha.kubernetes.io/cloudflare-proxied: "false" # this will need to be true when we are out of the basement
        cert-manager.io/issuer: letsencrypt-prod
        nginx.ingress.kubernetes.io/proxy-buffer-size: 128k
      hosts:
        - CHANGEME
      tls:
        - hosts:
            - CHANGEME
          secretName: temporal-ingress-cert
    overrides:
      deployment:
        spec:
          template:
            spec:
              containers:
                - env:
                    - name: TEMPORAL_AUTH_ENABLED
                      value: "true"
                    - name: TEMPORAL_AUTH_SCOPES
                      value: openid email profile
                  name: ui
                  resources: {}
                  envFrom:
                    - secretRef:
                        name: temporal-oidc-client
    replicas: 1
    resources:
      limits:
        cpu: "1"
        memory: 256Mi
      requests:
        cpu: 10m
        memory: 20Mi
    version: 2.26.1
  version: 1.22.7
```
Thanks for this!
I’ll try to figure out what’s going on as soon as possible!

Looking at your manifest: feel free to open a separate issue about improving the UI configuration to simplify the auth setup; it looks like it could be improved!
I am running into this issue too, but I think I have a reproduction. This is my current config:
```yaml
apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal
  namespace: temporal-repro
spec:
  version: 1.23.0
  numHistoryShards: 8
  persistence:
    defaultStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
    visibilityStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal_visibility
        connectAddr: temporal-db-rw:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: temporal-db-credentials
        key: password
  ui:
    enabled: true
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      clientCertificates: 1h0m0s
      frontendCertificate: 1h0m0s
      intermediateCAsCertificates: 1h30m0s
      internodeCertificate: 1h0m0s
      rootCACertificate: 2h0m0s
    refreshInterval: 5m0s
```
If I look at the docs for cert-manager's `Certificate` resource, it supports a `renewBefore`, but not a `refreshInterval`. If I use `refreshInterval` as above, the certificate's refresh timestamp appears to be the cert-manager default. If I instead set `renewBefore` to `55m0s` (the complement of the `5m0s` against the 1h duration), I can see it on the `Certificate`, and the refresh timestamp is 5m from issuance. Refreshing the `Certificate` object just now, I can see that it renewed. I'm able to reload the UI, but the shortest expiration interval cert-manager supports is 1h, so I've got another ~50m before expiration to see if this is all working now.

I'm still learning the internals of the operator, but if I dig into where I think the `Certificate` spec is generated, I don't see how `refreshInterval` would be accounted for, since the code references `renewBefore` directly and `Certificate` does not natively support a `refreshInterval`.

Given this, I wonder if we should remove `refreshInterval` from the `TemporalCluster` spec and only support `renewBefore` to better align with cert-manager. Alternatively, I think we could compute `duration - refreshInterval` if we wanted to keep supporting it, but that seems unnecessary.
Well, I can see that the Certificate objects are rotating, but I am still running into issues once certificates expire. I went to the UI and tried to create a Schedule for a test action and it returns:
```
last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"
```
If I go back and check the logs of one of the deployments, in this case the Temporal worker, I start to see these first:
```
2024-11-22 06:53:37.277Z {"level":"warn","ts":"2024-11-22T06:53:37.277Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temporal-worker-7c4d8564db-6w4s5@","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-11-22T06:53:34Z is after 2024-11-22T06:50:32Z\"","logging-call-at":"internal_worker_base.go:333"}
```
There are many of those logs, but they eventually turn into:
```
2024-11-22 07:52:55.968Z {"level":"warn","ts":"2024-11-22T07:52:55.967Z","msg":"Failed to poll for task.","service":"worker","Namespace":"default","TaskQueue":"temporal-sys-per-ns-tq","WorkerID":"server-worker@1@temporal-worker-7c4d8564db-6w4s5@default","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \\\"crypto/rsa: verification error\\\" while trying to verify candidate authority certificate \\\"Root CA certificate\\\")\"","logging-call-at":"internal_worker_base.go:333"}
```
I'm still digging into why this is happening and whether I can stabilize it somehow, but wanted to report these findings first.

I restarted all the deployments as well and it doesn't seem to stabilize. I'm digging around looking at the Certificate manifests and such, but I'm not sure what to look at next yet.
Certificate management is not my strong suit, but I was thinking about this a bit more and I think these durations are overly aggressive. If the root and intermediate CAs are expiring pretty often, do we have windows of time where we need to trust multiple root CAs, because certificates could have been issued under either of them?

I'm going to try this config with much wider duration windows, but I don't know if I'm just delaying the problem. I think as long as the root CA expiration is sufficiently long, things will be fine. However, once the root CA expires, there will be a period where certificates issued under it still need to be reissued.
```yaml
spec:
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      # End-entity certificates
      clientCertificates: 24h
      frontendCertificate: 24h
      internodeCertificate: 24h
      # Intermediate and root certificates
      intermediateCAsCertificates: 168h # 1 week
      rootCACertificate: 720h # 30 days
    renewBefore: 8h
```
It seems like we need to handle multiple root CAs somehow, though I'm not sure I entirely understand this yet. Does that make sense?
Alright, I have this working now, and the root CA concern above no longer feels like a problem.

In my last test, I realized I had some Certificate/Secret objects that stuck around after I cleaned things up in ArgoCD. If I use `renewBefore` and make sure there aren't any Certificate or Secret objects left over from a previous cluster attempt, renewal with the tighter durations has worked for the past ~10 hours.
This feels like maybe we can fix it with some documentation?
Hmm, spoke too soon. On the short config above, I still get an error when trying to take actions in the UI:
```
last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"
```
Will continue to investigate here.
Thanks for your work @bmorton! Appreciate it. Feel free to reach out to me if you need some help digging into this issue.
I see the issue where the certs are rotated and everything basically crashloops or fails requests due to expired certs. I have to delete the cert secrets and also the deployments to let the operator reconcile them properly.

I saw the same in #509 for CA certs, but it happens for the other certs too.