dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Check DB Connectivity in Liveness probe #216

Open dhiaayachi opened 1 month ago

dhiaayachi commented 1 month ago

Is your feature request related to a problem? Please describe.

In a K8s environment, DB passwords and certificates are renewed from time to time. When that happens, the worker, frontend, matching, and history services currently fail silently and only log the error. The passwords are stored in K8s secrets and loaded as environment variables, so a pod restart would resolve the issue.

Describe the solution you'd like

Check DB connectivity in the liveness probe. If the DB password changes, the probe fails, the Temporal pods restart, and they load the new secret.
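A minimal sketch of what such a check could look like, assuming the service holds a standard database/sql handle and exposes an HTTP endpoint for a kubelet httpGet probe. This is not how Temporal implements health checks; the names Pinger, PingerFunc, probeStatus, livenessHandler, fakePinger, and errDemo are hypothetical and exist only for illustration.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"net/http"
	"time"
)

// Pinger is the single method the probe needs (hypothetical name).
type Pinger interface {
	PingContext(ctx context.Context) error
}

// Compile-time check that a standard *sql.DB can be used as a Pinger.
var _ Pinger = (*sql.DB)(nil)

// PingerFunc adapts a plain function to Pinger, which is handy for tests.
type PingerFunc func(ctx context.Context) error

// PingContext calls the wrapped function.
func (f PingerFunc) PingContext(ctx context.Context) error { return f(ctx) }

// errDemo and fakePinger are stand-ins for a real database (assumptions).
var errDemo = errors.New("demo: password authentication failed")

func fakePinger(err error) PingerFunc {
	return func(context.Context) error { return err }
}

// probeStatus pings the database with a deadline and maps the result to
// an HTTP status: 200 when the ping succeeds, 503 when it fails or times out.
func probeStatus(p Pinger, timeout time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := p.PingContext(ctx); err != nil {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// livenessHandler exposes probeStatus over HTTP so that a liveness probe
// can poll it; a non-2xx response marks the container as unhealthy.
func livenessHandler(p Pinger, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(probeStatus(p, timeout))
	}
}
```

The handler could be registered with http.Handle on a path of your choosing and given a real *sql.DB. Two caveats apply to this sketch: PingContext may succeed on a connection that is already open in the pool, so a rotated password might go unnoticed until a new connection is needed, and tying liveness to the database means a database outage will also restart the pods.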

Describe alternatives you've considered

Using tctl namespace list as the liveness probe exec command. However, it can't connect to localhost even when the correct port is specified. Also, replacing the existing liveness probe feels hacky.

Additional context

An example of the tctl n l failure when run inside the history pod:

temporal-history-67f566466-qz8n8:/etc/temporal$ tctl --address localhost:7234 n l
Error: Error when list namespaces info
Error Details: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake"
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
dhiaayachi commented 1 month ago

Thanks for reporting this issue.

This is a great feature request. We understand that Temporal pods failing silently when they lose connectivity to the database is not ideal, and that a liveness probe that checks database connectivity would be a valuable addition.

Currently, there is no built-in way to do this. However, here are two potential workarounds:

  1. Write a custom Activity that checks the database connection, and use this Activity in your liveness probe.

  2. Write a custom Activity that checks the database connection, and call it in a loop from your Workflow. If the Activity fails, terminate the Workflow and restart it.

We encourage you to open a feature request on GitHub so we can consider it for future releases.

dhiaayachi commented 1 month ago

Thanks for reporting this issue.

Temporal Server and Temporal CLI are intended for development environments. To connect to the localhost:7234 endpoint, you must specify your client certificate and private key with the --tls-cert-path and --tls-key-path options. For example:

tctl --address localhost:7234 --tls-cert-path /path/to/cert.pem --tls-key-path /path/to/key.pem n l