Open mberhault opened 7 years ago
@rolandcrosby for triage.
We should support an option to reload certificates when CockroachDB detects a file change within the certificate directory. This would be very helpful in environments like Kubernetes, where it is difficult to signal the containers in a pod.
With auto-reload support, all that would be needed to update the certificates in a cluster would be updating the K8s secret storing the certificate. My one concern is that we want to avoid a bad K8s secret update taking down the entire cluster at once. Maybe it would be useful if we could verify that we can still communicate with the rest of the cluster after a reload, and if we can't, then we continue to use the old certificates in memory and report an error?
Maybe it would be useful if we could verify that we can still communicate with the rest of the cluster after a reload, and if we can't, then we continue to use the old certificates in memory and report an error?
I think this is an important question but maybe it's orthogonal. My understanding is that the handling of SIGHUP currently does not prevent a misconfiguration (wrong file) from hosing the cluster.
Adding an additional method using e.g. a cluster RPC + CLI command would not make the problem any worse. Or would it? @aaron-crl can you help us think about this?
In my mind, the key difference is that when using SIGHUP, you can only hose one node at a time. If our solution signals one node at a time and ensures the node can still communicate with the cluster afterward, then that's great. If it signals all the nodes at once, then we could have a problem.
Any solution here should be able to signal one node at a time since this is a change that affects each node individually.
From a security and auditing standpoint I'd like to see this reload as a deliberate action attached to a node specific endpoint. Since the certificate updates are being pushed by an external orchestrator, it would make sense to provide a "reload cert" endpoint for that orchestration to hit as opposed to letting CRDB blackbox its update or overloading the SIGHUP operation - which doesn't seem like a correct functional mapping.
Reasons for a reload endpoint:
As far as ensuring that the certificate rollout doesn't break the internode communications, that would be the sort of thing you might consider rolling into such an orchestration endpoint so that it can more gracefully provide testing and rollback of new configurations.
Could we (should we) add a SQL builtin to trigger cert reloading on a per-node basis? There is precedence for a SQL builtin that has a per-node effect: crdb_internal.set_vmodule
. I'm imagining something like crdb_internal.reload_certs()
.
There's an item on the CLI roadmap for 20.2 to migrate as many RPCs as possible over the SQL protocol. I suppose this will be covered there.
If the SQL builtin allowed specifying the node ID to reload certificates for, that would make the orchestration much easier.
Is there any update here? We have certificates issued with cert-manager and they are going to expire soon'ish
there is no update yet. Until then you can perform a rolling restart of your cluster to pick up the new certs if the signal method doesn't work for you.
cc @jtsiros for triage and tracking
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
remove stale
1674 introduces online certificate reload, with refreshes triggered by
SIGHUP
.However, this does not work on Windows.
We should add additional ways of triggering certificate reload:
Epic: CRDB-6663
Jira issue: CRDB-6095