mberhault commented 7 years ago

1674 introduces online certificate reload, with refreshes triggered by `SIGHUP`.

However, this does not work on Windows.

We should add additional ways of triggering certificate reload:

admin UI endpoint
cli command
maybe periodic refresh based on cert expiration

Epic: CRDB-6663

Jira issue: CRDB-6095

mberhault commented 4 years ago

@rolandcrosby for triage.

DuskEagle commented 4 years ago

We should support an option to reload certificates when CockroachDB detects a file change within the certificate directory. This would be very helpful in environments like Kubernetes, where it is difficult to signal the containers in a pod.

With auto-reload support, all that would be needed to update the certificates in a cluster would be updating the K8s secret storing the certificate. My one concern is that we want to avoid a bad K8s secret update taking down the entire cluster at once. Maybe it would be useful if we could verify that we can still communicate with the rest of the cluster after a reload, and if we can't, then we continue to use the old certificates in memory and report an error?

knz commented 4 years ago

Maybe it would be useful if we could verify that we can still communicate with the rest of the cluster after a reload, and if we can't, then we continue to use the old certificates in memory and report an error?

I think this is an important question but maybe it's orthogonal. My understanding is that the handling of SIGHUP currently does not prevent a misconfiguration (wrong file) from hosing the cluster.

Adding an additional method using e.g. a cluster RPC + CLI command would not make the problem any worse. Or would it? @aaron-crl can you help us think about this?

DuskEagle commented 4 years ago

In my mind, the key difference is that when using SIGHUP, you can only hose one node at a time. If our solution signals one node at a time and ensures the node can still communicate with the cluster afterward, then that's great. If it signals all the nodes at once, then we could have a problem.

aaron-crl commented 4 years ago

Any solution here should be able to signal one node at a time since this is a change that affects each node individually.

From a security and auditing standpoint I'd like to see this reload as a deliberate action attached to a node specific endpoint. Since the certificate updates are being pushed by an external orchestrator, it would make sense to provide a "reload cert" endpoint for that orchestration to hit as opposed to letting CRDB blackbox its update or overloading the SIGHUP operation - which doesn't seem like a correct functional mapping.

Reasons for a reload endpoint:

Auditing for major system changes (like change of a trust anchor) could be mapped to a security principle and logged accordingly.
Certificate update operations can be attempted and provide immediate feedback from the endpoint as opposed to requiring parsed log scanning; improving resiliency and stability (e.g. "ERROR: Cert Failed to Parse! Leaving existing certificate in place.")
Certs could ultimately be furnished to that endpoint and stored by the node in its own blob storage container; reducing the need for orchestration to touch the container filesystem for certificate updates and moving this update to an application level only change.

As far as ensuring that the certificate rollout doesn't break the internode communications, that would be the sort of thing you might consider rolling into such an orchestration endpoint so that it can more gracefully provide testing and rollback of new configurations.

petermattis commented 4 years ago

Could we (should we) add a SQL builtin to trigger cert reloading on a per-node basis? There is precedence for a SQL builtin that has a per-node effect: crdb_internal.set_vmodule. I'm imagining something like crdb_internal.reload_certs().

knz commented 4 years ago

There's an item on the CLI roadmap for 20.2 to migrate as many RPCs as possible over the SQL protocol. I suppose this will be covered there.

DuskEagle commented 4 years ago

If the SQL builtin allowed specifying the node ID to reload certificates for, that would make the orchestration much easier.

christianhuening commented 3 years ago

Is there any update here? We have certificates issued with cert-manager and they are going to expire soon'ish

knz commented 3 years ago

there is no update yet. Until then you can perform a rolling restart of your cluster to pick up the new certs if the signal method doesn't work for you.

knz commented 2 years ago

cc @jtsiros for triage and tracking

github-actions[bot] commented 11 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

george-angel commented 11 months ago

remove stale

cockroachdb / cockroach

security: add other methods to trigger certificate reload. #15028

1674 introduces online certificate reload, with refreshes triggered by `SIGHUP`.

cockroachdb / cockroach

security: add other methods to trigger certificate reload. #15028

1674 introduces online certificate reload, with refreshes triggered by SIGHUP.

1674 introduces online certificate reload, with refreshes triggered by `SIGHUP`.