cli: the `--insecure` flag will be removed and replaced by easier-to-use always-secure options

knz commented 4 years ago

Epic: CRDB-12037

The --insecure flag disables a number of security controls which are completely expected even for developers or folk trying out CockroachDB.

Instead, we aim to ensure that all clusters are secure, but offer new secure options where the user can choose the combination of security features that best matches their needs and their environment.

(Additionally, the word "insecure" gives the wrong impression that CockroachDB is not a secure database. The "insecure mode" only really exists for internal testing by the CockroachDB team and should not have been exposed to users from the start.)

What you need to know in a post-insecure world
- New clusters
- Existing clusters
Rationale

What you need to know in a post-insecure world

CockroachDB aims to remain easy to use when all clusters are secured!

New clusters

Simplified decision making:

for developers, single machine, no security requirements:
- cockroach demo, automatically secure
- single-node secure cluster, with auto-managed TLS
multiple machines or prod system:
- for node-node traffic:
- no special requirement on inter-node traffic: auto-managed node-node TLS
- special requirements:
  - org-mandated PKI: use org-mandated PKI to manage node-node TLS
  - user responsible for network security: no-TLS RPC
- for node-client traffic:
- no special requirements: auto-managed SQL/HTTP TLS with either password or TLS authn
  This is the most common pg-compatible scenario!
- special requirements:
  - org-mandated PKI: use org-mandated cert manager to manage SQL/HTTP TLS
  - user responsible for network security: no-TLS SQL/HTTP

Cluster is secure in all cases.

Threat model

Node-node connections

Threat \ Protection	TLS transport+authn	no-TLS + Shared secret	no-TLS + no auth (`--insecure`)	Mitigation possible via network restrictions?
Mistaken node join from wrong cluster	protected	protected	vulnerable	no
Complete cluster takeover via malicious node	protected	vulnerable via bruteforce	vulnerable	partial
Attacker MITM snoop of authn credentials	protected	vulnerable	vulnerable	yes
Attacker accesses confidential data via network snoop	protected	vulnerable	vulnerable	yes
Attacker modifies cluster data in-flight	protected	vulnerable	vulnerable	yes
Compromised authn credentials	Fix via cert rotation/revocation (online)	Update secret + full restart	N/A	partial

Note: the vulnerabilities in the 2nd and 3rd column are amplified by sharing the same TCP listener (address/port) between SQL and RPC listeners.

Partial mitigation possible via separate --sql-addr and fencing off the node-to-node traffic into a private network.

Node-client connections

Threat \ Protection	TLS transport+authn	no-TLS + SCRAM authn	no-TLS + no auth (`--insecure`)	Mitigation possible via network restrictions?
Mistaken connect from/to wrong app or wrong cluster	protected	protected	vulnerable	no
Complete cluster takeover via escalation of privilege	protected	protected	vulnerable	partial
Rogue client app unrestricted access to data of 1 SQL user	protected	protected	vulnerable	partial
Attacker MITM snoop of authn credentials	protected	protected	vulnerable	yes
Attacker accesses app data data via network snoop	protected	vulnerable	vulnerable	yes
Attacker modifies app data in-flight	protected	vulnerable	vulnerable	yes
Compromised authn credentials	Fix via cert rotation/revocation (online)	Update password (online)	N/A	no

Note that the first column in the table above assumes that clients also validate server TLS certs! Otherwise the following table applies:

Threat \ Protection	Two-way cert validation	Only srv auths client
Mistaken connect from/to wrong app or wrong cluster	protected	protected
Complete cluster takeover via escalation of privilege	protected	protected
Rogue client app unrestricted access to data of 1 SQL user	protected	protected
Attacker MITM snoop of authn credentials	protected	vulnerable
Attacker accesses app data data via network snoop	protected	vulnerable
Attacker modifies app data in-flight	protected	vulnerable
Compromised authn credentials	Fix via cert rotation/revocation (online)	still vulnerable (!)

Existing clusters previously run with "insecure mode"

determine the secure configuration that suits your needs (see above)
prepare configuration:
- create authn credentials for SQL users (can be simple password)
- prepare server flags and TLS certs that suit the needs identified above
- add flag --security-upgrade to node configs.
perform rolling restart of cluster. At this point the cluster accepts but does not mandate the secure config.
remove flag --security-upgrade from node configs
perform 2nd rolling restart. This enforces the secure configuration.

Advanced decision flowchart under the fold

![Variants of secure clusters](https://user-images.githubusercontent.com/642886/92719011-dc80e200-f362-11ea-88cd-4844702c81f2.png)

Rationale

Background

For context, --insecure does the following:

deactivates TLS handshakes for node-node connections
deactivates TLS handshakes for node-client RPC connections
deactivates TLS handshakes for node-client SQL connections over TCP
deactivates TLS handshakes for node-client HTTP connections
deactivates node-to-node authentication
deactivates HTTP authentication
deactivates SQL authentication
deactivates SQL authorization
deactivate certain SQL features, so as to not create the illusion of security

Motivation

We know from experience that customers who advocate for "insecure mode" really only care about:

the complexity of setting up TLS certificates for nodo-to-node authentication, or
they want TLS-less SQL connections because they are running a server over a privileged network, for example when the nodes and the SQL app are inside the same k8s cluster (see cockroachdb/helm-charts#228, cockroachdb/cockroach#44842).

These users certainly do not want to disable authentication and authorization, yet we also know that users do not realize that --insecure disables these internal protections inside cockroachdb.

Some users also care about point 4 because setting up TLS certs in a web browser is a pain. However since v20.1 we have a solution for those users: --unencrypted-localhost-http makes the HTTP endpoint localhost-only without TLS.

Strategy

[x] in v20.2 make --insecure deprecated
[x] Solve cockroachdb/cockroach#44842 so that users get a choice to connect with SQL over non-TLS connections - cockroachdb/cockroach#53991
[x] Solve cockroachdb/cockroach#48069 to polish the HTTP probe experience
[ ] Advance cockroachdb/cockroach#51991 to simplify the TLS usage in the common case
[ ] Possibly solve cockroachdb/cockroach#54007 to offer an online upgrade path from insecure clusters to secure
[ ] advertise these mechanisms in docs
[ ] update roachtests to always use a secure cluster, sometimes without TLS to preserve the ability to capture network activity between clients and nodes
[ ] remove support for the flag in servers in v21.1 or v21.2
[ ] remove support for the flag in clients in the release after that (to preserve the ability for a new client to connect to a server running the version before)

Epic CRDB-12037

Jira issue: CRDB-3870

knz commented 4 years ago

Related: cockroachdb/cockroach#49918

knz commented 4 years ago

cc @thtruo @aaron-crl for tracking

jasobrown commented 4 years ago

Is the intent here to move all connections to the crdb process to be secure? And by secure, i mean secure in the way that cockroachdb expects? We manage TLS as well as cert authN/AuthZ outside of crdb as a separate process on the database instances, and thus run crdb in insecure mode because it's fronted by the other process. The reason for this is our authN/authZ workflow is slightly non-standard and the effort to integrate that into crdb (as far we knew a year ago) seemed daunting.

Perhaps I can take another look at our golang support for our authN/authZ as well as the cockroach code to see if the integration could work via some extension points/plugins.

knz commented 4 years ago

Hi Jason, thank you for the feedback.

Once the --insecure flag is gone, you will be able to achieve the same as your current set up with the following configuration

use a non-TLS listener for the SQL clients (#44842)
use a HBA configuration that tells CockroachDB to trust the login principal without additional authentication (since you authenticate externally). You would then narrow down the trust rule to trust only the address of your external authentication service (and not arbitrary connections).

Meanwhile, we'll encourage you to keep node-to-node connections secure using TLS; if you prefer not to set up TLS certs manually, we'll have some new experience (see cockroachdb/cockroach#51991) which will simplify the setup.

Does that sound good to you?

knz commented 4 years ago

@jasobrown a question for you though: does your set up use a single SQL account for all operations, or do you use separate SQL accounts with different privileges? Can you outline a little bit how your authorization looks like?

edit: discussed this with jason offline

bdarnell commented 4 years ago

The word "insecure" gives the wrong impression that CockroachDB is not a secure database.

Citation needed. Are people really getting confused by this? I think the real issue with --insecure mode is that because of the lack of grpc authentication, it's more insecure than people expect.

Cluster is secure in all cases.

Security is not binary. The question is always "secure against what threats?" Some of these cases are secure against threats that others are not, so it's misleading to say that they're all equally secure (of course they may be used in systems that are secure against those threats if security is provided at other layers)

And that's really what we're talking about here - traditionally we've just had all-or-nothing, with the no-protection-at-all --insecure mode or complex manually-managed certificates. Here we're adding multiple in-between modes. And then, because some of these intermediate modes will be strictly more secure than --insecure but no more difficult to use, we'll eventually eliminate --insecure mode. I think leading with the removal of --insecure is causing some surprise among users (although more than I expected - I thought we had discouraged this mode enough that no one would be using it in production).

That flowchart scares me - there's a lot of decision points there. A lot of them are new. The goal of cockroachdb/cockroach#51991 and eliminating insecure mode is to reduce the number of options that most users experience, by steering nearly all users to the "internally-managed TLS" path. I think we need to reconsider the amount of flexibility we're providing here, and what decisions the user really needs to make.

For example, I don't think we ever want to give the "non-TLS SQL" option this much emphasis. CockroachCloud will always need to use TLS, so we must have a good user experience for TLS (perhaps by working with driver implementations on things like cockroachdb/cockroach#32932).

Existing clusters previously run with "insecure mode"

Are there enough of these that we need to provide a no-downtime upgrade path? This seems like it could add quite a bit of complexity (can you upgrade from any security mode to any other, or only certain ones?)

We know from experience that customers who advocate for "insecure mode" really only care about point 3: they want TLS-less SQL connections

This is definitely not true. There have been some recent examples of this but I think the most common reason people resist setting their clusters up securely is about all the TLS setup, especially for node-to-node.

Possibly solve cockroachdb/cockroach#54007 and/or advance cockroachdb/cockroach#51991 to enable secure RPC without manual TLS configs

This seems like a requirement, not just a "possibly".

knz commented 4 years ago

The word "insecure" gives the wrong impression that CockroachDB is not a secure database.

Citation needed.

This came up internally multiple times.

Are people really getting confused by this? I think the real issue with --insecure mode is that because of the lack of grpc authentication, it's more insecure than people expect.

This is pointed out prominently at the top of the issue description. But I'll take your point, as well as this one:

because some of these intermediate modes will be strictly more secure than --insecure but no more difficult to use, we'll eventually eliminate --insecure mode. I think leading with the removal of --insecure is causing some surprise among users.

I have adjusted the title of the issue to emphasize the new things / improvements and pulled that notion as first sentences/paragraphs in the issue description.

Cluster is secure in all cases.

Security is not binary. The question is always "secure against what threats?"

You and I know that security is not binary but the point here is that all the proposed options keep all internal security controls active, which --insecure=true does not.

Anyway you are right that we need to talk about threats. Adding a table in the issue description instead of the flowchart. Bram helped me understand that a list of threats is more easy to use than a flowchart anyway.

Some of these cases are secure against threats that others are not, so it's misleading to say that they're all equally secure

"say they're all equally secure" - that's a strawman. Nobody wrote this anywhere.

That flowchart scares me - there's a lot of decision points there. [...] I think we need to reconsider the amount of flexibility we're providing here, and what decisions the user really needs to make.

Point granted. I've reduced the scope accordingly.

For example, I don't think we ever want to give the "non-TLS SQL" option this much emphasis. CockroachCloud will always need to use TLS, so we must have a good user experience for TLS (perhaps by working with driver implementations on things like cockroachdb/cockroach#32932).

It seems clear that we need to emphasize the CC use case first and foremost, and thus preserve TLS options as the main recommendation (and main recommended scenario). However we do have serious $$$ at risk if we don't offer other options.

Existing clusters previously run with "insecure mode"

Are there enough of these that we need to provide a no-downtime upgrade path? This seems like it could add quite a bit of complexity (can you upgrade from any security mode to any other, or only certain ones?)

There are at least a few customers asking (those big accounts who didn't like our TLS and went to prod with --insecure, despite our recommendations to the countrary). But we can narrow down the scope of the upgrade path to support just those customers, yes.

We know from experience that customers who advocate for "insecure mode" really only care about point 3: they want TLS-less SQL connections

This is definitely not true. There have been some recent examples of this but I think the most common reason people resist setting their clusters up securely is about all the TLS setup, especially for node-to-node.

Point taken. Adjusted the text accordingly.

Possibly solve cockroachdb/cockroach#54007 and/or advance cockroachdb/cockroach#51991 to enable secure RPC without manual TLS configs

This seems like a requirement, not just a "possibly".

The word "possibly" only pertained to cockroachdb/cockroach#54007 (which you and I would agree is probably less important). Issue cockroachdb/cockroach#51991 is definitely critical.

knz commented 4 years ago

I also just realized that the TLS story could also be simplified by issuing a common TLS client cert for all SQL users (and give them separate passwords to distinguish them)

aaron-crl commented 4 years ago

Apologies for the late response but there are three things that strike me here:

1) --insecure should probably do a better job of explicitly stating what is insecure mode. Perhaps we should include your list below in the output from the flag (formatted to CLI of course):

--insecure does the following:

deactivates TLS handshakes for node-node connections
deactivates TLS handshakes for node-client RPC connections
deactivates TLS handshakes for node-client SQL connections over TCP
deactivates TLS handshakes for node-client HTTP connections
deactivates node-to-node authentication
deactivates HTTP authentication
deactivates SQL authentication
deactivates SQL authorization
deactivate certain SQL features, so as to not create the illusion of security

2) There's a lot here to unpack and given the potential impact to developers and customers perhaps this should be moved to an RFC given the current prevalence of this flag and depth of discussion.

3) I don't feel we should mark a feature as deprecated until we have a supported alternative (or at least support for the majority of the usage). I don't feel we have that yet given that our own documentation still includes this flag for tutorials and guides.

knz commented 4 years ago

Regarding point 1: there is now a clearer warning than before. The warning reads as follows:

*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
*
* This mode is intended for non-production testing only.
*
* In this mode:
* - Your cluster is open to any client that can access any of your IP addresses.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
*
* - https://go.crdb.dev/issue-v/53404/v20.2
* - https://www.cockroachlabs.com/docs/v20.2/secure-a-cluster.html
*

Regarding point 2: yes this work will be in a RFC of course.

Regarding point 3: agreed

cockroachdb / cockroach