cockroachdb / cockroach


multitenant: Certificates for running debug zip #77958

Closed rimadeodhar closed 2 years ago

rimadeodhar commented 2 years ago

Today, when we run a CLI command that requires a SQL connection in a multi-tenant deployment, we require a root client certificate to open a SQL connection to a tenant server. However, a root client certificate may not always exist for a multi-tenant cluster. For example, in serverless deployments we don't have access to the root client certificates on tenants, for security reasons.

In this issue, we consider the possible alternative approaches for making debug zip work in multi-tenant deployments:

  1. Allow access to root client certificates for all multi-tenant deployments: This requires no server-side (CRDB) changes, but it may have security implications for serverless, since we may not want root client certificates to be accessible within the serverless pods. The advantage is that it keeps the workflow consistent across all CRDB deployments.
  2. Use a tenant client certificate as a replacement for the root client certificate by adding a username to the tenant certificate: In this option, we modify the certificate generation code to allow a username to be embedded while minting a tenant certificate, so that the tenant certificate can serve as a client certificate for the root user when running debug zip. We also need to relax the assertion in the certificate authentication code that currently rejects a tenant certificate used in lieu of a client certificate, and instead allow it as long as the username contained within the certificate matches the connecting user (a sketch of this check appears after this list).
  3. Use the tenant client certificate as is: In this final option, we use the tenant certificate unchanged for the root user. However, this requires special-casing the authentication code to accept a tenant certificate as a client certificate whenever the user is root, even though the certificate does not contain a matching username.
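
To make option 2 concrete, here is a minimal sketch in Go of the relaxed check described above. The helper name and the assumption that the username is minted into the certificate's Subject CommonName are illustrative; this is not the actual CRDB authentication code.

```go
// Sketch of option 2: accept a tenant certificate as a client
// certificate only when the username embedded in it matches the user
// opening the SQL connection. The use of Subject.CommonName to carry
// the username is an assumption of this sketch.
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
)

// checkTenantCertAsClientCert replaces a hard assertion that rejects
// tenant certificates outright: it permits one if, and only if, the
// embedded username matches the requested SQL user.
func checkTenantCertAsClientCert(cert *x509.Certificate, sqlUser string) error {
	if u := cert.Subject.CommonName; u != sqlUser {
		return fmt.Errorf("tenant certificate is for user %q, not %q", u, sqlUser)
	}
	return nil
}

func main() {
	cert := &x509.Certificate{Subject: pkix.Name{CommonName: "root"}}
	fmt.Println(checkTenantCertAsClientCert(cert, "root")) // <nil>
	fmt.Println(checkTenantCertAsClientCert(cert, "bob"))  // error
}
```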

Some other options we considered but haven't fleshed out yet:

We need to discuss the security implications of all of these approaches in order to settle on how we can make debug zips work for multi-tenant deployments.

Jira issue: CRDB-13873

rimadeodhar commented 2 years ago

cc @knz, @andy-kimball, @catj-cockroach

knz commented 2 years ago

I wonder if we could issue root client certs just for the CC superuser console, and then have those passed to the debug zip tool on the command line. That way they wouldn't need to be stored on the SQL pods.

andy-kimball commented 2 years ago

Yes, if we had a way to pass certificates on the command line, then we could pass the root client certs without storing them in the SQL pods. Kubernetes has a way to "exec" commands that we already use in Intrusion.

bdarnell commented 2 years ago

I think we need to reconsider what a debug zip even means in a multitenant context. We shouldn't make it possible to get a lot of the kv-level debug data via a single-tenant pod, so it's already going to be a fairly different implementation. The debugging workflow for multitenancy may require two steps: one to get debug information from the pod and a separate one to get information from the host cluster. Since the information available to the per-tenant pod is going to be much smaller than what's in our traditional debug zip, it seems more feasible to expose what we need via SQL instead of the hodgepodge of interfaces debug zip uses today. (Moving everything to SQL interfaces is a long-term goal, but there's a lot more legacy mess in the lower-level parts of the system.)

More fundamentally, is a zip file really the right model? We introduced the debug zip system for self-hosted deployments in order to get as much information from the customer as possible in one round trip. It shouldn't be a primary debugging tool in the cloud: everything that's in a debug zip should be accessible in some other way, whether that's internal SQL tables or DB Console pages.

knz commented 2 years ago

@bdarnell you're right, and all your recommendations are already considered in the roadmap. What we're chatting about here is how to get some observability into our existing deployments in the short term, ahead of implementing the more general solution.

bdarnell commented 2 years ago

The three main options presented here look more or less equivalent to me, security-wise (in all cases the end result is a certificate on disk that can be used to log in as root, whether that cert is properly a client.root.crt or a modification/reuse of the tenant cert). Given that, the first option seems the best of the three because it is the easiest (and it's the easiest to reverse when we no longer need it).

> Yes, if we had a way to pass certificates on the command line, then we could pass the root client certs without storing them in the SQL pods. Kubernetes has a way to "exec" commands that we already use in Intrusion.

You mean passing a base64 (or similar) version of the key in the command line? That seems to have a high risk of getting logged somewhere that we don't want it to be. I wouldn't want to rely on the command line for higher-value secrets.

catj-cockroach commented 2 years ago

> The three main options presented here look more or less equivalent to me, security-wise (in all cases the end result is a certificate on disk that can be used to log in as root, whether that cert is properly a client.root.crt or a modification/reuse of the tenant cert). Given that, the first option seems the best of the three because it is the easiest (and it's the easiest to reverse when we no longer need it).

I agree, and I'd also argue that further complicating our already very complicated TLS code increases the risk of a misconfiguration.

I think our best medium-term option would be to implement an RPC, SQL, or HTTP call that returns the debug.zip, and to have Intrusion access that call with a local certificate.

knz commented 2 years ago

Ben says:

> You mean passing a base64 (or similar) version of the key in the command line? That seems to have a high risk of getting logged somewhere that we don't want it to be. I wouldn't want to rely on the command line for higher-value secrets.

What do we think about that? @andy-kimball @catj-cockroach ?

JeffSwenson commented 2 years ago

I was under the impression that option 1 is a non-starter. If we put a root certificate on the SQL pod, compromising the SQL pod would provide root access to the kv layer. Or is there something in the certificate that prevents the kv layer from accepting it?

andy-kimball commented 2 years ago

Jeff is correct. Putting a root certificate in a SQL pod would allow root access to the KV layer, since we use the same CA in both cases.

knz commented 2 years ago

Jeff, option 1 goes with the idea of not storing that root cert on the pod itself and instead storing it somewhere else (presumably in Vault) so it's only visible to the SU console.

The question is then how to feed this to the command. We're proposing to have the command read it from a command-line argument (or, now that I think of it, an environment variable) and not from a file.

I also think back to my idea of running the client command from a different machine, not on the pod itself. Why did we reject that option?
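
As a concrete illustration of the environment-variable variant, here is a minimal Go sketch. The variable names COCKROACH_ROOT_CERT_B64 and COCKROACH_ROOT_KEY_B64 are hypothetical, not existing CRDB conventions; they would hold base64-encoded PEM data injected by the caller (e.g. the SU console via kubectl exec).

```go
// Sketch: build a TLS client certificate from environment variables
// instead of files on disk. COCKROACH_ROOT_CERT_B64/COCKROACH_ROOT_KEY_B64
// are hypothetical names; the key material lives only in this process's
// memory, never on the pod's filesystem.
package main

import (
	"crypto/tls"
	"encoding/base64"
	"fmt"
	"os"
)

func clientCertFromEnv() (tls.Certificate, error) {
	certPEM, err := base64.StdEncoding.DecodeString(os.Getenv("COCKROACH_ROOT_CERT_B64"))
	if err != nil {
		return tls.Certificate{}, fmt.Errorf("decoding cert: %w", err)
	}
	keyPEM, err := base64.StdEncoding.DecodeString(os.Getenv("COCKROACH_ROOT_KEY_B64"))
	if err != nil {
		return tls.Certificate{}, fmt.Errorf("decoding key: %w", err)
	}
	return tls.X509KeyPair(certPEM, keyPEM)
}

func main() {
	if _, err := clientCertFromEnv(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("client certificate loaded from environment")
}
```

Note that Ben's logging concern applies less to environment variables than to command-line arguments (on Linux, /proc/&lt;pid&gt;/cmdline is world-readable while /proc/&lt;pid&gt;/environ is restricted to the process owner and root), though they are still not a vault.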

bdarnell commented 2 years ago

If a root cert on the sql pod is a non-starter, how are options 2 and 3 any better? Doesn't that just make the tenant cert that's there equally dangerous?

I will reiterate my call to look at exactly what value debug zips are providing and how we might provide that value in a better way. There may not be a quick fix to use our legacy debug interfaces in an appropriately secure way, but I do not believe that a proper solution needs to be a huge amount of work.

JeffSwenson commented 2 years ago

Creating a separate pod or host to run the command is possible, but there is nothing like that today. I suspect it would end up being a ~month of work and is unlikely to be prioritized in the short term.

What if we ran the debug zip command on one of the pods dedicated to the kv layer? They have the root certificate and network access to the tenants.

catj-cockroach commented 2 years ago

Unless the Tenant certificate operates as a distinct user, using it as a root client certificate would be worse than having the root certificate in the pod, as we could not remove it in the future. I'd agree that the best solution would be to have a distinct call to get this information.

> You mean passing a base64 (or similar) version of the key in the command line? That seems to have a high risk of getting logged somewhere that we don't want it to be. I wouldn't want to rely on the command line for higher-value secrets.

Agreed. I would argue that this is only useful as a stopgap; we should not treat it as a solution. @andy-kimball and I spoke about an idea I have for a separate Job running in the same namespace that could access the CRDB pods over gRPC and SQL, which I think would be a far better option while we don't have separate client certificate authorities for each tenant.

rimadeodhar commented 2 years ago

Based on the discussion here, debug zip over RPC seems like the safest solution. @knz, what are your thoughts on this approach? It feels like a medium-scoped project and aligns with our long-term goal of providing all of the debug zip functionality over RPC/SQL. @catj-cockroach, do we have a way to access CRDB pods over gRPC today, or would we need to implement the separate job idea you mentioned earlier? How much of an effort do you think that would entail?

knz commented 2 years ago

That project would be great. However, the technical challenge will be to ensure that the zip data is streamed and not accumulated in RAM.

Currently the debug zip code accumulates a lot of data in RAM, in some cases in an amount proportional to the amount of data stored in the cluster. So a larger cluster means more RAM utilization in the debug zip CLI process.

If we move that logic server-side, we'll want to be very careful not to incur that RAM utilization server-side. That's important because usually by the time we need debug zip the cluster is unhealthy, and excessive RAM usage could tip it over into unavailability.
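
For illustration only (this is not the actual debug zip implementation): with Go's archive/zip, a server-side handler can stream each entry straight to the response, so memory stays bounded regardless of archive size. The endpoint path and entry names below are placeholders.

```go
// Sketch: stream a zip archive over HTTP without building it in RAM.
// Entry names and contents are placeholders; a real handler would
// stream data fetched incrementally from the cluster.
package main

import (
	"archive/zip"
	"io"
	"log"
	"net/http"
	"strings"
)

func serveDebugZip(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/zip")
	w.Header().Set("Content-Disposition", `attachment; filename="debug.zip"`)

	zw := zip.NewWriter(w) // each entry is written straight to the response
	defer zw.Close()

	entries := map[string]io.Reader{ // placeholder data sources
		"nodes.json":    strings.NewReader(`{"nodes": []}`),
		"settings.json": strings.NewReader(`{"settings": []}`),
	}
	for name, src := range entries {
		f, err := zw.Create(name)
		if err != nil {
			return // response already partially written; just stop
		}
		// io.Copy streams in small chunks, so even a very large entry
		// never needs to fit in memory at once.
		if _, err := io.Copy(f, src); err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/debug/zip", serveDebugZip)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```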

bdarnell commented 2 years ago

Which server would be serving this RPC? How is the RPC authenticated?

knz commented 2 years ago

My presumption is that the SQL pod would serve the RPC and perform a fan-out to other SQL pods to collect data from the currently live pods. Authentication would use a regular root client cert, stored in the CC SU console.

bdarnell commented 2 years ago

I'm confused. How is the SQL pod going to be able to use this root certificate if we've established that we're not willing to let a SQL pod have a root cert? I think I need a refresher on the different components and security boundaries involved here.

knz commented 2 years ago

The SQL pod would "have" a root cert but it would not be stored on the pod itself.

The threat model we have is that the pod running the SQL server can be compromised, and we don't want that pod to have access to a valid root client cert and key files.

However, we can have that pod know of a CA that can verify a client connection made with that client cert from another component. The idea is to have the SU console perform the "zip RPC" using the root client cert to authenticate. The cert would be loaded by the SU console from Vault and would never be copied to the SQL pod.
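
A minimal sketch of this trust arrangement, assuming the ca-client.crt naming mentioned later in this thread plus a hypothetical server.crt/server.key pair: the pod holds only the CA certificate needed to verify the console's client cert, never the client key itself.

```go
// Sketch: a server that verifies a root client certificate without
// ever holding the client's key. The file names are assumptions of
// this sketch.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("ca-client.crt") // public CA material only
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("no CA certs found in ca-client.crt")
	}

	serverCert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			Certificates: []tls.Certificate{serverCert},
			// The pod can authenticate the SU console's root client
			// cert with just the CA; the client key stays in Vault.
			ClientCAs:  pool,
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```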

bdarnell commented 2 years ago

But if each tenant has its own user namespace, isn't a tenant-scoped root client cert roughly equivalent in power to the tenant cert that the SQL pod uses to authenticate to the kv layer?

> However, we can have that pod know of a CA that can verify a client connection made with that client cert from another component. The idea is to have the SU console perform the "zip RPC" using the root client cert to authenticate. The cert would be loaded by the SU console from Vault and would never be copied to the SQL pod.

Anyway, this seems like a good design from a security perspective.

knz commented 2 years ago

> isn't a tenant-scoped root client cert roughly equivalent in power to the tenant cert that the SQL pod uses to authenticate to the kv layer?

Yes, we started with this, but we found it rather hard to change the code to accept that tenant cert coming from a CLI tool that needs to use SQL (RPC is not a problem, but the zip utility currently wants a SQL connection). That difficulty is what warranted exploring alternatives.

bdarnell commented 2 years ago

> Yes, we started with this, but we found it rather hard to change the code to accept that tenant cert coming from a CLI tool that needs to use SQL

Ack. But my point wasn't that we should use this tenant cert for debug zips; it's that because we keep this tenant cert on the SQL pod, we shouldn't rule out storing a (tenant-scoped) root cert there too.

rimadeodhar commented 2 years ago

I have another proposal on using certificates for enabling debug zip access for serverless clusters (inspired by @bdarnell's comment on tenant-scoped root cert). What I'm proposing is:

One advantage of this solution is that it allows us to use the debug zip code as-is today, and it addresses security concerns over the use of a root client certificate on a serverless cluster by limiting the power of the cert. I also believe it addresses @catj-cockroach's concern about not complicating the TLS config code any further, since this logic is straightforward to understand and implement.

@bdarnell, I do want to clarify that this is only meant as a secure, short-term solution to unblock observability for our existing serverless clusters. I agree with your larger point about providing better alternatives to debug zip in the cloud environment, and this is being accounted for in our roadmap. However, unlocking access to debug-zip-like data is on the critical path for serverless observability, and providing this interim solution will buy us more time to plan and execute the larger project of alternatives to debug zip.

knz commented 2 years ago

I like this proposal.

bdarnell commented 2 years ago

> One advantage of this solution is that it allows us to use the debug zip code as-is today, and it addresses security concerns over the use of a root client certificate on a serverless cluster by limiting the power of the cert.

Are client certs not naturally tenant-scoped by the use of a per-tenant ca-client.crt? I guess we wouldn't have had to build that yet since we don't otherwise use cert auth for clients. This is probably something we should consider.

Overall this sounds like a good plan. I would even make the tenant-restriction field mandatory when using a non-system tenant (I'd do it for the system tenant too, but that will take more care to phase in). The one nit I have is that we don't want to be packing more information into the CN; this should probably go in another SAN, or maybe @catj-cockroach has a better idea about the proper way to encode this.

catj-cockroach commented 2 years ago

I like your proposal @rimadeodhar! I agree with @bdarnell that we shouldn't use the CN to validate the tenant ID, though. Instead, I'd argue we use a URI Subject Alternative Name to mark a client certificate as coming from a tenant (I proposed crdb://tenant/x, where x is the tenant ID, in my TLS Simplification RFC).
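
For illustration, here is a hedged sketch of how a verifier could recover the tenant ID from such a URI SAN. The crdb://tenant/&lt;id&gt; scheme is the proposal above, not shipped behavior, and this is not CRDB's actual certificate-validation code.

```go
// Sketch: extract a tenant ID from a certificate's URI SANs using the
// proposed crdb://tenant/<id> convention.
package main

import (
	"crypto/x509"
	"fmt"
	"net/url"
	"strings"
)

// tenantIDFromCert returns the tenant ID embedded as a URI SAN, if any.
func tenantIDFromCert(cert *x509.Certificate) (string, bool) {
	for _, u := range cert.URIs {
		if u.Scheme == "crdb" && u.Host == "tenant" {
			if id := strings.TrimPrefix(u.Path, "/"); id != "" {
				return id, true
			}
		}
	}
	return "", false
}

func main() {
	// Stand-in certificate carrying the SAN crdb://tenant/42.
	u, _ := url.Parse("crdb://tenant/42")
	cert := &x509.Certificate{URIs: []*url.URL{u}}
	if id, ok := tenantIDFromCert(cert); ok {
		fmt.Println("tenant-scoped cert for tenant", id) // prints 42
	}
}
```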

> Are client certs not naturally tenant-scoped by the use of a per-tenant ca-client.crt? I guess we wouldn't have had to build that yet since we don't otherwise use cert auth for clients. This is probably something we should consider.

Not presently, but @andy-kimball and I have been discussing implementing per-tenant CA certificates. I believe that starting with a per-tenant ca-client.crt would take very little time to implement compared to rolling out per-tenant CA certificates, and it would allow us to use the certificate auth that exists in CRDB today. Unfortunately, if the KV layer is where the gRPC/debug zip logic lives, we would have to trust all the tenant CA certificates in the KV layer to allow connections to it, which would put us back in the same place as having a globally scoped client.root.crt in the pod.

rimadeodhar commented 2 years ago

> The one nit I have is that we don't want to be packing more information into the CN.

Agreed! My original plan was to put the client ID in a SAN and the tenant ID in the CN. But I'll go with @catj-cockroach's suggestion to use the tenant ID as a URI SAN to scope the client certificate to a tenant.

Thanks everyone for the discussion and feedback. This was very helpful. Since we have consensus on this, I'm going to go ahead and get started on the work here.

knz commented 2 years ago

cc @jtsiros for tracking

mari-crl commented 2 years ago

:information_source: Hello! I am a human and not at all a robot! Look at my very human username! :robot: :notes: :thinking: Although I tried very hard to figure out what to do with this issue, more powerful human brains will need to help me. (specifically: Both Github and Jira issues were modified after exalate problems) :confounded: :arrows_counterclockwise: Please visit this issue's mirror at CRDB-13873 and try to sync the two sides up manually. :star2: :white_check_mark: When you're finished, comment saying as much and a member of Developer Infrastructure will be along to finish linking. :link: :no_entry_sign: Note that until this is done, this issue is not and will not be synced to Jira with Exalate. :no_entry_sign: :sweat_smile: Feeling lost? Don't worry about it! A member of @cockroachdb/exalate-22-cleanup-team will be along shortly to help! :+1: :construction_worker: Developer Infrastructure members: when ready, open Exalate from the right-hand menu of the mirror issue in Jira, then choose Connect and enter this issue's URN: cockroachdb/cockroach-77958. Either way, delete this comment when you're done. :key: :pray: Thank you for your compliance, my fellow humans! :robot: :wave:

rail commented 2 years ago

Manually synced with Jira