cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.97k stars 3.79k forks source link

Allow specific SQL queries to skip admission control #81415

Open logston opened 2 years ago

logston commented 2 years ago

Is your feature request related to a problem? Please describe.

To determine CRDB uptime and response latencies, a periodic SQL request can be sent to a running cluster. For example, a query like the following ...

UPSERT INTO testing_table (key, value) VALUES ('test_key', 'test_value');

... can be sent to CRDB and successful completion and the duration of the query can be recorded.

This method can work exceptionally well for the majority of cases. However, in under provisioned clusters, noise can be introduced by competing workloads. The failure of these SQL probes are not true "cluster down" failures but are false negatives due to the crowding out of the SQL probe queries by other work on the cluster.

Preferably, databases and/or tables could be marked at a higher priority allowing queries bound for only those resources to "skip the queue". In other words, a query bound for said database/table could skip admission control.

Describe the solution you'd like

I do not have sound suggestions for the API currently. But, given the fact that admission control is managed via cluster settings, perhaps a cluster setting could be used to define the database(s) that could skip admission control. Seemingly harder but more robust might be to add a new field to the ZONE CONFIGURATION logic. For example:

ALTER TABLE testing_table 
CONFIGURE ZONE USING skip_admission_control = true;

Jira issue: CRDB-15200

ajwerner commented 2 years ago

I think zone configs are a somewhat poor fit here. Admission control is relevant at multiple levels of the stack, and some of them are not going to know about the table being hit by a query. Furthermore, admission settings are more or less configured today on a per statement execution basis.

The mechanism we added to control how execution interacts with admission control is via a session variable. Currently the only thing we expose to control how a statement interacts with admission control is via setting SET default_transaction_quality_of_service=[background,regular,critical]. Now, if the user sets all of their stuff to critical this isn't going to help you. However, I have to imagine that most of the time, this will help you. One thing we can do is define more qos levels and define them such that they are above where the user can go, though we'll have to find a back-channel to authorize our periodic thing in a special way.

https://github.com/cockroachdb/cockroach/pull/76512

logston commented 2 years ago

Thanks @ajwerner! For reasons I won't get in to, I need to work on open source work for about two weeks. Is there any part of this work that you think would be tractable in that period? In other words, starter project scale? I'm effectively new to the CRDB code base and would need time to ramp up.

ajwerner commented 2 years ago

I think this can be extremely small. Firstly, let's confirm that setting the admission control quality of service level to something very high via a session variable is something that solves the problem? If it is, then have a look at the above PR, which is what introduced that mechanism. In particular, one area of focus might be: https://github.com/cockroachdb/cockroach/blob/2a75254d37a9704adc6b34ec2133755bc9fd0119/pkg/sql/sessiondatapb/local_only_session_data.go#L232-L298

You'll notice that we only allow you to specify the three levels I mentioned above: https://github.com/cockroachdb/cockroach/blob/2a75254d37a9704adc6b34ec2133755bc9fd0119/pkg/sql/sessiondatapb/local_only_session_data.go#L302-L308

Now, I suspect if we really want this to be, more or less, bulletproof, we'll want to run these queries at a priority that exceeds any user queries. The admission priority concept is represented in one byte. Its values live here: https://github.com/cockroachdb/cockroach/blob/fccce4b563d1600dd1d194e7cf07612aec8e075e/pkg/util/admission/admissionpb/admissionpb.go#L21-L39

I guess what we want is to add a 4th qos level which maps to an admission priority somewhere between LockingPri and HighPri. Then, we'd want to restrict that value to only some special user principle. This work falls somewhere between SQL queries who shepherds admission control integration and SQL experience who owns sessiondata and also user authorization.

The tricky thing will be that the context in which we set this value is a context in which we have a bunch of session information, but not everything.

https://github.com/cockroachdb/cockroach/blob/f2092b69ee0ae1ceb65cc7fcf0c0012da6ece871/pkg/sql/vars.go#L1945-L1949

I guess the question is, what should exactly be the restriction on who should be able to use the higher than high priority level? Should all admins? If so this is easy, we have access to that information here in:

m.data.IsSuperuser, but if we want something else, then we'll have to add a boolean or something.

I think I probably just said more words than the bulk of the patch will be. Assuming I understand the goals and constraints (which is a very big assumption), I think this should be a rather small change.

logston commented 2 years ago

This is great direction @ajwerner! Thanks! I'll start looking into this on Tuesday of next week.

joshimhoff commented 2 years ago

Currently the only thing we expose to control how a statement interacts with admission control is via setting SET default_transaction_quality_of_service=[background,regular,critical].

Good point. We should update sqlprober to use critical ~now.

I guess what we want is to add a 4th qos level which maps to an admission priority somewhere between LockingPri and HighPri.

Can you explain your rationale for this suggestion? Will such a value result in skipping admission control queues completely like kvprober? What is sent at LockingPri & HighPri exactly as of today?

I guess the question is, what should exactly be the restriction on who should be able to use the higher than high priority level? Should all admins? If so this is easy, we have access to that information here in:

We give CC users admin, so I think that's out. Other than root, there is not a CC user perm level that is higher than our users, and we shouldn't use root for sqlprober.

If this ticket is really only useful for CC, we could introduce a CRBD CLI flag / env var that designates a specific SQL user as able to set QOS to the new value, as CC users can't set CLI flags / env vars today. A bit ugly but it would work.

Perhaps a cleaner solution relies on dedicated + serverless being unified, as then customers won't have access to the system tenant? Else, I think we maybe want a SQL perm level that is not root but higher than our users have. Either way, I don't think we want to depend on future work like this.

logston commented 2 years ago

Thanks for the input @joshimhoff !

Like I mentioned, I'm still wrapping my head around the crdb code base so I have several questions. Here's what I understand so far; correct me if I'm wrong.

- kvprober is able to, but is not required to, skip admission control

After looking at the kvprober code, it seems we can skip admission control (AC) by creating a transaction with TxnRootKV. TxnRootKV skips AC by directly calling runTxn that in turn calls Txn.exec; https://github.com/cockroachdb/cockroach/blob/709881d4a1390ad68a7ddc2700c15ef6b195d928/pkg/kv/db.go#L934-L945 This function has a very specific demand in its doc string...

Do not use this for executing work originating in SQL

One question I have here is why is the variable bypassAdmissionControl named so? It seems that setting kv.prober.bypass_admission_control.enabled to true would cause kvprober to skip AC. However, when I look at the code, it seems that only time we use a TxnRootKV is when kv.prober.bypass_admission_control.enabled to false. I must be missing something. @joshimhoff, can you point out what I'm missing here in terms of context?

Assuming a path with AC enabled, the call to Txn will create a call to TxnWithAdmissionControl which handle setting all the AC params on the Txn.

- We can't, and probably do not want to, send queries originating in SQL through TxnRootKV

If we wanted to allow some SQL queries to skip admission control just like kvprober does, we would need to expose TxnRootKV or something like it. I am assuming that is something we do not want to do given that opens up a path to subvert the protections that admission control was designed to maintain. Thus, we are left with using a very high level of priority (as suggested by @ajwerner) and the AC system.

- A new QoS level

If we create a new level QoS level, let's call it InternalNormal, that will allow some queries to be queued with a higher priority than normal user queries, we might be able to assign said level to the queries coming in from a specific user. However, the means for assigning this QoS level to a specific user are not yet created; thus, @joshimhoff's suggestion of an env var.

In my very limited experience, creating a QoS level that only predefined users can access seems pretty sound. Means of defining the the users that have access that I know of are cluster settings, command line arguments, env vars, and SQL connection string parameters (with same some special token that permits access to the QoS level). Out of all of these, use of a cluster setting sounds the best but only in a world where normal users can not change their permissions; ie Serverless as Josh pointed out.

I think the next best option after a cluster setting set within the system tenant would be an env var. The reason being is that the introduction of new env vars in CC has had a smoother history than the introduction of new CLI flags. I'm ignoring connection string params as a valid option given all of the security burden that come to mind.

- Draft design of the use of a new env var

My tentative plan is to add something like this somewhere...

        allowedInternalUsers := settings.RegisterStringSetting(
        settings.SystemOnly,
        "cluster.allowed_internal_users",
        "list of users allowed to use internal QoS levels",
        envutil.EnvOrDefaultString("COCKROACH_CLUSTER_ALLOWED_INTERNAL_USERS", ""),
    )
    internalQoSLevels := settings.RegisterStringSetting(
        settings.SystemOnly,
        "cluster.internal_qos_levels",
        "internal QoS levels that users defined by allowed_internal_users may assume",
        "internal_normal",
    )

This would allow us to pass COCKROACH_CLUSTER_ALLOWED_INTERNAL_USERS=[cc_internal_user] to the cockroach process in CC. I would need confirmation that dedicated clusters would not be able to change this cluster setting OR would need to check env var directly when setting the default_transaction_quality_of_service session var. Next, the logic for setting default_transaction_quality_of_service would need to be updated to ensure that QoS is set to a valid value allowed by internalQoSLevels. I'm thinking of placing this in pkg/sql/set_var.go.

I'm far from sure this is the best solution, but given that any coding work on this would be good practice for me in working with the CRDB code base, I am going to give it a go. I'll check back in later to see what the thoughts are on my previous statements.

ajwerner commented 2 years ago

Let me see if I can coax @msirek into opining here or pulling somebody else into the fold; @cockroachdb/sql-queries is likely to end up holding the bag on whatever we do here.

joshimhoff commented 2 years ago

Thanks, @ajwerner! If you're feeling busy & want me to do the poking, just let me know.