cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: limit the number of kv spans configurable by a tenant #70555

Closed · irfansharif closed this issue 2 years ago

irfansharif commented 3 years ago

#66348 outlined a scheme to support zone configs for secondary tenants. Since the unit of what we can configure in KV is a Range, tenants being able to set zone configs grants them the ability to induce an arbitrary number of splits in KV -- something we want to put guardrails against (see the RFC for more details).

There's some discussion elsewhere for how we could achieve this, copying over one in particular:

We could have each pod lease out a local quota, counting against the tenant's global limit. For N pods of a tenant and a limit L, we could either lease out partial quotas L/N, or have a single pod lease out L entirely and require all DDLs/zcfgs to go through that one pod. The scheme below describes the general case of each pod getting some fractional share of the total quota (a rough code sketch follows the list).

  • When each pod considers a DDL/zcfg txn, before committing it will attempt to acquire a quota from its local leased limit. The quota acquired is an over-estimate of how much the txn will cost (in terms of additional splits induced); we cannot under-estimate as KV could then later reject the committed zcfg.
    • If there's insufficient quota, we'll abort the txn and error out accordingly.
    • If there's sufficient quota, we'll commit the txn and rely on the eventual reconciliation to reliably succeed. We'll queue up a work item for the reconciliation job, marking alongside it the lease expiration of the quota it was accepted with.
  • When the local pool runs out of quota and has to request/lease more, it needs to ensure that all the txns it has committed have been reconciled. We don't want a situation where quota is leased and allocated to an operation in the work queue, that item lingers in the queue past its lease expiration date, and we're then able to lease the "same" quota again and queue up a second work item that will now get rejected. We need to make sure that when we lease a limit at time t, there are no unprocessed work items with leased tokens expiring before t.
  • Whenever the work item is processed, the amount we over-estimated by could either be returned to the local quota pool (with the same expiration ts), or we could simply drop it entirely and rely on the local quota pool to eventually re-acquire it from the global pool.
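To make that bookkeeping concrete, here is the rough Go sketch referenced earlier. Every name in it (globalLimiter, localPool, acquire, markReconciled) is hypothetical rather than actual CockroachDB code; it only illustrates the over-estimation, lease-expiration, and unreconciled-work checks described in the list, assuming the reconciliation job calls back once a work item has been applied.

package quota

import (
	"errors"
	"time"
)

// globalLimiter hands out leased quota against the tenant's global span limit.
// In the scheme above this would be backed by the host tenant; here it's
// abstracted behind an interface.
type globalLimiter interface {
	// Lease requests up to n units of quota, returning the amount granted and
	// the lease's expiration.
	Lease(n int64) (granted int64, expiration time.Time, err error)
}

// localPool is the per-pod pool that DDL/zone-config txns draw from.
type localPool struct {
	global     globalLimiter
	available  int64       // leased quota not yet handed out to a txn
	expiration time.Time   // expiration of the current lease
	pending    []time.Time // lease expirations of committed-but-unreconciled work items
}

var errQuotaExhausted = errors.New("span quota exhausted")

// acquire reserves an over-estimate of the splits a txn may induce; the caller
// aborts the txn if an error is returned.
func (p *localPool) acquire(overEstimate int64, now time.Time) error {
	if p.available < overEstimate || now.After(p.expiration) {
		// Before leasing more, every work item accepted under an earlier lease
		// must have been reconciled; otherwise we could lease the "same" quota
		// twice and queue up a work item that KV would later reject.
		for _, exp := range p.pending {
			if exp.Before(now) {
				return errors.New("unreconciled work item with an expired lease")
			}
		}
		granted, exp, err := p.global.Lease(overEstimate)
		if err != nil || granted < overEstimate {
			return errQuotaExhausted
		}
		p.available += granted
		p.expiration = exp
	}
	p.available -= overEstimate
	p.pending = append(p.pending, p.expiration)
	return nil
}

// markReconciled is called once the reconciliation job has applied work item i;
// the unused part of its over-estimate is returned to the pool (it could just
// as well be dropped and re-leased from the global pool later).
func (p *localPool) markReconciled(i int, unused int64) {
	p.pending = append(p.pending[:i], p.pending[i+1:]...)
	p.available += unused
}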

The RFC also captures a possible schema we could use on the host tenant to store these limits:

CREATE TABLE system.spans_configurations_per_tenant (
    tenant_id               INT,
    max_span_configurations INT
);

The actual RPCs/interfaces are TBD. Probably we also want the ability to configure these limits on a per-tenant basis.
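Purely as an illustration of how that stored limit might be consulted (the actual RPCs/interfaces are TBD, so every name below is an assumption rather than real code), a host-side check could look roughly like this in Go:

package spanlimit

import "fmt"

// limitReader abstracts reading max_span_configurations for a tenant, e.g.
// from the system.spans_configurations_per_tenant table sketched above.
type limitReader interface {
	MaxSpanConfigurations(tenantID int64) (int64, error)
}

// checkUpdate rejects a span config update that would push the tenant's count
// of configured spans (current + delta) past the tenant's limit.
func checkUpdate(r limitReader, tenantID, current, delta int64) error {
	limit, err := r.MaxSpanConfigurations(tenantID)
	if err != nil {
		return err
	}
	if current+delta > limit {
		return fmt.Errorf("tenant %d: %d span configurations exceeds limit of %d",
			tenantID, current+delta, limit)
	}
	return nil
}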

Epic CRDB-10563

Jira issue: CRDB-10116

irfansharif commented 2 years ago

This is something we'll try tackling (or at least prototyping) in November. A brief plug for why it's important:

After #67679, enabling the span configs infrastructure by default (something we're also looking to do over November) would mean that for every single schema object in a multi-tenant cluster, we would be inducing a range split. This is exactly what happens with the host tenant today where every table/index/partition gets its own range. Before #67679, for each tenant we started off with just a mega-range containing the entire tenant ID prefix.

--- Seeing range boundaries on the host tenant, with a single secondary tenant pre-67679. The last one is of interest.
> SELECT range_id, start_pretty, end_pretty from crdb_internal.ranges;
  range_id |         start_pretty          |          end_pretty
-----------+-------------------------------+--------------------------------
         1 | /Min                          | /System/NodeLiveness
         2 | /System/NodeLiveness          | /System/NodeLivenessMax
         3 | /System/NodeLivenessMax       | /System/tsd
         4 | /System/tsd                   | /System/"tse"
         5 | /System/"tse"                 | /Table/SystemConfigSpan/Start
         6 | /Table/SystemConfigSpan/Start | /Table/11
         7 | /Table/11                     | /Table/12
         8 | /Table/12                     | /Table/13
                                        ...
        42 | /Table/46                     | /Table/47
        43 | /Table/47                     | /Tenant/123
        44 | /Tenant/123                   | /Max

After #67679 + enabling the infra.

> SELECT range_id, start_pretty, end_pretty from crdb_internal.ranges;
  range_id |          start_pretty          |           end_pretty
-----------+--------------------------------+---------------------------------
                                        ...
        42 | /Table/46                      | /Table/47
        43 | /Table/47                      | /Tenant/123
        44 | /Tenant/123                    | /Tenant/123/Table/3
        45 | /Tenant/123/Table/3            | /Tenant/123/Table/4
        46 | /Tenant/123/Table/4            | /Tenant/123/Table/5
        47 | /Tenant/123/Table/5            | /Tenant/123/Table/6
                                        ...
        83 | /Tenant/123/Table/45           | /Tenant/123/Table/46
        84 | /Tenant/123/Table/46           | /Tenant/123/Table/47
        85 | /Tenant/123/Table/47           | /Max

Having O(descriptors across all tenants) ranges in the system is going to be difficult for us to work with, and can readily be used as an attack vector. We'd either need to impose a hard limit on the number of descriptor objects a tenant is allowed to have, or do what this issue suggests -- introducing a limit based on the number of individually configurable spans. This is still a pretty rudimentary safeguard, but fear not, we can do much better on the KV side to move away from always splitting on descriptor boundaries: #72389.

(+cc @shralex, @nvanbenschoten, @ajwerner, @arulajmani)

irfansharif commented 2 years ago

In the short term (22.1), all we'll do is maintain a counter for secondary tenants and, for every schema change, use a translator-like component to determine how many splits are being added/removed. We don't need to look at the corresponding zone config for the descriptor, as that has no bearing here. We'll increment/decrement the counter accordingly, comparing it against a per-tenant limit exposed by the host -- perhaps a "tenant read-only cluster setting", to use the verbiage from the multi-tenant cluster settings RFC. If we're over the limit, we'll abort the schema change txn. KV would of course still maintain its own internal counter of splits per tenant, and use it to reject span config updates that go overboard. For unmodified tenants, we don't expect to run into these errors.
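A rough Go sketch of that check, with every name (txn, translator, maybeUpdateSpanCounter) invented for illustration rather than taken from the actual codebase:

package spancount

import (
	"context"
	"errors"
)

// txn stands in for the schema-change transaction; only the pieces the sketch
// needs are shown. The counter lives in the tenant's own keyspace.
type txn interface {
	ReadSpanCounter(ctx context.Context) (int64, error)
	WriteSpanCounter(ctx context.Context, v int64) error
}

// translator stands in for the translator-like component that diffs the spans
// implied by the old and new descriptors; the zone configs themselves don't
// affect the count.
type translator interface {
	SplitDelta(ctx context.Context) (int64, error)
}

var errOverLimit = errors.New("schema change exceeds the tenant's span configuration limit")

// maybeUpdateSpanCounter runs before committing a schema-change txn. limit is
// the per-tenant value exposed by the host (e.g. a tenant read-only cluster
// setting). Returning an error aborts the schema change.
func maybeUpdateSpanCounter(ctx context.Context, t txn, tr translator, limit int64) error {
	current, err := t.ReadSpanCounter(ctx)
	if err != nil {
		return err
	}
	delta, err := tr.SplitDelta(ctx)
	if err != nil {
		return err
	}
	if current+delta > limit {
		return errOverLimit
	}
	return t.WriteSpanCounter(ctx, current+delta)
}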

This scheme, in contrast to the one described above, has the downside that it would serialize all schema changes. That feels reasonable for the short term, especially considering it only applies to secondary tenants. Incorporating leasing in any form would purely be a performance optimization -- to have real schema change concurrency.

ajstorm commented 2 years ago

How would this work for internal APIs like AdminSplit? Would they get caught by this mechanism and tracked accordingly? Or would they enter KV below the "schema change"-level tracking? I'm asking because we may be able to enable AdminSplit directly if it gets tracked by this scheme.

irfansharif commented 2 years ago

> we may be able to enable AdminSplit directly if it gets tracked by this scheme.

I don't think AdminSplit can directly leverage the mechanisms described above, at least not in this current form and not without additional legwork. AdminSplits are issued directly to KV as non-transactional requests (follow code here). The SQL-side counter we're maintaining is updated transactionally, in the tenant's own keyspace. Crucially, it's maintained without ever having to ask KV about its own trackers. To track an AdminSplit request (these typically have expiration timestamps) we'd have to maintain this SQL-side counter in tandem with issuing the request to KV, where it may or may not be rejected (some form of 2PC maybe?). In addition, once the AdminSplit request has expired, the SQL-side counter would probably want to reflect that -- it would perhaps need to reach out to KV to reconcile the current state of those requests. This is all doable with some effort, and there might be simpler approaches, but I'm not sure it's quite as "drop-in" as everything else this issue is focused on.
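Purely to illustrate that coordination gap (all names below are hypothetical and this isn't a proposal), the two-step dance would look roughly like this:

package adminsplit

import (
	"context"
	"time"
)

// splitIssuer stands in for issuing a non-transactional AdminSplit to KV.
type splitIssuer interface {
	AdminSplit(ctx context.Context, key []byte, expiration time.Time) error
}

// counterTxn stands in for the tenant-side transactional counter update.
type counterTxn interface {
	IncrementSpanCounter(ctx context.Context, delta int64) error
}

// trackedAdminSplit bumps the SQL-side counter and then issues the split. The
// two steps aren't atomic: KV may reject the split after the counter commits,
// and the split's expiration later invalidates the count anyway, which is why
// some form of 2PC or reconciliation against KV's own trackers would be needed.
func trackedAdminSplit(ctx context.Context, c counterTxn, s splitIssuer, key []byte, exp time.Time) error {
	if err := c.IncrementSpanCounter(ctx, 1); err != nil {
		return err
	}
	if err := s.AdminSplit(ctx, key, exp); err != nil {
		// Best-effort compensation; nothing guarantees this succeeds either.
		_ = c.IncrementSpanCounter(ctx, -1)
		return err
	}
	return nil
}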

I know we recently filed #74389, should we brainstorm there? It's not yet clear to me that we even want to carry forward AdminSplit/Scatter as an API, let alone extend it to secondary tenants. There was some KV-internal discussion here on the subject.

ajwerner commented 2 years ago

More internal discussion on AdminSplit and "advisory splits" here.

blathers-crl[bot] commented 2 years ago

Hi @irfansharif, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

kenliu-crl commented 2 years ago

manually reviewed and brought up to date