cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

sql: distribute UPDATE/DELETE statements in DistSQL flows #49606

Open nvanbenschoten opened 4 years ago

nvanbenschoten commented 4 years ago

UPDATE and DELETE statements (ignoring fast-paths) both perform an initial row-fetch to retrieve the current value of a row before issuing their writes. From the perspective of KV, this leads to the following sequence of operations:

UPDATE:
 Scan(k)
 <some logic on gateway>
 Put(k)

DELETE:
 Scan(k)
 <some logic on gateway>
 Delete(k)
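For concreteness, here is a minimal Go sketch of that read-modify-write shape, using a made-up in-memory `kv` type (the names and signatures are illustrative only, not the real `pkg/kv` client API). The point is that the intermediate logic runs on the gateway, between two gateway<->leaseholder round trips:

```go
package main

import "fmt"

// kv is a hypothetical, simplified stand-in for the KV client; it is not
// CockroachDB's real pkg/kv API.
type kv map[string]string

// Scan returns the current value of a key. In the real system this is the
// first gateway<->leaseholder round trip.
func (m kv) Scan(key string) string { return m[key] }

// Put writes the new value. This is the second gateway<->leaseholder round trip.
func (m kv) Put(key, value string) { m[key] = value }

// updateRow shows where the "<some logic on gateway>" step lives: the current
// value must travel back to the gateway before the new value can be computed
// and written.
func updateRow(store kv, key string, compute func(old string) string) {
	old := store.Scan(key)       // round trip 1
	store.Put(key, compute(old)) // round trip 2, after gateway-side logic
}

func main() {
	store := kv{"k": "v1"}
	updateRow(store, "k", func(old string) string { return old + "+delta" })
	fmt.Println(store["k"]) // "v1+delta"
}
```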

This works well, and we've recently put work into acquiring locks during the Scan operation to improve throughput under contention. However, this leaves room for improvement in terms of latency in geo-distributed clusters where the leaseholder is not local to the gateway node.

Imagine a topology where the gateway<->leaseholder RTT takes 20ms and replication takes 20ms. This may look like the following:

[replica/gateway] < --- 20 ms --- > [leaseholder] < --- 20 ms --- > [replica]

In this topology, both the Scan and the Put/Delete need to pay the 20ms gateway<->leaseholder latency cost. So while an INSERT into this topology takes 40ms (a blind CPut), an UPDATE or DELETE takes 60ms. Ideally, these operations should also take 40ms.

We've talked about pushing complexity into the KV API to allow for some form of "masked put" that could combine the Scan and Put operations, but this seems misguided and inflexible. Especially now that we can hold cheap locks between the Scan and the Put, it would be better to avoid expanding the KV API just to relocate the intermediate logic. We already have infrastructure to distribute / "push down" computation in a cluster to improve locality - DistSQL. We should use that here.

In the past, we've talked about DistSQL controlled mutations as a means of improving throughput. It would allow us to run MapReduce-style dataflow jobs in service of expensive SQL statements like INSERT INTO t2 (SELECT * FROM t1). This seems important, but not nearly as important as using DistSQL to distribute mutations for the purpose of reducing latency. We know customers that would immediately benefit from the latter improvement.

Blockers

I'm not an expert in this area, so there are certainly things I am missing, but here are the two blockers to this work that I'm immediately aware of:

  1. we don't allow mutations on LeafTxn coordinators. This seems tricky to get right, especially in the area of refreshes and intent tracking, but this is certainly tractable.
  2. ~it's my understanding (I may be wrong about this) that DistSQL is optimized for throughput and not latency. The most obvious example of this is that DistSQL takes an entire RTT just to set up a flow, before actually starting it. For low-latency requirements, we absolutely need to establish and start the flow in a single pass. This is a problem that we should fix independently of what's written here, but it's also a hard blocker for any of this to be worth it.~

cc. @jordanlewis @RaduBerinde @andreimatei

Jira issue: CRDB-4215

gz#16113

RaduBerinde commented 4 years ago

2 is not accurate, the instantiation and starting of the flow are not separate steps; it's just one API call (SetupFlow).

nvanbenschoten commented 4 years ago

> 2 is not accurate, the instantiation and starting of the flow are not separate steps; it's just one API call (SetupFlow).

@jordanlewis or @asubiotto do you mind confirming before I cross this blocker off? I have a distinct memory of having a conversation with one of you a few months ago about this exact question.

Put another way, if I have a leaseholder across a 100ms RTT link and I decide to put a TableReader on that node to satisfy a point-lookup SELECT statement, how long will that statement take to run from start to finish? ~100ms or ~200ms?

andy-kimball commented 4 years ago

+@rytaft who has been thinking about optimizing latencies.

nvanbenschoten commented 4 years ago

@andreimatei and I discussed this today and while this all mostly checked out, he noticed that we'll need to continue to hit the 1PC path in implicit transactions for this to be beneficial. That means that we'll need to have the remote LeafTxn coordinator commit the transaction. Or we'll need to establish the RootTxn on a different node than the gateway, perhaps when we can prove that the transaction will not outlive the DistSQL flow. Neither of these seems like a great option, but we'll have to do something here.

andy-kimball commented 4 years ago

This is exactly the kind of thing that the optimizer should be doing. It should be scheduling the distribution of operators across the cluster, based on costing (latency and throughput). This is another great use case to consider as we design this capability.

asubiotto commented 4 years ago

> @jordanlewis or @asubiotto do you mind confirming before I cross this blocker off? I have a distinct memory of having a conversation with one of you a few months ago about this exact question.

@RaduBerinde is right. The SetupFlow RPC sets up and starts the flow on a remote node which then immediately starts pushing rows to the gateway. I think we had a conversation about a different part of the setup process.

knz commented 4 years ago

> we don't allow mutations on LeafTxn coordinators. This seems tricky to get right, especially in the area of refreshes and intent tracking, but this is certainly tractable.

Regarding this point specifically: there is a potpourri of obstacles that need to be addressed; they are covered and illustrated in the txn coord sender tech note (in docs/tech-notes).

nvanbenschoten commented 4 years ago

> @RaduBerinde is right. The SetupFlow RPC sets up and starts the flow on a remote node which then immediately starts pushing rows to the gateway. I think we had a conversation about a different part of the setup process.

That's great news. I've crossed that blocker off then. We're one step closer.

Do you mind reminding me what that conversation was about? What other part of the setup process is there?

> There is a potpourri of obstacles that need to be addressed; they are covered and illustrated in the txn coord sender tech note (in docs/tech-notes)

txn_coord_sender.md gives a helpful overview, thanks for pointing that out.

asubiotto commented 3 years ago

Regarding allowing mutations on leaf txns, what needs to be done there? cc @andreimatei

andreimatei commented 3 years ago

> Regarding allowing mutations on leaf txns, what needs to be done there?

To allow mutations to be distributed (run concurrently by multiple nodes) we'd need a lot of things - some story about collecting the in-flight writes from all participants, allocating sequence numbers in a distributed way somehow. I think a lot of the TxnCoordSender logic would need to change; I don't think we're close to that. This issue, I guess, only wants a more restricted case, where only one node performs the mutations, but that node is not the gateway. And I guess there are two sub-cases: the remote node commits the txn (for 1PC in implicit txns) and the remote node doesn't. When the remote node commits the txn, I think the remote node should just use a root txn. When it doesn't commit, I think it should still use a root txn, and maybe we can extend the code we have for savepoint support to marshal back and forth all the transaction's data.
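To make the two sub-cases concrete, here is a rough sketch with entirely hypothetical names (nothing here is the real TxnCoordSender or DistSQL API); it only illustrates the decision being described:

```go
package main

import "fmt"

// mutationPlan is a hypothetical description of the restricted case above:
// a single remote node performs the mutation. It is not a real CockroachDB type.
type mutationPlan struct {
	remoteNode      int  // node close to the leaseholder(s)
	commitsRemotely bool // true for the 1PC / implicit-txn case
}

// runRemoteMutation sketches the two sub-cases discussed above.
func runRemoteMutation(p mutationPlan) {
	if p.commitsRemotely {
		// Sub-case 1: the remote node commits (1PC for implicit txns), so it
		// can simply own a root txn for the whole statement.
		fmt.Printf("node %d: create root txn, run mutation, commit (1PC)\n", p.remoteNode)
		return
	}
	// Sub-case 2: the remote node does not commit. It still uses a root txn,
	// but the transaction's state must be marshaled back to the gateway
	// afterwards (e.g. by extending the savepoint machinery).
	fmt.Printf("node %d: create root txn, run mutation, ship txn state back to gateway\n", p.remoteNode)
}

func main() {
	runRemoteMutation(mutationPlan{remoteNode: 3, commitsRemotely: true})
	runRemoteMutation(mutationPlan{remoteNode: 3, commitsRemotely: false})
}
```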

mgartner commented 3 years ago

cc @awoods187 @kevin-v-ngo for prioritization

nvanbenschoten commented 3 years ago

This is coming up in a few multi-region customer deployments. Customers are seeing that contended single-statement UPDATEs where load originates from multiple regions (so there's no ideal leaseholder location) perform quite poorly.

It's clear to see why from the original discussion here. Each transaction jumps from the gateway to the leaseholder, acquires unreplicated locks on the contended row while Scanning, returns to the gateway, then jumps back to the leaseholder to issue a combined Put and EndTxn batch. This trip back to the gateway between the Scan and the Put balloons the contention footprint of these statements, which compounds as concurrency grows. Worse, we've seen cases where this all takes so long that the transaction needs to refresh between the Put and the EndTxn, which requires another 3 gateway/leaseholder round-trips - one to refresh, one to Put, and one to EndTxn (because we split the Put and the EndTxn after the refresh).

All of this is begging for us to ship the UPDATE to the remote node.
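To put rough numbers on that contention footprint (back-of-the-envelope only, reusing the 20ms gateway<->leaseholder RTT from the example topology above; these are not measured figures):

```go
package main

import "fmt"

func main() {
	const rtt = 20 // ms, gateway<->leaseholder, from the example topology above

	// Happy path: Scan acquires locks, results return to the gateway, then a
	// combined Put+EndTxn batch goes back to the leaseholder. Locks are held
	// for roughly two round trips.
	happy := 2 * rtt

	// Refresh path described above: roughly three more round trips (refresh,
	// Put, EndTxn) while still holding locks.
	withRefresh := happy + 3*rtt

	fmt.Printf("locks held ~%dms (happy path), ~%dms (with a refresh)\n",
		happy, withRefresh)
}
```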

awoods187 commented 3 years ago

Do we have an estimate on the work required to support this item? Do we view it more as a SQL Queries item or a multi-region item? Let's discuss during planning @rytaft

knz commented 3 years ago

Can't speak for SQL (any more), but this requires quite a few moving pieces in KV as well.

The main issue is that we currently assume in KV that there can be at most one concurrent originator of write intents for a given txn. Distributing DML statements will require changing this assumption.

Andrei used to be the person with the clearest picture of what needs to happen, but Andrei is on PTO right now. Maybe @nvanbenschoten knows as well.

knz commented 3 years ago

There is a special case, which Nathan outlines above: instead of distributing DML statements (so they execute on multiple nodes concurrently, but that is hard as per my explanation above), we could move the entire DML statement to just one other node that's closer to the data being modified.

This would still require some KV changes (in kvclient) but fewer. It would require some more intricate refactors in the SQL execution code to guarantee that there is at most one RootTxn active at a time.

nvanbenschoten commented 3 years ago

> instead of distributing DML statements (so they execute on multiple nodes concurrently, but that is hard as per my explanation above), we could move the entire DML statement to just one other node that's closer to the data being modified.

@jordanlewis and I discussed this yesterday. In a lot of ways, it simplifies the proposal here. Instead of parsing, planning, decomposing, and beginning execution of a mutation and then shipping enough state to a remote node so that we can commit from there (this is absolutely critical), we could proxy the entire SQL statement to the remote node when appropriate.

I don't know what would be best in terms of the representation of the mutation when proxying it to the remote node - it could just be the raw SQL string, or something a little more structured and closer to the eventual query plan. But I think the key difference between this and what we currently think of with DistSQL is that this proxied statement would not be bound to a transaction owned by the originating node. Instead, the transaction would be created on the remote node and its entire lifecycle would be managed by that remote node.
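Purely to make the idea concrete, here is one hypothetical shape such a proxied request could take; the field set (and whether the statement travels as a raw SQL string or as something more structured) is exactly the open question above:

```go
package main

import "fmt"

// delegatedStmt is a hypothetical wire message for proxying an entire mutation
// statement to a remote node; nothing like it exists today.
type delegatedStmt struct {
	SQL         string            // or a more structured, closer-to-plan representation
	Database    string            // enough session/catalog context to plan remotely
	SessionVars map[string]string // whatever subset of session state the stmt needs
}

// The key property: the request carries no transaction handle. The remote node
// creates the transaction itself and owns its entire lifecycle, including commit.
func main() {
	req := delegatedStmt{
		SQL:      "UPDATE t SET v = v + 1 WHERE k = 1",
		Database: "defaultdb",
		SessionVars: map[string]string{
			"application_name": "example",
		},
	}
	fmt.Printf("proxy %q to the region holding the leaseholders\n", req.SQL)
}
```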

knz commented 3 years ago

> Instead, the transaction would be created on the remote node and its entire lifecycle would be managed by that remote node.

This would exclude migrating a DML statement that's part of an explicit SQL txn. That's a pretty major restriction!

Is there a way to ship the entire executor state to the remote node, and then ship it back when the stmt completes?

nvanbenschoten commented 3 years ago

> This would exclude migrating a DML statement that's part of an explicit SQL txn. That's a pretty major restriction!

This is a very good point. It would be such a severe restriction that it would entirely change what we are building. And yet, given that this is primarily a concern about contention (there's also a raw latency concern, but it's less important IMO), there are rapidly diminishing returns for shipping the execution of an UPDATE/DELETE statement to a remote region if we are in an explicit transaction and not about to immediately commit that transaction. If we can't immediately commit, then we are still holding locks across at least one gateway<->leaseholder RTT. So while the more general approach would provide linear speedups in a few cases, the specific case of implicit transactions could provide an exponential speedup because it eliminates all gateway<->leaseholder latency from the contention footprint of contending transactions.

knz commented 3 years ago

Interesting.

The convo needs to rewind a bit because we need a new word to describe what is going on here. These new things are not really distributed queries, nor are they a new way to execute queries within the "standard" txn processing machinery.

This work is defining a new transaction type entirely (besides implicit/explicit). Maybe "delegated transactions"?

I also think this needs a new GH issue because it's a separate problem from that of distributing mutations, which we might still need/want to do later for HTAP support.

In any case, for the implementation there are a few architectural things to consider beyond the "primary" work of routing the queries (planning/execution):

Notes :

knz commented 3 years ago

NB: pumping the contents of all the session vars every time could be a prohibitive overhead too. Maybe what we need is some way to determine, by looking just at the AST, precisely which subset of the session vars will influence planning/execution for that query, and only ship that over.
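A hypothetical illustration of that idea: derive the relevant subset of session vars from the statement's shape and ship only those. The statement kinds and variable names below are invented for the example; the real mapping would have to be derived carefully from the AST:

```go
package main

import "fmt"

// relevantSessionVars is a hypothetical mapping from a statement kind to the
// session variables that could influence its planning/execution.
var relevantSessionVars = map[string][]string{
	"UPDATE": {"timezone", "default_int_size", "vectorize"},
	"DELETE": {"timezone", "vectorize"},
}

// varsToShip returns only the session vars the remote node needs for this
// statement, instead of pumping the entire session over every time.
func varsToShip(stmtKind string, session map[string]string) map[string]string {
	out := make(map[string]string)
	for _, name := range relevantSessionVars[stmtKind] {
		if v, ok := session[name]; ok {
			out[name] = v
		}
	}
	return out
}

func main() {
	session := map[string]string{
		"timezone":         "UTC",
		"vectorize":        "on",
		"application_name": "x", // not relevant to planning, so not shipped
	}
	fmt.Println(varsToShip("UPDATE", session))
}
```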

rytaft commented 3 years ago

Shipping all of these session variables, etc, seems like it would be a lot of engineering effort with the potential to miss something important. I'm curious whether we can reuse some/most of the existing DistSQL logic but just add the ability to start the transaction on a remote node. @yuzefovich curious to hear your thoughts.

knz commented 3 years ago

The txn-starting code is coordinated by the SQL executor AFAIK. This is also where the retry logic is coordinated. If we rely on DistSQL we'd need to copy the SQL executor "under" DistSQL (since it'd still be needed "outside" on the gateway for multi-stmt txns).

yuzefovich commented 3 years ago

[andy woods]: this is a large project that we need to fit into the roadmap, and although it seems important, we can't find space for this issue in the 21.2 time frame, so we're moving it to the backlog for now.

ajwerner commented 2 years ago

One thing that has changed since this issue was last updated is that we have tooling for serializing and deserializing session data reliably.

fabiog1901 commented 1 year ago

A related use-case can be described as follows:

Suppose we have a 3-region cluster. The gateway node receives a multi-statement/CTE query that involves write operations. It creates the plan, looks up the leaseholders involved, and finds that they are ALL in the same region, but NOT the gateway node's region. So the gateway picks a "Txn Executor" node in that region to start the transaction, which helps avoid the otherwise unavoidable hops back and forth between the LH nodes and the gateway node. Once the txn commits, the Txn Executor returns the result to the gateway node, which returns it to the client.

The goal is to avoid multiple roundtrips between the gateway node and LH nodes, which can be very expensive in latency terms if the LHs are outside the gateway's region. For example, for a CTE that wraps 1 UPDATE and 1 INSERT we see 3 roundtrips.
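As a rough illustration of the savings (assuming, hypothetically, a 60ms cross-region RTT between the gateway and the leaseholders' region; the 3-roundtrip count is the one reported above):

```go
package main

import "fmt"

func main() {
	const crossRegionRTT = 60 // ms, gateway <-> leaseholder region (illustrative)

	// Today: the CTE's writes pay ~3 cross-region round trips from the gateway.
	today := 3 * crossRegionRTT

	// With a "Txn Executor" node in the leaseholders' region: one cross-region
	// round trip to delegate the statement and get the result back; the
	// per-statement hops stay within the region.
	delegated := 1 * crossRegionRTT

	fmt.Printf("~%dms today vs ~%dms delegated (illustrative numbers)\n", today, delegated)
}
```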
