cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql, kv: support non-default key locking with leaf txns #94290

Open Nican opened 1 year ago

Nican commented 1 year ago

With #94399 we no longer choose to use the leaf txn (which would allow us to use the streamer API and parallelize local scans in some cases) if we find non-default key locking strength. This stems from the fact that we don't propagate lock spans for leaf txns, and even if we did, we'd run into another complication (quoting Nathan):

The other complication that we would have run into with this approach is that only the root txn coordinator runs a heartbeat loop to keep its locks alive. If we only acquired locks in leaf txns, we would never have started the txn heartbeat and the txn could have been aborted after 5s. We could have launched the heartbeat when merging the LeafTxnFinalState, but that still only works if we call UpdateRootWithLeafFinalState within 5s of the initial lock being acquired.

I think this again hints at our implementation of explicit row-level locking and pushing that locking directly into scans (in all cases) being a little funky (see #75457 and #57031). If we had a separate RowLock execution node that lived above a scan/filter/limit/etc. then we could run that using the RootTxn and distribute the rest of the query as much as we'd like. We're going to have to do part of that work for read-committed.

We should figure out how to lift this restriction.
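For concreteness, here is a minimal sketch of the statement shape that hits this restriction, reusing the reproduction from later in this thread: a locking read over a secondary index whose primary-index lookup is planned as an index join.

```sql
-- With non-default locking strength (FOR UPDATE), the leaf txn / streamer
-- path is currently skipped for index-join plans like this one.
CREATE TABLE a (a INT, b INT, c INT, PRIMARY KEY (a), UNIQUE INDEX (b));
INSERT INTO a VALUES (1, 2, 3);
SELECT * FROM a WHERE b = 2 FOR UPDATE;
```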

Original issue description

*Edit from @jordanlewis: See https://github.com/cockroachdb/cockroach/issues/94290#issuecomment-1366343646 for a trivial repro.*

*To work around this problem in 22.2.0 and 22.2.1, run the following:*

```
SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false;
```

----

**Describe the problem**

It looks like `v22.2.1` has a regression with `FOR UPDATE`: a lock appears to be held on the row for 5 seconds after the transaction finishes. Several of my e2e tests are now failing due to timeout. I drilled into one of the tests. Running EXPLAIN ANALYZE on the SELECT query after the transaction finishes shows `cumulative time spent in KV: 4.9s` and `cumulative time spent due to contention: 4.9s`. A debug zip from EXPLAIN ANALYZE (DEBUG) of the query is attached below. Removing `for update` fixes the issue, as does reverting to `v22.1.8`.

**To Reproduce**

The test in my codebase fails pretty consistently. From what I can tell, the sequence of events is:

1. Transaction starts.
2. Transaction runs `select * from tbl where id = 1 for update`.
3. The row in `tbl` is NOT updated inside the transaction, but data in unrelated tables is updated/inserted.
4. Transaction completes successfully.
5. (Only run after the transaction is complete) `select * from tbl` now takes 5 seconds to run.

**Expected behavior**

Running `SELECT * FROM tbl;` when there is no other workload on the database should take a few milliseconds.

**Additional data / screenshots**

[stmt-bundle-825387661464305665.zip](https://github.com/cockroachdb/cockroach/files/10300490/stmt-bundle-825387661464305665.zip)

**Environment:**

- CockroachDB version: v22.2.1
- Server OS: Linux on Docker
- Client app: Sequelize

**Additional context**

E2E tests (including clicking on buttons) are failing due to timeout.
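For anyone applying the workaround above, the current value of the setting can be checked before and after changing it; a short sketch using the standard cluster-setting statements:

```sql
-- Check whether the streamer is currently enabled (it is on by default in 22.2).
SHOW CLUSTER SETTING sql.distsql.use_streamer.enabled;

-- Workaround from the issue description: disable the streamer cluster-wide.
SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false;
```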

Jira issue: CRDB-22800

blathers-crl[bot] commented 1 year ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

rafiss commented 1 year ago

Thanks for the report! Have you tried with v22.2.0?

Nican commented 1 year ago

@rafiss Thanks for the response. I should have tried that.

I tried v22.2.0, v22.2.0-rc.1, v22.2.0-beta.1, and v22.2.0-alpha.1, and they are all broken, displaying the same 5s contention behavior.

v22.1.12 works as expected.

Nican commented 1 year ago

Hello @rafiss,

I was able to create a minimal repro of the issue. Note that each set of commands needs to be run on a separate connection. This is running v22.2.1.

Create table:

```sql
-- drop table if exists dealers cascade;
CREATE TABLE dealers (
    "organizationId" UUID NOT NULL,
    "conventionId" INT8 NOT NULL,
    id INT8 NOT NULL,
    "userId" INT8 NULL,
    CONSTRAINT dealers_pkey PRIMARY KEY ("organizationId" ASC, id ASC),
    UNIQUE INDEX "dealers_userId_conventionId_key" ("userId" ASC, "conventionId" ASC),
    UNIQUE INDEX "dealers_conventionId_key" ("conventionId" ASC, id ASC)
);
```

Connection 1 (10ms):

```sql
insert into dealers values ('00000000-0000-0000-0000-000000000000', 5427958, 1, 1);
```

Connection 2 (Takes 5100ms to run):

```sql
START TRANSACTION;
    select * from dealers where "conventionId" = 5427958 and id = 1 for update;
COMMIT;
select * from dealers;
```

Oddly, removing the `dealers_userId_conventionId_key` index from the table also fixes the issue, as does removing `for update`.

Please let me know if you are able to repro the issue.

jordanlewis commented 1 year ago

I can't repro this on 22.2.1. What is the exact order/interleaving of the commands you're running on each connection?

jordanlewis commented 1 year ago

OK, I could reproduce it. The two-connection setup is a red herring; you can repro on one connection like this:

```
root@127.0.0.1:26257/defaultdb> create table a (a int, b int, c int, primary key(a), unique index(b));
root@127.0.0.1:26257/defaultdb> insert into a values(1,2,3);
root@127.0.0.1:26257/defaultdb> select * from a where b = 2 for update;
  a | b | c
----+---+----
  1 | 2 | 3
(1 row)

Time: 3ms total (execution 3ms / network 0ms)

root@127.0.0.1:26257/defaultdb> select * from a where b = 2 for update;
  a | b | c
----+---+----
  1 | 2 | 3
(1 row)

Time: 4.542s total (execution 4.541s / network 0.001s)
```

Once you have the select for update in the shell, just hit up/enter quickly and you'll see the slow query.

I think this has to do with the FOR UPDATE locking strength on the induced index join somehow; the issue doesn't repro unless the plan has an index join.
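One cheap way to check that hypothesis is a plain EXPLAIN, which plans the statement without executing it (so it doesn't block on the lock). The EXPLAIN ANALYZE output below shows the shape in question: an index join over a@a_pkey fed by a scan of a@a_b_key, both with locking strength for update.

```sql
-- Plan-shape check only; EXPLAIN without ANALYZE does not execute the
-- statement, so it neither acquires locks nor waits on contention.
EXPLAIN SELECT * FROM a WHERE b = 2 FOR UPDATE;
```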

EXPLAIN ANALYZE output on the slow iteration looks like this:

```
root@127.0.0.1:26257/defaultdb> explain analyze select * from a where b = 2 for update;
                                        info
------------------------------------------------------------------------------------
  planning time: 614µs
  execution time: 4.5s
  distribution: local
  vectorized: true
  rows read from KV: 2 (23 B, 2 gRPC calls)
  cumulative time spent in KV: 4.5s
  cumulative time spent due to contention: 4.5s
  maximum memory usage: 60 KiB
  network usage: 0 B (0 messages)
  regions: us-east1

  • index join
  │ nodes: n1
  │ regions: us-east1
  │ actual row count: 1
  │ KV time: 1ms
  │ KV contention time: 0µs
  │ KV rows read: 1
  │ KV bytes read: 13 B
  │ KV gRPC calls: 1
  │ estimated max memory allocated: 30 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 1
  │ table: a@a_pkey
  │ locking strength: for update
  │
  └── • scan
        nodes: n1
        regions: us-east1
        actual row count: 1
        KV time: 4.5s
        KV contention time: 4.5s
        KV rows read: 1
        KV bytes read: 10 B
        KV gRPC calls: 1
        estimated max memory allocated: 20 KiB
        estimated row count: 1 (100% of the table; stats collected 47 seconds ago)
        table: a@a_b_key
        spans: [/2 - /2]
        locking strength: for update
(40 rows)
```

Note the 4.5 second "contention time".

jordanlewis commented 1 year ago

Bisected to b72c10955f5fab3f4d069568169d09a1934cbe15 (#77878), which unfortunately doesn't tell us much. Is the streamer using APIs incorrectly, or is there a bug in the API that the streamer is using?

cc @yuzefovich

Bisect log:

```
git bisect start
# status: waiting for both good and bad commits
# good: [3b76f78d724dfc1e7bc8d697f5a7de960d8d1e98] Merge pull request #93143 from adityamaru/show-backup-speedup
git bisect good 3b76f78d724dfc1e7bc8d697f5a7de960d8d1e98
# status: waiting for bad commit, 1 good commit known
# bad: [77667a1b0101cd323090011f50cf910aaa933654] Merge pull request #91926 from cockroachdb/blathers/backport-release-22.2.0-91911
git bisect bad 77667a1b0101cd323090011f50cf910aaa933654
# good: [a1c1879e01ceee79a81693c67a1dba184b5fc1b1] Merge #77561 #77775
git bisect good a1c1879e01ceee79a81693c67a1dba184b5fc1b1
# bad: [8edc7ba73079eee467a8b9348260e01c703ac96c] Merge #83941
git bisect bad 8edc7ba73079eee467a8b9348260e01c703ac96c
# bad: [ce7b1972441f8eaf5aa7772c56250b2d83806cab] Merge #80706
git bisect bad ce7b1972441f8eaf5aa7772c56250b2d83806cab
# bad: [94fe8ddfaec046ac3a769506d65e98ce8d674048] Merge #79038
git bisect bad 94fe8ddfaec046ac3a769506d65e98ce8d674048
# bad: [7d941bbf95172d423c3fbfb23d554471e113c639] Merge #77785 #78266 #78268 #78470
git bisect bad 7d941bbf95172d423c3fbfb23d554471e113c639
# bad: [5bd50f5659a3ca88e8ad378be40c2f5a10496ec8] Merge #78276
git bisect bad 5bd50f5659a3ca88e8ad378be40c2f5a10496ec8
# bad: [0b6d5024573cd39b7dcf8c93cc9dce5eaa38312d] Merge #77993
git bisect bad 0b6d5024573cd39b7dcf8c93cc9dce5eaa38312d
# good: [8179b6f996149ef9e04eb62f9c95ecc992b9f30e] Merge #77853
git bisect good 8179b6f996149ef9e04eb62f9c95ecc992b9f30e
# bad: [d063fcf94d83ed330b548c4fd3e0c4db6a20ce10] opt: add missing documentation for opt tester options
git bisect bad d063fcf94d83ed330b548c4fd3e0c4db6a20ce10
# good: [6c1e2c29ddcf4ac190eb91f82ca530392ef5078e] roachprod: update thrift artifact url for use with charybdefs
git bisect good 6c1e2c29ddcf4ac190eb91f82ca530392ef5078e
# good: [f43648aeea968840c3ea9932eb8e3e13f45140c5] sql: use IndexFetchSpec for inverted joiner
git bisect good f43648aeea968840c3ea9932eb8e3e13f45140c5
# bad: [1a35e55e854ddb90632ac581b10a593e6e4f07d9] Merge #77875 #77878 #77968 #77995 #78013 #78023
git bisect bad 1a35e55e854ddb90632ac581b10a593e6e4f07d9
# good: [242118f0ad421dc949108165c61967e4ff697d82] roachtest: gracefully fail in sst-corruption test
git bisect good 242118f0ad421dc949108165c61967e4ff697d82
# bad: [b72c10955f5fab3f4d069568169d09a1934cbe15] kvstreamer: re-enable streamer by default
git bisect bad b72c10955f5fab3f4d069568169d09a1934cbe15
# first bad commit: [b72c10955f5fab3f4d069568169d09a1934cbe15] kvstreamer: re-enable streamer by default

commit b72c10955f5fab3f4d069568169d09a1934cbe15
Author: Yahor Yuzefovich
Date:   Tue Mar 15 19:53:57 2022 -0700

    kvstreamer: re-enable streamer by default

    Release note: None

 pkg/sql/kvstreamer/large_keys_test.go |  3 ---
 pkg/sql/kvstreamer/streamer_test.go   | 12 ++----------
 pkg/sql/mem_limit_test.go             |  7 -------
 pkg/sql/row/kv_batch_streamer.go      |  2 +-
 4 files changed, 3 insertions(+), 21 deletions(-)
```

jordanlewis commented 1 year ago

Sure enough, disabling the streamer cluster setting removes the problem:

```
demo@127.0.0.1:26257/defaultdb> SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false;
demo@127.0.0.1:26257/defaultdb> select * from a where b = 2 for update;

  a | b | c
----+---+----
  1 | 2 | 3
(1 row)

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> select * from a where b = 2 for update;

  a | b | c
----+---+----
  1 | 2 | 3
(1 row)

Time: 4ms total (execution 4ms / network 0ms)
```

EXPLAIN ANALYZE:

```
demo@127.0.0.1:26257/defaultdb> explain analyze select * from a where b = 2 for update;

                                       info
----------------------------------------------------------------------------------
  planning time: 251µs
  execution time: 1ms
  distribution: local
  vectorized: true
  rows read from KV: 2 (27 B, 2 gRPC calls)
  cumulative time spent in KV: 1ms
  maximum memory usage: 50 KiB
  network usage: 0 B (0 messages)
  estimated RUs consumed: 0

  • index join
  │ nodes: n1
  │ actual row count: 1
  │ KV time: 391µs
  │ KV contention time: 0µs
  │ KV rows read: 1
  │ KV bytes read: 15 B
  │ KV gRPC calls: 1
  │ estimated max memory allocated: 20 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 1
  │ table: a@a_pkey
  │ locking strength: for update
  │
  └── • scan
        nodes: n1
        actual row count: 1
        KV time: 738µs
        KV contention time: 0µs
        KV rows read: 1
        KV bytes read: 12 B
        KV gRPC calls: 1
        estimated max memory allocated: 20 KiB
        estimated row count: 1 (100% of the table; stats collected 1 minute ago)
        table: a@a_b_key
        spans: [/2 - /2]
        locking strength: for update
(37 rows)

Time: 3ms total (execution 3ms / network 0ms)
```

yuzefovich commented 1 year ago

I think the difference here is that the streamer is using LeafTxn (since there might be concurrency going on) whereas the non-streamer code path uses RootTxn. (I confirmed this by hacking the code for the streamer to also use the root, and the problem disappeared.)

The behavior is that locks acquired via the leaf txn are only released about 4.5s after the txn that acquired them committed. Maybe this was already mentioned elsewhere, but the problematic stmt is not the second statement that blocks (which can use either the streamer or the non-streamer code path), but the first stmt that used the streamer.

I think we need KV expertise here. Perhaps @nvanbenschoten can share some wisdom.
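One way to observe the lingering lock directly is to inspect the lock table while the follow-up statement is blocked; a sketch, assuming the crdb_internal.cluster_locks virtual table is available in this version:

```sql
-- Run from another connection while the second SELECT is blocked; this
-- should list the leftover FOR UPDATE lock on table a along with the
-- waiting reader contending on it.
SELECT * FROM crdb_internal.cluster_locks;
```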

yuzefovich commented 1 year ago

@jordanlewis what's your take on the severity of this issue? Should we block 22.2.2 release? Should we consider disabling the streamer in 22.2.2? I'm tentatively adding the corresponding release blocker labels.

knz commented 1 year ago

~I thought the API contract was that a LeafTxn cannot perform writes, including placing locks? That's what I remember from the txn / txncoordsender tech note.~

Nvm, it can place lock spans. But they need to be "imported" into the RootTxn at the end.

DrewKimball commented 1 year ago

> But they need to be "imported" into the RootTxn at the end.

That should happen during metadata draining here where GetLeafTxnFinalState is called by the index join and then here when UpdateRootWithLeafFinalState is called in the DistSQLReceiver, right? But the result of GetLeafTxnFinalState has no lock spans when I try the example.

yuzefovich commented 1 year ago

GetLeafTxnFinalState is only about the refresh spans (at least right now). Perhaps there is a gap from when FOR UPDATE was implemented: it was assumed that we'd always be using the root txn (since FOR UPDATE implies the txn will be performing writes, which would prohibit the use of the leaf txn), but this assumption is not verified, and the usage of the streamer breaks it.

yuzefovich commented 1 year ago

Probably the best quick fix is to examine the plan to see whether non-default key-locking is used and to not use the streamer if so. I'll work on that today.

knz commented 1 year ago

Is it not simpler to extend GetLeafTxnFinalState to also include the lock spans, if there are any?

yuzefovich commented 1 year ago

Thanks. Indeed, that seems simple enough.

knz commented 1 year ago

maybe double check that UpdateRootWithLeafFinalState merges the lock spans too

yuzefovich commented 1 year ago

@Nican thanks for filing the issue, and @jordanlewis thanks for finding the concise reproduction. We have merged the fixes to the master and 22.2 branches, and they should be included in the 22.2.2 release.

This issue now becomes about figuring out how to lift the restriction introduced in #94399. In particular, we either need to propagate the lock spans across leaf txns or re-implement how we do row-level locking. I'll update the issue description accordingly.