
roachtest: filesystem_simulator child does not exist in files failure #36022

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/filesystem_simulator/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
    cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1189954-scaledata-filesystem-simulator-nodes-3:4 -- ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.142.0.58:26257,10.142.0.66:26257,10.142.0.46:26257'  returned:
        stderr:
        ated file 12_244 with uuid 21799f7c-53d1-4378-8864-598e9da28f91 and parent /default
        2019/03/21 13:05:34 Created file 12_244 with uuid 5162ffbf-3a2f-4739-8287-383b0caa0654 and parent /default
        2019/03/21 13:05:34 RobustDB.RandomDB chose DB at index 2
        2019/03/21 13:05:34 Created file 14_270 with uuid 3800ed30-06f0-47b5-8902-65ea2fa513de and parent /default
        2019/03/21 13:05:34 Created file 14_270 with uuid bde82c92-0763-4d9c-875f-739279036dce and parent /default
        2019/03/21 13:05:34 RobustDB.RandomDB chose DB at index 1
        2019/03/21 13:05:34 Created file 2_256 with uuid 82d966b6-b090-40db-b37e-d08a2c0aa7f8 and parent /default
        2019/03/21 13:05:34 Writing new stripe 0
        2019/03/21 13:05:34 &{c8f4cfdb-c31d-471e-ba17-5b4131dcc9dd 0 default}
        2019/03/21 13:05:34 Consistency Test sizes :- files - 1630, childRelations - 1630, stripes - 277
        2019/03/21 13:05:34 ChildRelation {/default 12_241 1d4c157f-11ca-453a-b04f-fbb41ba02f11 default}: 1d4c157f-11ca-453a-b04f-fbb41ba02f11 child does not exist in files
        Error:  exit status 255

        stdout:
        : exit status 1
nvanbenschoten commented 5 years ago

This doesn't look good. We see a consistency check transaction find that a file we previously logged as inserted is missing.

2019/03/21 13:05:32 Created file 12_241 with uuid 1d4c157f-11ca-453a-b04f-fbb41ba02f11 and parent /default
...
2019/03/21 13:05:34 ChildRelation {/default 12_241 1d4c157f-11ca-453a-b04f-fbb41ba02f11 default}: 1d4c157f-11ca-453a-b04f-fbb41ba02f11 child does not exist in files

We insert the files row here and then insert the child_relation row referencing (without an fk) the file here.

One possibility is that we're seeing a form of causal reverse, where the child_relation row is inserted at a lower timestamp than the corresponding files row. Because there is no foreign key between the two and they are inserted in different transactions, it doesn't seem impossible. The consistency check could be running concurrently with the two insertions, which is the only other requirement for causal reverse.
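
To make the theory concrete, here is a toy sketch (purely illustrative; integer timestamps stand in for HLC timestamps, and this is not the workload or CockroachDB code) of how a causal reverse would look to a concurrent reader: the child_relations write lands at a lower MVCC timestamp than the files write it causally follows, so a snapshot between the two timestamps sees the child but not the file.

package main

import "fmt"

// toyDB is a toy MVCC store: each key maps to the timestamp at which it
// became visible. Integer timestamps stand in for HLC timestamps.
type toyDB map[string]int64

// visibleAt reports whether key is visible in a snapshot taken at ts.
func (db toyDB) visibleAt(key string, ts int64) bool {
	commitTS, ok := db[key]
	return ok && commitTS <= ts
}

func main() {
	db := toyDB{}

	// Real-time order: the files row is written first, the child_relations
	// row second. In a causal reverse, the later write ends up at a *lower*
	// MVCC timestamp because nothing (no FK, different transactions) forces
	// an ordering between them.
	db["files/1d4c157f"] = 105           // written first in real time
	db["child_relations/1d4c157f"] = 100 // written second, lower timestamp

	// A consistency check reading at an intermediate snapshot sees the child
	// relation without its parent file -- exactly the symptom above.
	const checkTS = 102
	fmt.Println("child visible:", db.visibleAt("child_relations/1d4c157f", checkTS)) // true
	fmt.Println("file visible: ", db.visibleAt("files/1d4c157f", checkTS))           // false
}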

I'm filing this as a release blocker, although the resolution may just be convincing ourselves that this behavior is in line with our consistency model and fixing the test.

nvanbenschoten commented 5 years ago

and they are inserted in different transactions

Never mind, this isn't true. The rows are inserted in the same transaction.

One interesting point is that the transaction inserted the file and then updated it here. Both operations succeeded and both claim to have updated a single row.

I wonder if this is another follower read gone bad. How should follower reads interact with uncertainty intervals?

nvanbenschoten commented 5 years ago

CHAOS.log has only one line:

CHAOS: 13:05:37 chaos.go:65: chaos stopping: context canceled

This came after the consistency check failure at 13:05:34, so we can rule out chaos as a possible cause.

bdarnell commented 5 years ago

I wonder if this is another follower read gone bad. How should follower reads interact with uncertainty intervals?

I think follower reads can only be safe if the entire uncertainty window is less than the closed timestamp. I'm not sure whether we can use ObservedTimestamps for this at all or if we have to use the worst-case MaxTimestamp for the transaction.
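
As a rough sketch of that condition (hypothetical names and types; not the actual kvserver code), a follower could only serve the read if the transaction's worst-case uncertainty bound, its MaxTimestamp, is at or below the closed timestamp the follower knows about:

package main

import "fmt"

// Timestamp is a stand-in for an HLC timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) LessEq(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical <= o.Logical)
}

// canServeFollowerRead sketches the safety condition above: the read is only
// safe on a follower if the transaction's entire uncertainty window, bounded
// conservatively by MaxTimestamp, lies at or below the follower's known
// closed timestamp, so no committed write inside the window can be missed.
func canServeFollowerRead(txnMaxTimestamp, closedTimestamp Timestamp) bool {
	return txnMaxTimestamp.LessEq(closedTimestamp)
}

func main() {
	closed := Timestamp{WallTime: 1000}
	fmt.Println(canServeFollowerRead(Timestamp{WallTime: 900}, closed))  // true: safe to serve
	fmt.Println(canServeFollowerRead(Timestamp{WallTime: 1100}, closed)) // false: must go to the leaseholder
}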

nvanbenschoten commented 5 years ago

I ran into a variant of this failure last night. The consistency check hit the following error:

2019/03/27 04:43:54 Created file 6_285 with uuid 630c9920-766d-4611-9ef0-f7b6d14c6270 and parent /default
2019/03/27 04:43:54 RobustDB.RandomDB chose DB at index 2
2019/03/27 04:43:54 Created file 14_298 with uuid ca8fa2f0-e1a3-47d5-ba68-bd55b10618fd and parent /default
2019/03/27 04:43:54 Created file 14_298 with uuid a6ffd560-b143-4171-bc73-e197189eea4e and parent /default
2019/03/27 04:43:54 RobustDB.RandomDB chose DB at index 2
2019/03/27 04:43:54 Created file 15_285 with uuid be7e31bc-2205-4702-9ba5-026897b22751 and parent /default
2019/03/27 04:43:54 Created file 15_285 with uuid 7015f1b6-30f7-485a-94b7-229a512ad343 and parent /default
2019/03/27 04:43:54 RobustDB.RandomDB chose DB at index 1
2019/03/27 04:43:54 Created file 10_322 with uuid 49258fdd-d185-4f8e-a316-cca37284b217 and parent /default
2019/03/27 04:43:54 Consistency Test sizes :- files - 1930, childRelations - 1928, stripes - 295
2019/03/27 04:43:54 2b363034-cdcd-432a-bba7-7b2cefb236c9 is parentless

The logs revealed that file 2b363034 was deleted by removeFile a few seconds earlier. That transaction runs DELETE statements against the stripes, child_relations, and files tables. The consistency check found that file 2b363034 was missing from the child_relations table but not the files table.

I started by digging into the files and child_relations tables. In a consistent state, the files table should have one more row than the child_relations table. We see above that this was not the case during the consistency check (1930 vs. 1928). By the time I was able to dig in, the files table looked to have 1934 rows and the child_relations table looked to have 1933 rows. It appeared that the consistency violation had corrected itself. However, I could still see that file 2b363034 was only in the files table. This didn't make much sense, because it meant that there had to be a row only in the child_relations table to balance out the counts. My attempts at finding this row were unsuccessful, and I began doubting that I understood how FULL OUTER JOINs worked:

select c.* from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid;

...
  /default    | 9_98   | cbe8686e-26c8-4a1e-8f1f-c63d18c797b2 | default
  NULL        | NULL   | NULL                                 | NULL
(1934 rows)

select f.*, c.* from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid;

...
  /default    | 9_98   | cbe8686e-26c8-4a1e-8f1f-c63d18c797b2 | default | cbe8686e-26c8-4a1e-8f1f-c63d18c797b2 |    2 | 4096 |         0 | default
  NULL        | NULL   | NULL                                 | NULL    | /default                             |    2 | 4096 |       256 | default
  NULL        | NULL   | NULL                                 | NULL    | 2b363034-cdcd-432a-bba7-7b2cefb236c9 |    1 |    0 |       108 | default
(1935 rows)

select count(*) from [select f.*, c.* from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid];

  count
+-------+
   1934

Notice the 1935 rows when we include f.* in the projection. It drops to 1934 rows when we select the count from this join or when we drop f.* from the projection. I wondered if there was a bug in the optimizer, which led me to the following finding:

root@localhost:26257/fs> SET OPTIMIZER=ON;
SET

root@localhost:26257/fs> explain select c.child_uuid from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid;
       tree      |       field       |          description
+----------------+-------------------+--------------------------------+
  render         |                   |
   └── hash-join |                   |
        │        | type              | full outer
        │        | equality          | (uuid) = (child_uuid)
        │        | left cols are key |
        ├── scan |                   |
        │        | table             | files@idx_sub_shard
        │        | spans             | ALL
        └── scan |                   |
                 | table             | child_relations@idx_child_uuid
                 | spans             | ALL
(11 rows)

root@localhost:26257/fs> select count(*) from [select c.child_uuid from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid];
  count
+-------+
   1934
(1 row)

root@localhost:26257/fs> SET OPTIMIZER=OFF;
SET

root@localhost:26257/fs> explain select c.child_uuid from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid;
       tree      |  field   |       description
+----------------+----------+-------------------------+
  render         |          |
   └── hash-join |          |
        │        | type     | full outer
        │        | equality | (child_uuid) = (uuid)
        ├── scan |          |
        │        | table    | child_relations@primary
        │        | spans    | ALL
        └── scan |          |
                 | table    | files@primary
                 | spans    | ALL
(10 rows)

root@localhost:26257/fs> select count(*) from [select c.child_uuid from child_relations c FULL OUTER JOIN files f on f.uuid = c.child_uuid];
  count
+-------+
   1935
(1 row)

Depending on the index used in the files table, we got different results. In isolation, I could see the following inconsistency:

root@localhost:26257/fs> select count(*) from files@primary;
  count
+-------+
   1935
(1 row)

root@localhost:26257/fs> select count(*) from files@idx_shard;
  count
+-------+
   1934
(1 row)

root@localhost:26257/fs> select count(*) from files@idx_sub_shard;
  count
+-------+
   1934
(1 row)

root@localhost:26257/fs> select f1.*, f2.* from files@primary f1 left join files@idx_sub_shard f2 on f1.uuid = f2.uuid where f2.uuid is null;
                  uuid                 | type | size | sub_shard |  shard  | uuid | type | size | sub_shard | shard
+--------------------------------------+------+------+-----------+---------+------+------+------+-----------+-------+
  2b363034-cdcd-432a-bba7-7b2cefb236c9 |    1 |    0 |       108 | default | NULL | NULL | NULL |      NULL | NULL

So file 2b363034 was deleted from files@idx_shard and files@idx_sub_shard, but not files@primary. That's really bad. We're seeing a transaction that was clearly non-atomic.

I'm running again with logging around transaction pushes and intent resolution. I also turned off Raft log truncation, which should help piece together what went wrong after the fact. I just hit the same failure again after about 128 runs with this extra debug information.

nvanbenschoten commented 5 years ago

In the new failure, I see that both files@primary and files@idx_sub_shard are missing the deletion tombstone for the row:

root@localhost:26257/fs> select count(*) from files@primary;
  count
+-------+
   8505
(1 row)

root@localhost:26257/fs> select count(*) from files@idx_shard;
  count
+-------+
   8504
(1 row)

root@localhost:26257/fs> select count(*) from files@idx_sub_shard;
  count
+-------+
   8505
(1 row)

The logging I added around intent resolution explains why:

I190327 16:30:45.318701 171920 storage/engine/mvcc.go:2317  [n2,s2,r21/2:/Table/5{3-4}] resolving intent {Span:/Table/53/2/50/"a055ee54-cf2f-401f-935e-63f431050f3b"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:4 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:ABORTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:4 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (false,false)
I190327 16:30:45.327285 171933 storage/engine/mvcc.go:2317  [n2,s2,r21/2:/Table/5{3-4}] resolving intent {Span:/Table/53/1/"a055ee54-cf2f-401f-935e-63f431050f3b"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:6 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:ABORTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:6 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (false,false)
I190327 16:30:45.473130 224749 storage/engine/mvcc.go:2317  [n1,s1,r22/1:/Table/5{4-5}] resolving intent {Span:/Table/54/1/"/default"/"0_1187"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:COMMITTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:3 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (true,false)
I190327 16:30:45.473221 224749 storage/engine/mvcc.go:2317  [n1,s1,r22/1:/Table/5{4-5}] resolving intent {Span:/Table/54/2/"a055ee54-cf2f-401f-935e-63f431050f3b"/"/default"/"0_1187"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:COMMITTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:1 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (true,false)
I190327 16:30:45.473314 224749 storage/engine/mvcc.go:2317  [n1,s1,r22/1:/Table/5{4-5}] resolving intent {Span:/Table/54/3/"default"/"/default"/"0_1187"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:COMMITTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:2 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (true,false)
W190327 16:30:45.480944 172096 storage/engine/mvcc.go:2257  [n2,s2,r21/2:/Table/5{3-4}] unable to find value for /Table/53/1/"a055ee54-cf2f-401f-935e-63f431050f3b"/0 @ 1553704245.286560684,0: 1553704215.209050672,0 (txn={ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
W190327 16:30:45.481012 172096 storage/engine/mvcc.go:2257  [n2,s2,r21/2:/Table/5{3-4}] unable to find value for /Table/53/2/50/"a055ee54-cf2f-401f-935e-63f431050f3b"/0 @ 1553704245.286560684,0: 1553704215.209050672,0 (txn={ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0})
I190327 16:30:45.481115 172096 storage/engine/mvcc.go:2317  [n2,s2,r21/2:/Table/5{3-4}] resolving intent {Span:/Table/53/3/"default"/"a055ee54-cf2f-401f-935e-63f431050f3b"/0 Txn:{ID:f774ea66-5428-4e87-ae8c-1501c071d149 Key:[190 138 18 97 48 53 53 101 101 53 52 45 99 102 50 102 45 52 48 49 102 45 57 51 53 101 45 54 51 102 52 51 49 48 53 48 102 51 98 0 1 18 47 100 101 102 97 117 108 116 0 1 18 48 95 49 49 56 55 0 1 136] Epoch:0 Timestamp:1553704245.286560684,0 Priority:255407 Sequence:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:COMMITTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:f774ea66-5428-4e87-ae8c-1501c071d149 key:"\276\212\022a055ee54-cf2f-401f-935e-63f431050f3b\000\001\022/default\000\001\0220_1187\000\001\210" timestamp:<wall_time:1553704245286560684 > priority:255407 sequence:5 > timestamp:<wall_time:1553704245286560684 logical:0 > deleted:true key_bytes:12 val_bytes:0  (true,false)

The first two log lines reveal that the two missing keys had intents that were resolved with an ABORTED status. The transaction then went on to commit (with a txn record on r22). When it went to resolve the two missing keys as COMMITTED, it observed that they were missing and logged the unable to find value warning.
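
As a toy model of that sequence (not the actual MVCC code), resolving an intent as ABORTED discards the provisional value, so a later COMMITTED resolution for the same key finds nothing left to commit, which is exactly when the warning fires:

package main

import "fmt"

// toyIntents maps keys to provisional (intent) values. A real store keeps
// these alongside committed MVCC versions; that detail is elided here.
type toyIntents map[string]string

// resolve sketches intent resolution: ABORTED discards the provisional value,
// COMMITTED would promote it to a committed version (elided). If the intent
// is already gone, there is nothing to do, which is when the real code logs
// "unable to find value".
func (in toyIntents) resolve(key, status string) {
	if _, ok := in[key]; !ok {
		fmt.Printf("unable to find value for %s (resolving as %s)\n", key, status)
		return
	}
	delete(in, key)
	fmt.Printf("intent at %s resolved as %s\n", key, status)
}

func main() {
	in := toyIntents{"/Table/53/1/...": "deletion tombstone"}

	// A pusher resolves the intent as ABORTED, believing the txn can never commit...
	in.resolve("/Table/53/1/...", "ABORTED")
	// ...then the transaction commits anyway and resolves the same key as COMMITTED.
	in.resolve("/Table/53/1/...", "COMMITTED")
}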

I added logging around transaction pushes, which didn't fire for this transaction. However, I missed the case where a transaction record should never be writable. But if that case fired, then how did the transaction eventually commit? I'll add logging back around here and try to repro.

Another wild theory is that a QueryTxn is going to a follower and being processed as a follower read. That might cause some issues with our detection of aborted transactions here. At a minimum, it would undermine deadlock detection. @ajwerner do you happen to know off the top of your head if this is possible? We may need to constrain valid follower read requests to just those where roachpb.IsTransactional(req) == true. I'll add logging there as well.

nvanbenschoten commented 5 years ago

I got another reproduction with this. This time I had enough logging to get a good idea of what's going on. The logging reveals that transaction 27a9a11f attempted to delete a number of keys but failed to delete one of them. We see that it wrote the keys and then got hit by a series of PushTxn requests:

I190328 13:29:22.666332 117734 storage/batcheval/transaction.go:145  can create (synth) false (ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY) {27a9a11f-8765-437b-a6a2-9a7488c5a292 [190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] 0 1553779762.501650162,0 1747296 6 {} 0}
I190328 13:29:22.666380 117734 storage/batcheval/cmd_push_txn.go:165  [n1,s1,r63/1:/Table/54/{1/"/def…-2/"807e…}] push uncommittable PUSH_TIMESTAMP id=27a9a11f key=/Table/54/2/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/"/default"/"9_429"/0 rw=true pri=0.08136481 stat=ABORTED epo=0 ts=1553779762.501650162,0 orig=1553779762.501650162,0 max=0.000000000,0 wto=false seq=6
I190328 13:29:22.695978 117742 storage/batcheval/transaction.go:145  can create (synth) false (ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY) {27a9a11f-8765-437b-a6a2-9a7488c5a292 [190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] 0 1553779762.501650162,0 1747296 6 {} 0}
I190328 13:29:22.696040 117742 storage/batcheval/cmd_push_txn.go:165  [n1,s1,r63/1:/Table/54/{1/"/def…-2/"807e…}] push uncommittable PUSH_TIMESTAMP id=27a9a11f key=/Table/54/2/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/"/default"/"9_429"/0 rw=true pri=0.08136481 stat=ABORTED epo=0 ts=1553779762.501650162,0 orig=1553779762.501650162,0 max=0.000000000,0 wto=false seq=6
I190328 13:29:22.699161 117705 storage/batcheval/transaction.go:145  can create (synth) false (ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY) {27a9a11f-8765-437b-a6a2-9a7488c5a292 [190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] 0 1553779762.501650162,0 1747296 6 {} 0}
I190328 13:29:22.699213 117705 storage/batcheval/cmd_push_txn.go:165  [n1,s1,r63/1:/Table/54/{1/"/def…-2/"807e…}] push uncommittable PUSH_TIMESTAMP id=27a9a11f key=/Table/54/2/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/"/default"/"9_429"/0 rw=true pri=0.08136481 stat=ABORTED epo=0 ts=1553779762.501650162,0 orig=1553779762.501650162,0 max=0.000000000,0 wto=false seq=6
I190328 13:29:22.700728 117763 storage/batcheval/transaction.go:145  can create (synth) false (ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY) {27a9a11f-8765-437b-a6a2-9a7488c5a292 [190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] 0 1553779762.501650162,0 1747296 6 {} 0}
I190328 13:29:22.700774 117763 storage/batcheval/cmd_push_txn.go:165  [n1,s1,r63/1:/Table/54/{1/"/def…-2/"807e…}] push uncommittable PUSH_TIMESTAMP id=27a9a11f key=/Table/54/2/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/"/default"/"9_429"/0 rw=true pri=0.08136481 stat=ABORTED epo=0 ts=1553779762.501650162,0 orig=1553779762.501650162,0 max=0.000000000,0 wto=false seq=6

So the PushTxn requests are hitting n1,r63 and determining that the transaction is uncommittable after finding that the low-water mark of the timestamp cache is too high for it to ever create its transaction record. Note that PushTxnRequest has the isWrite flag, so we know that n1,r63 was the leaseholder of the range when these requests were evaluated.
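
For reference, a simplified sketch of the decision those "can create (synth)" log lines report (hypothetical names; not the real batcheval code): if the timestamp cache's low-water mark covering the txn record key exceeds the transaction's timestamp, the record can never be created, so a pusher treats the transaction as ABORTED.

package main

import "fmt"

// Timestamp is a stand-in for an HLC timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// canCreateTxnRecord sketches the check reported by the log lines above: a
// transaction record may only be created if the timestamp cache has not
// already been bumped past the transaction's timestamp on its record key.
// Otherwise the record is deemed uncreatable (the
// ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY case) and a pusher
// may conclude the transaction is ABORTED.
func canCreateTxnRecord(txnTS, tsCacheLowWater Timestamp) (bool, string) {
	if txnTS.Less(tsCacheLowWater) {
		return false, "ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY"
	}
	return true, ""
}

func main() {
	// A low-water mark above the txn timestamp makes the record uncreatable...
	fmt.Println(canCreateTxnRecord(Timestamp{100}, Timestamp{200}))
	// ...and that verdict is only sound if the low-water mark covering this
	// key never effectively regresses.
	fmt.Println(canCreateTxnRecord(Timestamp{100}, Timestamp{50}))
}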

We then see one of the pushers use this as the basis for removing the intent that goes missing:

I190328 13:29:22.731063 88300 storage/engine/mvcc.go:2317  [n2,s2,r21/3:/Table/53{-/1/"8c4c…}] resolving intent {Span:/Table/53/1/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/0 Txn:{ID:27a9a11f-8765-437b-a6a2-9a7488c5a292 Key:[190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] Epoch:0 Timestamp:1553779762.501650162,0 Priority:1747296 Sequence:6 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} Status:ABORTED XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} txn:<id:27a9a11f-8765-437b-a6a2-9a7488c5a292 key:"\276\212\0222797a515-e27f-4bb1-b4a0-749ef38b0c2d\000\001\022/default\000\001\0229_429\000\001\210" timestamp:<wall_time:1553779762501650162 > priority:1747296 sequence:6 > timestamp:<wall_time:1553779762501650162 logical:0 > deleted:true key_bytes:12 val_bytes:0  (false,false)

(notice Status:ABORTED)

Shortly afterwards, we start seeing QueryTxn requests directed at n2,r22, which determine that the transaction is still PENDING. This should not be possible!

I190328 13:29:22.779483 88303 storage/batcheval/transaction.go:145  can create (synth) true (nil) {27a9a11f-8765-437b-a6a2-9a7488c5a292 [190 138 18 50 55 57 55 97 53 49 53 45 101 50 55 102 45 52 98 98 49 45 98 52 97 48 45 55 52 57 101 102 51 56 98 48 99 50 100 0 1 18 47 100 101 102 97 117 108 116 0 1 18 57 95 52 50 57 0 1 136] 0 1553779762.501650162,0 1747296 6 {} 0}
I190328 13:29:22.779546 88303 storage/batcheval/cmd_query_txn.go:72  [n2,s2,r22/2:/Table/54{-/2/"807e…}] query txn id=27a9a11f key=/Table/54/2/"2797a515-e27f-4bb1-b4a0-749ef38b0c2d"/"/default"/"9_429"/0 rw=true pri=0.08136481 stat=PENDING epo=0 ts=1553779762.501650162,0 orig=1553779762.501650162,0 max=0.000000000,0 wto=false seq=6

So a transaction record that was determined uncreatable by the timestamp cache is no longer determined uncreatable? What is happening? If we look closely, the QueryTxn isn't evaluated on the same node as the PushTxn (n2 vs. n1). It's also not evaluated on the same Range (r22 vs. r63). So the range must have been merged. The logs confirm this:

I190328 13:29:22.697834 88196 storage/replica_command.go:424  [n2,merge,s2,r22/2:/Table/54{-/1/"/def…}] initiating a merge of r63:/Table/54/{1/"/default"/"8_176"-2/"807e6d6e-dfbe-4aa9-a53d-fa5cc4fb53f3"/"/default"/"1_364"/0} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=1] into this range (lhs+rhs has (size=313 KiB+205 KiB qps=4.81+0.00 --> 4.81qps) below threshold (size=518 KiB, qps=4.81))

I went back to the logs from the previous repro and saw a similarly timed Range merge that removed the Range containing the transaction record.

So we're seeing what looks to be a timestamp cache regression due to a Range merge. Stranger things have happened. Time to go learn how timestamp cache updates are supposed to work during Range merges.

Another wild theory is that a QueryTxn is going to a follower and being processed as a follower read.

This theory didn't pan out, but it still seems like a problem. I opened an issue here: https://github.com/cockroachdb/cockroach/issues/36276.

tbg commented 5 years ago

So we're seeing what looks to be a timestamp cache regression due to a Range merge.

ooh, that's good. The leaseholder for the RHS of a merge doesn't have to be colocated with the leaseholder for the LHS.

nvanbenschoten commented 5 years ago

The leaseholder for the RHS of a merge doesn't have to be colocated with the leaseholder for the LHS.

Do we pull a timestamp from the leaseholder of the RHS after it has been frozen and use that to bump the low-watermark of the RHS of the merged range?

tbg commented 5 years ago

We pull the timestamp:

https://github.com/cockroachdb/cockroach/blob/56b79f1ca9f874f89f77a4d2abe39e21dfd3ce15/pkg/storage/replica_command.go#L501

tbg commented 5 years ago

and then we do this:

https://github.com/cockroachdb/cockroach/blob/fcfd9b8edd1a8bd4e5527abc98318b1670a99592/pkg/storage/store.go#L2305-L2321

One question I have is whether this actually happens when the merge is applied through a snapshot. I don't suspect that this happened here, though.

nvanbenschoten commented 5 years ago

I just got another reproduction with extra logging. This again showed a transaction being declared "uncommittable" with a timestamp of 1553817538.625903013 in response to a PushTxn request evaluated on r64/n3, like we saw before. We then see a merge choose a freeze timestamp of 1553817538.695557050 during the evaluation of its Subsume request on r64/n3. We see the subsumer r22/n1 apply the timestamp cache update with a timestamp of 1553817538.695557050. Finally, we see r22/n1 allow the creation of the transaction record.

So how is this allowed? Well, we probably should have seen it earlier. We set the low-water mark of the timestamp cache for the global keyspace, but not the local keyspace. So there's nothing trying to prevent this from happening!

We need to set the low-water mark of the timestamp cache for all of the Range's keyspace like we do with lease transfers: https://github.com/cockroachdb/cockroach/blob/d03a34e92d2ee558fb6aedb0709b733a1fab97f4/pkg/storage/replica_proposal.go#L251-L253
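
A minimal sketch of that idea (hypothetical types and helpers; not the actual patch): when the merge applies, bump the subsuming range's timestamp cache to the freeze timestamp over the RHS's entire keyspace, including the range-local keys where transaction records live, not just its global key span.

package main

import "fmt"

// Timestamp and Span are stand-ins for the real hlc.Timestamp and roachpb.Span.
type Timestamp struct{ WallTime int64 }
type Span struct{ Key, EndKey string }

// tsCache is a toy timestamp cache that only tracks per-span low-water marks.
type tsCache map[Span]Timestamp

func (c tsCache) SetLowWater(sp Span, ts Timestamp) { c[sp] = ts }

// applyMergeToTSCache sketches the fix described above: when the merge
// applies, the subsuming range bumps its timestamp cache to the RHS's freeze
// timestamp over *all* of the RHS's keys -- the global span (user data) and
// the range-local span (which holds transaction records). Bumping only the
// global span is what let the transaction record be created after pushers
// had already been told it never could be.
func applyMergeToTSCache(c tsCache, rhsGlobal, rhsLocal Span, freezeTS Timestamp) {
	c.SetLowWater(rhsGlobal, freezeTS) // what already happens today
	c.SetLowWater(rhsLocal, freezeTS)  // the missing piece: txn record keys
}

func main() {
	c := tsCache{}
	applyMergeToTSCache(c,
		Span{Key: "/Table/54/1", EndKey: "/Table/55"},                         // RHS global keys (illustrative)
		Span{Key: "/Local/Range/Table/54/1", EndKey: "/Local/Range/Table/55"}, // RHS range-local keys (illustrative)
		Timestamp{WallTime: 1553817538695557050},
	)
	fmt.Println(c)
}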

nvanbenschoten commented 5 years ago

I'm upstreaming the changes I made to filesystem_simulator that helped me reproduce this more readily and debug the resulting output: https://github.com/scaledata/rksql/pull/3.