The roachtest failed because ~backup~ restore job 890771612152594435 had yet to complete after an hour, according to FAILED_24_run-verify-some-backups.log. I'm going to check if this backup was stuck or if the test runner should chill out with its restore runtime expectations.
According to the logs, this particular job was attempting to restore the tpcc database:
1.unredacted/cockroach.log:I230813 07:38:42.912598 6525 1@util/log/event_log.go:32 ⋮ [n1,job=‹RESTORE id=890771612152594435›] 138 ={"Timestamp":1691912322912577158,"EventType":"restore","JobID":890771612152594435,"JobType":"RESTORE","Description":"‹RESTORE DATABASE tpcc FROM 'gs://cockroachdb-backup-testing/mixed-version/5_current-to-22.2.9_database-tpcc_full-planned-and-executed-on-22.2.9-incremental-planned-and-executed-on-current_XyDF/2023/08/13-073348.67?AUTH=implicit' WITH detached, new_db_name = 'restore_5_current_to_22_2_9_database_tpcc_full_planned_and_executed_on_22_2_9_incremental_planned_and_executed_on_current_2'›","User":"‹root›","DescriptorIDs":[188,191,193,192,197,190,196,187,189,195,194],"Status":"running"}
The test runner only initiates tpcc100, which should create a backup with less than 10GB of data. I suspect something is actually wrong with this restore. I expect a 10GB restore to take less than 5 minutes to complete.
FWIW, this has happened before: https://github.com/cockroachdb/cockroach/issues/106370#issuecomment-1628692018, and we also couldn't determine the root cause back then (internal discussion). In both instances, the cluster was running 22.2.*, which suggests the problem is more likely to occur in that release series.
Sadly, the backup is gone, so we can't determine if this would reproduce if we attempted a new restore. We really need to update this test to move failed backups to a bucket with longer TTL.
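For reference, a hedged sketch of how a failing run's backup could be copied out of the testing bucket before its TTL expires (the destination prefix is an assumption; the source prefix is the collection from the job above):
# copy the whole backup collection into the longer-TTL scratch bucket
gsutil -m cp -r \
  'gs://cockroachdb-backup-testing/mixed-version/5_current-to-22.2.9_database-tpcc_full-planned-and-executed-on-22.2.9-incremental-planned-and-executed-on-current_XyDF' \
  'gs://cockroach-tmp/108676/'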
@renatolabs hm, the link you posted suggests that roachtest had a different error with a clear root cause around SHOW GRANTS not working for offline descriptors. Perhaps I'm missing something?
That issue is a bit messy; most of the failures are related to the (now solved) issue of calling SHOW GRANTS
during a restore. One of the failures in that thread (the one I linked in my comment above), however, is different, and it's a restore timeout just like this one.
@adityamaru took a look at that failure at the time but I believe that nothing conclusive came out of it (see short Slack interaction about it on the link above).
Looks like the problem is on node 4. The other nodes finish their SSTable adds fairly quickly, but after a certain point node 4 is only adding them very slowly. But we're not sure yet if AddSSTable is returning slowly, or only being called slowly.
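One way to eyeball this from node 4's logs is to compare when AddSSTable requests are sent with when they complete (the log path and message fragments are assumptions, based on the greps used later in this thread):
grep "AddSSTable" 4.unredacted/cockroach.log | grep "sending"  | tail -n 40   # when requests go out
grep "AddSSTable" 4.unredacted/cockroach.log | grep "and took" | tail -n 40   # when they return, and how long they took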
As another oddity, this appears to actually be a mixed backup test. By the time we get to the restore, all nodes are running on 22.2. Looking at the restore-mixed-version test file I'm not sure why this would be.
[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:377: ---------- STARTING (24): run "verify some backups" ----------
[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:391: binary versions: [1: 22.2, 2: 22.2, 3: 22.2, 4: 22.2]
[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:392: cluster versions: [1: 22.2, 2: 22.2, 3: 22.2, 4: 22.2]
[mixed-version-test/24_run-verify-some-backups] 07:38:21 mixed_version_backup.go:2019: verifying 2 out of 5 backups in mixed version
[mixed-version-test/24_run-verify-some-backups] 07:38:21 mixed_version_backup.go:1892: mixed-version: verifying 2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current
[mixed-version-test/24_run-verify-some-backups] 07:38:21 helper.go:51: running SQL statement:
CREATE DATABASE restore_2_22_2_9_to_current_table_bank_bank_all_planned_on_22_2_9_executed_on_current_1
Args: []
Node: 1
[mixed-version-test/24_run-verify-some-backups] 07:38:21 helper.go:51: running SQL statement:
SHOW BACKUP LATEST IN 'gs://cockroachdb-backup-testing/mixed-version/2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current_bOWI?AUTH=implicit' WITH check_files, encryption_passphrase = 'ALXwMsawFMFrJSAeCGWKWs1k4bLLIhueaxz'
Args: []
Node: 3
[mixed-version-test/24_run-verify-some-backups] 07:38:22 helper.go:42: running SQL statement:
RESTORE TABLE bank.bank FROM LATEST IN 'gs://cockroachdb-backup-testing/mixed-version/2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current_bOWI?AUTH=implicit' WITH detached, into_db = 'restore_2_22_2_9_to_current_table_bank_bank_all_planned_on_22_2_9_executed_on_current_1', encryption_passphrase = 'ALXwMsawFMFrJSAeCGWKWs1k4bLLIhueaxz'
Args: []
Node: 1
As a next step, I'm going to start studying the expected log lines from restore_data_processor.go to try and figure out where this delay is happening.
Makes sense! Separately, I tried twice to repro this with the same random seed, and wasn't able to; there's also nothing in the logs to give clues. I think we either ask KV to take a look - though I don't know what they'd look for, honestly - or we close CNR. Thoughts?
I'd open a tmux session to run roachtest run .... --count 20 and also ensure that the same seed is used on all iterations. It will take a couple hours, but it will be thorough :).
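A minimal sketch of that loop, assuming a locally built roachtest binary and with $SEED standing in for the seed recorded in the failed run's artifacts:
tmux new-session -d -s backup-repro \
  "COCKROACH_RANDOM_SEED=$SEED ./bin/roachtest run backup-restore/mixed-version --count 20 2>&1 | tee repro.log"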
Could not reproduce, even when running COCKROACH_RANDOM_SEED=-7365949933588826600 roachtest run backup-restore/mixed-version --count 20. @renatolabs, any chance you have other suggestions on how we might repro?
I ran this test 10 more times yesterday afternoon and, while I also couldn't get a test failure, I noticed that in one of the runs, the tpcc database restore attempted in step 24 got dangerously close to timing out (it succeeded after about 50 mins). The same restore finishes in about 2 mins in other runs.
I uploaded that backup, along with another similar backup taken in a different test run where the restore succeeds quickly, to the cockroach-tmp bucket (with a 30 day TTL):
Restoring these backups on 1-node clusters reproduces the slowness:
$ time roachprod sql $CLUSTER -- -e "RESTORE DATABASE tpcc FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_FAST?AUTH=implicit'"
Warning: Permanently added '104.196.100.135' (ECDSA) to the list of known hosts.
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+----------+---------------+-------------
891490749488005121 | succeeded | 1 | 50337175 | 6028018 | 7259727053
(1 row)
Time: 66.783s
Connection to 104.196.100.135 closed.
real 1m9.286s
user 0m1.114s
sys 0m0.169s
The restore time above is what happens in most runs. However, this other backup (of a comparable tpcc database) takes much longer:
$ time roachprod sql $CLUSTER -- -e "RESTORE DATABASE tpcc FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_SLOW?AUTH=implicit'"
Warning: Permanently added '104.196.100.135' (ECDSA) to the list of known hosts.
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+----------+---------------+-------------
891491705730957313 | succeeded | 1 | 50338675 | 6028188 | 7259791267
(1 row)
Time: 2736.658s
Connection to 104.196.100.135 closed.
real 45m39.076s
user 0m1.072s
sys 0m0.120s
Restoring this backup is very slow on all 22.2 releases I tested (including the latest one); restore finishes in under 2 mins in 23.1 releases.
Since 22.2 is still supported, it might be worth digging further.
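A rough sketch of the version sweep described above (the version list and $CLUSTER are assumptions; the RESTORE statement is the one used earlier):
for v in v22.2.9 v22.2.13 v23.1.8; do
  roachprod wipe $CLUSTER
  roachprod stage $CLUSTER:1 release $v
  roachprod start $CLUSTER
  time roachprod sql $CLUSTER:1 -- -e "RESTORE DATABASE tpcc FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_SLOW?AUTH=implicit'"
done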
Something is up with the read path of restore. While analyzing the logs from Ben's single-node repro of the failure, with verbose logging on sst_batcher, split_and_scatter_processor and restore_data_processor, I learned that:
- Very little time is spent within each sst_batcher.addSSTable() call, suggesting the blockage is above kv's handling of AddSSTable requests. Run the following to confirm:
> cat cockroach.log | grep "AddSSTable" | grep -v "sending" | grep -v "and took" | tail -n 40
- The last restoreSpanEntry began ingesting 2 minutes into the restore! This implies that one of the last few restoreSpanEntries took a terribly long time to read or rewrite before ingestion. (I say last few because the restore data processor conducts work with several workers.) Run the following to confirm:
> cat cockroach.log | grep "restore_data_processor" | tail -n 40
We're still unsure why this backup takes so long to restore. Recall that this backup was taken in a mixed version state: 3 nodes were on 22.2.9 and one was on 23.1. Perhaps something in the read path (some pebble iterator, perhaps, or the key rewriter) is tripping over these mixed version keys. The slowness is unlikely to be in the cloud reader path, because we're processing data in bytes -- nothing has been decoded into cockroach keys yet.
Here's a script that can be used to repro the failure:
#!/bin/bash
set -e

# Create a single-node cluster running the 22.2 release under investigation.
roachprod create $CLUSTER --nodes=1
roachprod stage $CLUSTER:1 release v22.2.13
roachprod start $CLUSTER

# Turn up verbose logging on the restore read/ingest path.
roachprod sql $CLUSTER:1 -- -e "SELECT * FROM crdb_internal.set_vmodule('sst_batcher=5,restore_data_processor=4,split_and_scatter_processor=5')"

# Limit restore concurrency on the node so the slow span is easier to isolate in the logs.
roachprod sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.bulk_io_write.restore_node_concurrency=1"
roachprod adminui $CLUSTER:1 --open

# Restore just the slow table, pinned to the timestamp used in the original repro.
time roachprod sql $CLUSTER:1 -- -e "RESTORE TABLE tpcc.order_line FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_SLOW?AUTH=implicit' AS OF SYSTEM TIME '2023-08-15 18:54:23.234682'"

# Collect logs for analysis.
roachprod run $CLUSTER:1 -- "./cockroach debug zip debug.zip --insecure"
roachprod get $CLUSTER:1 debug.zip debug.zip
Some ChatGPT-assisted log parsing reveals that before the slowdown, there's usually less than a second between sst_batcher.addSSTable() calls, but then, during the processing of the last span in the backup:
I230817 21:19:33.886123 1486 ccl/backupccl/restore_data_processor.go:319 ⋮ [n1,job=892064802043592705] 1973 ingesting span [/Table/183/1/‹21›/‹8›/‹-94›/‹11›/‹NULL›-/Table/183/1/‹42›/‹4›/‹-953›/‹5›/‹NULL›)
about 8 sst requests are sent very far apart:
Timestamp: 21:21:53.441827, Span: [/Table/114/1/‹92›/‹10›/‹-604›/‹5›/‹0›, /Table/114/1/‹93›/‹10›/‹-2777›/‹3›/‹0›/‹NULL›) added 1 files and took 304.240529ms, Time Difference: 124.293818000 seconds
Timestamp: 21:28:25.622093, Span: [/Table/114/1/‹93›/‹10›/‹-2777›/‹4›/‹0›, /Table/114/1/‹94›/‹9›/‹-1894›/‹8›/‹0›/‹NULL›) added 1 files and took 292.107404ms, Time Difference: 392.180266000 seconds
Timestamp: 21:33:58.219608, Span: [/Table/114/1/‹94›/‹9›/‹-1894›/‹9›/‹0›, /Table/114/1/‹95›/‹8›/‹-1040›/‹6›/‹0›/‹NULL›) added 1 files and took 287.385786ms, Time Difference: 332.597515000 seconds
Timestamp: 21:40:18.729227, Span: [/Table/114/1/‹95›/‹8›/‹-1040›/‹7›/‹0›, /Table/114/1/‹96›/‹7›/‹-156›/‹4›/‹0›/‹NULL›) added 1 files and took 289.886978ms, Time Difference: 380.509619000 seconds
Timestamp: 21:46:32.679907, Span: [/Table/114/1/‹96›/‹7›/‹-156›/‹5›/‹0›, /Table/114/1/‹97›/‹7›/‹-2264›/‹13›/‹0›/‹NULL›) added 1 files and took 305.491539ms, Time Difference: 373.950680000 seconds
Timestamp: 21:53:16.360898, Span: [/Table/114/1/‹97›/‹7›/‹-2263›/‹1›/‹0›, /Table/114/1/‹98›/‹6›/‹-1401›/‹4›/‹0›/‹NULL›) added 1 files and took 288.232483ms, Time Difference: 403.680991000 seconds
Timestamp: 21:59:06.637450, Span: [/Table/114/1/‹98›/‹6›/‹-1401›/‹5›/‹0›, /Table/114/1/‹99›/‹5›/‹-590›/‹1›/‹0›/‹NULL›) added 1 files and took 294.095981ms, Time Difference: 350.276552000 seconds
Timestamp: 22:02:31.093073, Span: [/Table/114/1/‹99›/‹5›/‹-590›/‹2›/‹0›, /Table/114/1/‹99›/‹10›/‹-2›/‹9›/‹0›/‹NULL›) added 1 files and took 175.684469ms, Time Difference: 204.455623000 seconds
After those slow guys ingest, the last 50 or so requests are sent quite speedily. I suspect the decoding or rekeying logic gets slowed down by something in the backup data for this span.
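For reference, a rough sketch of the kind of log parsing that produces the gaps above, assuming the "... added N files and took ..." lines carry the standard cockroach.log prefix (field 2 is HH:MM:SS.ffffff; midnight rollover is ignored):
grep "AddSSTable" cockroach.log | grep "and took" |
awk '{
  split($2, t, ":")
  secs = t[1]*3600 + t[2]*60 + t[3]                 # seconds since midnight
  if (prev != "") printf "%s  gap=%.1fs\n", $2, secs - prev
  prev = secs
}'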
Some next steps:
I can reproduce this by restoring just the full backup. The same key spans above take forever to load. I think the next step is to inspect these keys in the backup.
Removing the release blocker label, as we deemed that the high-level issue -- a 22.2 restore of a mixed-version backup being slow -- should not block the release.
Relabeling with the branch-release-22 label, as the slowness cannot be reproduced on 23.1, even when the exact same restore span entries are used with simpleImportSpans set to true.
There is also no reason to believe the mixed version state of the backup had any effect here.
After a bunch more experimentation, we're passing this to storage to diagnose why the ExternalSSTIterator is so slow in 22.2 in this pathological case. To copy and paste from an internal Slack thread:
TL;DR: iterating the ~550mb using ExternalSSTReader in these two sstables takes 11s on 23.1 and who-knows-how-many minutes on 22.2, which is causing a RESTORE to time out. We have a test in which we open two sstables hosted on GCS using ExternalSSTReader and iterate the returned iterator. The SSTs are 425mb and 141mb. We initially see an iteration rate around ~1-4M keys/s for the first second, until ~3.5M keys iterated, then we hit a wall and start iterating at 6 or 7 keys per second, with the test timing out after a minute. The exact same test (just with s/MakePrettyScannerForNamedTables(/MakePrettyScannerForNamedTables(roachpb.SystemTenantID/s to rebase it) on 23.1 doesn't do this; it passes (on my garbage home internet) in 11s. The reproduction is currently in https://github.com/cockroachdb/cockroach/pull/109475/files and can be run with ./dev test pkg/ccl/backupccl -f TestPathelogicalIngest -v --show-logs -- --test_output=streamed.
Also worth noting that no rangekeys were present in the ssts.
I can reproduce this, though when the slowdown happens varies. I think this is some artifact of slow IO from the external storage. Details below, and code changes are in https://github.com/sumeerbhola/cockroach/commits/slow_external_reads
Running with only one of the sst files (either one) is fast when reading from external storage. I suspected there may be some RANGEDEL or similar data-dependent artifact, so I copied them locally. They contain only point keys [1], specifically SET.
Running locally is fast (see TestIterateBadFilesLocal) -- finishes in 3s.
Added stats to compare iteration stats from local and remote run, and they are comparable. Remote run (note that in this case the slowness started after 20+s, at 3.5+M keys; the iterator stats increase slows down as expected):
restore_data_processor_test.go:260: 0.06s: read 3570000 keys reading at 163673.71 keys/s last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 1.18s: read 3570253 keys reading at 214.30 keys/s last read /Table/183/1/93/7/-2127/5/0/1692125641.385105387,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29867, 3570252), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29867, 3570252), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 1.15s: read 3570257 keys reading at 3.48 keys/s last read /Table/183/1/93/7/-2126/3/0/1692125582.145177455,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29871, 3570256), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29871, 3570256), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 79 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 1.17s: read 3570261 keys reading at 3.42 keys/s last read /Table/183/1/93/7/-2126/7/0/1692125582.145177455,0
local run (the stats at "read 3570000 keys" match exactly):
restore_data_processor_test.go:183: 0.00s: read 3570000 keys reading at 6111692.40 keys/s last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1
restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:183: 0.03s: read 3580000 keys reading at 312886.22 keys/s last read /Table/183/1/93/7/-1144/13/0/1692124099.559740678,1
restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 114 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:183: 0.00s: read 3590000 keys reading at 6228752.17 keys/s last read /Table/183/1/93/7/-147/13/0/1692124099.559740678,1
[1]
pebble sstable properties ~/slow_read/891470709612773378.sst
/Users/sumeer/slow_read/891470709612773378.sst
size
file 425MB
data 425MB
blocks 14162
index 442KB
blocks 3
top-level 61B
filter 0B
raw-key 490MB
raw-value 1.3GB
pinned-key 0
pinned-val 0
point-del-key-size 0
point-del-value-size 0
records 15325010
set 15325010
delete 0
delete-sized 0
range-delete 0
range-key-set 0
range-key-unset 0
range-key-delete 0
merge 0
global-seq-num 0
pinned 0
index
key value comparer cockroach_comparator
merger -
filter -
prefix false
whole-key false
compression Snappy
options window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;
user properties
collectors []
pebble sstable properties ~/slow_read/891470891329912835.sst
/Users/sumeer/slow_read/891470891329912835.sst
size
file 141MB
data 141MB
blocks 3714
index 116KB
blocks 1
top-level 0B
filter 0B
raw-key 180MB
raw-value 309MB
pinned-key 0
pinned-val 0
point-del-key-size 0
point-del-value-size 0
records 6535373
set 6535373
delete 0
delete-sized 0
range-delete 0
range-key-set 0
range-key-unset 0
range-key-delete 0
merge 0
global-seq-num 0
pinned 0
index
key value comparer cockroach_comparator
merger -
filter -
prefix false
whole-key false
compression Snappy
options window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;
user properties
collectors []
Is it possible once we exhaust one of the files, we're suffering index block seeks on all subsequent keys? IIRC, this is what the externalOnly flag was supposed to help avoid in 22.2. We wouldn't see this behavior at all on 23.1 because it uses NextPrefix, avoiding any SeekGEs during NextKey.
There's something odd about the stats in the completed local run: block-bytes, which are compressed, are much higher than the bytes in the file.
restore_data_processor_test.go:183: 0.00s: read 5490000 keys reading at 6316453.03 keys/s last read /Table/183/1/99/10/-785/12/0/1692124106.331779007,1
restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 46325, 5489999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 46325, 5489999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 2.4 G, cached 0 B)), (points: (count 5.5 M, key-bytes 115 M, value-bytes 275 M, tombstoned: 0))
I take back what I said earlier about slow IO. It isn't getting slower, but we need to read significantly more block-bytes, so the same IO speed as before is no longer sufficient to keep up. Normally, we would explain this by key spans with huge values, which may be the case here, but the aggregate of 2.4G above is odd given that the two files total only ~566MB on disk.
restore_data_processor_test.go:260: 0.11s: read 3550000 keys reading at 95161.47 keys/s last read /Table/183/1/93/6/-1126/6/0/1692124098.042275933,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3549999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3549999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 77 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 178 M, tombstoned: 0))
restore_data_processor_test.go:260: 0.08s: read 3560000 keys reading at 119299.89 keys/s last read /Table/183/1/93/6/-120/1/0/1692124098.042275933,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3559999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3559999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 178 M, tombstoned: 0))
restore_data_processor_test.go:260: 0.07s: read 3570000 keys reading at 144537.56 keys/s last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.08s: read 3570287 keys reading at 28.48 keys/s last read /Table/183/1/93/7/-2123/2/0/1692125454.143734924,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29901, 3570286), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29901, 3570286), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 83 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.22s: read 3570326 keys reading at 3.82 keys/s last read /Table/183/1/93/7/-2120/11/0/1692125250.630582324,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29940, 3570325), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29940, 3570325), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 89 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.02s: read 3570364 keys reading at 3.79 keys/s last read /Table/183/1/93/7/-2116/6/0/1692125146.357016976,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29978, 3570363), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29978, 3570363), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 94 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.25s: read 3570402 keys reading at 3.71 keys/s last read /Table/183/1/93/7/-2113/10/0/1692125043.149070644,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30016, 3570401), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30016, 3570401), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 99 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.20s: read 3570441 keys reading at 3.82 keys/s last read /Table/183/1/93/7/-2109/12/0/1692124772.855039354,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30055, 3570440), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30055, 3570440), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 105 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 10.05s: read 3570479 keys reading at 3.78 keys/s last read /Table/183/1/93/7/-2104/3/0/1692124526.230587730,0
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30093, 3570478), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30093, 3570478), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 110 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 8.17s: read 3580000 keys reading at 1165.84 keys/s last read /Table/183/1/93/7/-1144/13/0/1692124099.559740678,1
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 114 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
restore_data_processor_test.go:260: 0.01s: read 3590000 keys reading at 1860162.30 keys/s last read /Table/183/1/93/7/-147/13/0/1692124099.559740678,1
restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30123, 3589999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3589999), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 115 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 180 M, tombstoned: 0))
> IIRC, this is what the externalOnly flag was supposed to help avoid in 22.2
Do you mean forwardOnly? I see the following code in 22.2, but it's unclear why it is talking about already-exhausted sstables, since the mergingIter doesn't look at the flag and simply seeks the underlying sstable iterator.
// Check for another TrySeekUsingNext optimization opportunity, currently
// specifically tailored to external iterators. This case is intended to
// trigger in instances of Seek-ing with monotonically increasing keys with
// Nexts interspersed. At the time of writing, this is the case for
// CockroachDB scans. This optimization is important for external iterators
// to avoid re-seeking within an already-exhausted sstable. It is not always
// a performance win more generally, so we restrict it to external iterators
// that are configured to only use forward positioning operations.
//
// TODO(jackson): This optimization should be obsolete once we introduce and
// use the NextPrefix iterator positioning operation.
if seekInternalIter && i.forwardOnly && lastPositioningOp != invalidatedLastPositionOp &&
i.pos == iterPosCurForward && !hasPrefix && i.iterValidityState == IterValid &&
i.cmp(key, i.iterKey.UserKey) > 0 {
flags = flags.EnableTrySeekUsingNext()
if invariants.Enabled && flags.TrySeekUsingNext() && !i.forceEnableSeekOpt && disableSeekOpt(key, uintptr(unsafe.Pointer(i))) {
flags = flags.DisableTrySeekUsingNext()
}
}
> Is it possible once we exhaust one of the files, we're suffering index block seeks on all subsequent keys?
Your suspicion was correct. This is indeed what is happening: The larger sstable has the following index blocks:
445734675 index (153981)
445888661 index (114723)
446003389 top-index (61)
During the slow periods where we are reading ~40 keys in 10s, we are doing ~40 seeks. 40*100KB = 4MB, which is roughly what is being read during that interval.
The first and last key in the larger sstable are:
/Table/177/1/73/1/54/0/1692124094.525391784,0#0,SET
/Table/183/1/93/7/-2546/12/0/1692124098.042275933,0#0,SET
and in the smaller sstable:
/Table/183/1/33/5/-2281/8/0/1692124082.478596135,0#0,SET
/Table/183/1/99/10/-2/9/0/1692124106.331779007,1#0,SET
Is this problem adequately addressed with the pebble.Iterator.NextPrefix implementation for 23.1, or are there other use cases in CockroachDB that will explicitly call SeekGE repeatedly?
Sorry, yeah, that's the one I meant. I think the intention was to hit this case in the single-level iterator: https://github.com/cockroachdb/pebble/blob/2c3ac2dcf94b64460b9ff29b996d7cca1b2fe910/sstable/reader.go#L677-L684C4
But I don't think it works whatsoever for two-level iterators: https://github.com/cockroachdb/pebble/blob/2c3ac2dcf94b64460b9ff29b996d7cca1b2fe910/sstable/reader.go#L1598-L1605
> But I don't think it works whatsoever for two-level iterators:
I'll try to fix this in a backportable way next week.
Fixed by #109862.
roachtest.backup-restore/mixed-version failed with artifacts on release-23.1.9-rc @ d417968cd3eb53414a97b29692ba61cf1466276d:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=true, ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11300127-1691905326-21-n5cpu4/1691910341102/1691917022887)
/cc @cockroachdb/disaster-recovery
Jira issue: CRDB-30583