
roachtest: backup-restore/mixed-version failed -- a 22.2 restore of a mixed version backup (23.1,22.2) is slow #108676

Closed: cockroach-teamcity closed this issue 1 year ago

cockroach-teamcity commented 1 year ago

roachtest.backup-restore/mixed-version failed with artifacts on release-23.1.9-rc @ d417968cd3eb53414a97b29692ba61cf1466276d:

(mixedversion.go:478).Run: mixed-version test failure while running step 24 (run "verify some backups"): mixed-version: waiting for job to finish: job 890771612152594435: current status "running", waiting for "succeeded"
test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11300127-1691905326-21-n5cpu4/1691910341102/1691917022887)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-30583

msbutler commented 1 year ago

The roachtest failed because ~backup~ restore job 890771612152594435 had yet to complete after an hour, according to FAILED_24_run-verify-some-backups.log. I'm going to check whether this restore was stuck or whether the test runner should relax its restore runtime expectations.

msbutler commented 1 year ago

According to the logs, this particular job was attempting to restore the tpcc database:

1.unredacted/cockroach.log:I230813 07:38:42.912598 6525 1@util/log/event_log.go:32 ⋮ [n1,job=‹RESTORE id=890771612152594435›] 138 ={"Timestamp":1691912322912577158,"EventType":"restore","JobID":890771612152594435,"JobType":"RESTORE","Description":"‹RESTORE DATABASE tpcc FROM 'gs://cockroachdb-backup-testing/mixed-version/5_current-to-22.2.9_database-tpcc_full-planned-and-executed-on-22.2.9-incremental-planned-and-executed-on-current_XyDF/2023/08/13-073348.67?AUTH=implicit' WITH detached, new_db_name = 'restore_5_current_to_22_2_9_database_tpcc_full_planned_and_executed_on_22_2_9_incremental_planned_and_executed_on_current_2'›","User":"‹root›","DescriptorIDs":[188,191,193,192,197,190,196,187,189,195,194],"Status":"running"}

The test runner only initializes tpcc100, which should create a backup with less than 10 GB of data. I suspect something is actually wrong with this restore: I'd expect a 10 GB restore to take less than 5 minutes to complete.

renatolabs commented 1 year ago

FWIW, this has happened before: https://github.com/cockroachdb/cockroach/issues/106370#issuecomment-1628692018 and we also couldn't determine the root cause back then (internal discussion). In both instances, the cluster was running 22.2.*, which suggests this issue is more likely to occur in that release series.

Sadly, the backup is gone, so we can't determine whether this would reproduce if we attempted a new restore. We really need to update this test to move failed backups to a bucket with a longer TTL.

msbutler commented 1 year ago

@renatolabs hm, the link you posted suggests that the roachtest had a different error with a clear root cause: SHOW GRANTS not working for offline descriptors. Perhaps I'm missing something?

renatolabs commented 1 year ago

That issue is a bit messy; most of the failures are related to the (now solved) issue of calling SHOW GRANTS during a restore. One of the failures in that thread (the one I linked in my comment above), however, is different, and it's a restore timeout just like this one.

@adityamaru took a look at that failure at the time, but I believe nothing conclusive came out of it (see the short Slack interaction about it in the link above).

benbardin commented 1 year ago

Looks like the problem is on node 4. The other nodes finish their SSTable adds fairly quickly, but after a certain point node 4 is only adding them very slowly. We're not sure yet whether AddSSTable is returning slowly or only being called slowly.

[Screenshot 2023-08-14 at 12:35:45 PM]

As another oddity, this doesn't appear to actually be a mixed-version restore: by the time we get to the restore, all nodes are running 22.2. Looking at the mixed-version backup test file, I'm not sure why this would be.

[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:377: ---------- STARTING (24): run "verify some backups" ----------
[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:391: binary versions: [1: 22.2, 2: 22.2, 3: 22.2, 4: 22.2]
[mixed-version-test/24_run-verify-some-backups] 07:38:21 runner.go:392: cluster versions: [1: 22.2, 2: 22.2, 3: 22.2, 4: 22.2]
[mixed-version-test/24_run-verify-some-backups] 07:38:21 mixed_version_backup.go:2019: verifying 2 out of 5 backups in mixed version
[mixed-version-test/24_run-verify-some-backups] 07:38:21 mixed_version_backup.go:1892: mixed-version: verifying 2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current
[mixed-version-test/24_run-verify-some-backups] 07:38:21 helper.go:51: running SQL statement:
CREATE DATABASE restore_2_22_2_9_to_current_table_bank_bank_all_planned_on_22_2_9_executed_on_current_1
Args: []
Node: 1
[mixed-version-test/24_run-verify-some-backups] 07:38:21 helper.go:51: running SQL statement:
SHOW BACKUP LATEST IN 'gs://cockroachdb-backup-testing/mixed-version/2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current_bOWI?AUTH=implicit' WITH check_files, encryption_passphrase = 'ALXwMsawFMFrJSAeCGWKWs1k4bLLIhueaxz'
Args: []
Node: 3
[mixed-version-test/24_run-verify-some-backups] 07:38:22 helper.go:42: running SQL statement:
RESTORE TABLE bank.bank FROM LATEST IN 'gs://cockroachdb-backup-testing/mixed-version/2_22.2.9-to-current_table-bank.bank_all-planned-on-22.2.9-executed-on-current_bOWI?AUTH=implicit' WITH detached, into_db = 'restore_2_22_2_9_to_current_table_bank_bank_all_planned_on_22_2_9_executed_on_current_1', encryption_passphrase = 'ALXwMsawFMFrJSAeCGWKWs1k4bLLIhueaxz'
Args: []
Node: 1

As a next step, I'm going to start studying the expected log lines from restore_data_processor.go to try and figure out where this delay is happening.

msbutler commented 1 year ago

@benbardin something to consider: I actually think at step 24 we are still in a mixed-version state. After downgrading nodes, one still needs to finalize the upgrade. See step 3 here. This restore failed here, in the InMixedVersionState, before the test driver finalized any upgrade.

benbardin commented 1 year ago

Makes sense! Separately, I tried twice to repro this with the same random seed and wasn't able to; there's also nothing in the logs to give clues. I think we either ask KV to take a look - though I don't know what they'd look for, honestly - or we close this as CNR (cannot reproduce). Thoughts?

msbutler commented 1 year ago

I'd open a tmux session to run roachtest run .... --count 20 and also ensure that the same seed is used on all iterations. It will take a couple hours, but it will be thorough :).

benbardin commented 1 year ago

Could not reproduce, even running COCKROACH_RANDOM_SEED=-7365949933588826600 roachtest run backup-restore/mixed-version --count 20. @renatolabs, any chance you have other suggestions on how we might repro?

renatolabs commented 1 year ago

I ran 10 more runs of this test yesterday afternoon and, while I also couldn't get a test failure, I noticed that in one of the runs the tpcc database restore attempted in step 24 got dangerously close to timing out (succeeding after about 50 minutes). The same restore finishes in about 2 minutes in other runs.

I uploaded that backup, along with another similar backup taken in a different test run where the restore succeeds quickly, to the cockroach-tmp bucket (with a 30 day TTL):

https://console.cloud.google.com/storage/browser/cockroach-tmp/108676?project=cockroach-shared&supportedpurview=project

Restoring these backups on 1-node clusters reproduces the slowness:

$ time roachprod sql $CLUSTER -- -e "RESTORE DATABASE tpcc FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_FAST?AUTH=implicit'"
Warning: Permanently added '104.196.100.135' (ECDSA) to the list of known hosts.
        job_id       |  status   | fraction_completed |   rows   | index_entries |   bytes
---------------------+-----------+--------------------+----------+---------------+-------------
  891490749488005121 | succeeded |                  1 | 50337175 |       6028018 | 7259727053
(1 row)

Time: 66.783s

Connection to 104.196.100.135 closed.

real    1m9.286s
user    0m1.114s
sys     0m0.169s

The restore time above is what happens in most runs. However, this other backup (of a comparable tpcc database) takes much longer:

$ time roachprod sql $CLUSTER -- -e "RESTORE DATABASE tpcc FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_SLOW?AUTH=implicit'"
Warning: Permanently added '104.196.100.135' (ECDSA) to the list of known hosts.
        job_id       |  status   | fraction_completed |   rows   | index_entries |   bytes
---------------------+-----------+--------------------+----------+---------------+-------------
  891491705730957313 | succeeded |                  1 | 50338675 |       6028188 | 7259791267
(1 row)

Time: 2736.658s

Connection to 104.196.100.135 closed.

real    45m39.076s
user    0m1.072s
sys     0m0.120s
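
(For scale: both restores report roughly 7.26 GB of logical bytes, so the fast run above works out to about 110 MB/s while the slow one averages under 3 MB/s.)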

Restoring this backup is very slow on all 22.2 releases I tested (including the latest one); restore finishes in under 2 mins in 23.1 releases.

Since 22.2 is still supported, it might be worth digging further.

msbutler commented 1 year ago

Something is up with the read path of restore. While analyzing the logs from Ben's single-node repro of the failure, with verbose logging on sst_batcher, split_and_scatter_processor, and restore_data_processor, I learned that:

We're still unsure why this backup takes so long to restore. Recall that this backup was taken in a mixed-version state: 3 nodes were on 22.2.9 and one was on 23.1. Perhaps something in the read path (a pebble iterator, perhaps, or the key rewriter) is tripping over these mixed-version keys. The slowness is unlikely to be in the cloud reader path, because there we're processing data as raw bytes -- nothing has been decoded into cockroach keys yet.

msbutler commented 1 year ago

Here's a script that can be used to repro the failure:

#!/bin/bash

set -e

# Create a single-node cluster and stage the 22.2.13 binary.
roachprod create $CLUSTER --nodes=1
roachprod stage $CLUSTER:1 release v22.2.13
roachprod start $CLUSTER
# Turn on verbose logging for the restore read path.
roachprod sql $CLUSTER:1 -- -e "SELECT * FROM crdb_internal.set_vmodule('sst_batcher=5,restore_data_processor=4,split_and_scatter_processor=5')"
# Run restore workers one at a time so the slow span is easier to isolate in the logs.
roachprod sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.bulk_io_write.restore_node_concurrency=1"
roachprod adminui $CLUSTER:1 --open
# Restore just tpcc.order_line from the slow backup, pinned to a fixed timestamp, and time it.
time roachprod sql $CLUSTER:1 -- -e "RESTORE TABLE tpcc.order_line FROM LATEST IN 'gs://cockroach-tmp/108676/current-to-22.2.9_database-tpcc_SLOW?AUTH=implicit' AS OF SYSTEM TIME '2023-08-15 18:54:23.234682'"
# Collect a debug zip for later analysis.
roachprod run $CLUSTER:1 -- "./cockroach debug zip debug.zip --insecure"
roachprod get $CLUSTER:1 debug.zip debug.zip

msbutler commented 1 year ago

Some ChatGPT-assisted log parsing reveals that before the slowdown there's usually less than a second between sst_batcher.addSStable() calls, but then during the processing of the last span in the backup:

I230817 21:19:33.886123 1486 ccl/backupccl/restore_data_processor.go:319 ⋮ [n1,job=892064802043592705] 1973 ingesting span [/Table/183/1/‹21›/‹8›/‹-94›/‹11›/‹NULL›-/Table/183/1/‹42›/‹4›/‹-953›/‹5›/‹NULL›)

about 8 sst requests are sent very far apart:

Timestamp: 21:21:53.441827, Span:  [/Table/114/1/‹92›/‹10›/‹-604›/‹5›/‹0›, /Table/114/1/‹93›/‹10›/‹-2777›/‹3›/‹0›/‹NULL›) added 1 files and took 304.240529ms, Time Difference: 124.293818000 seconds
Timestamp: 21:28:25.622093, Span:  [/Table/114/1/‹93›/‹10›/‹-2777›/‹4›/‹0›, /Table/114/1/‹94›/‹9›/‹-1894›/‹8›/‹0›/‹NULL›) added 1 files and took 292.107404ms, Time Difference: 392.180266000 seconds
Timestamp: 21:33:58.219608, Span:  [/Table/114/1/‹94›/‹9›/‹-1894›/‹9›/‹0›, /Table/114/1/‹95›/‹8›/‹-1040›/‹6›/‹0›/‹NULL›) added 1 files and took 287.385786ms, Time Difference: 332.597515000 seconds
Timestamp: 21:40:18.729227, Span:  [/Table/114/1/‹95›/‹8›/‹-1040›/‹7›/‹0›, /Table/114/1/‹96›/‹7›/‹-156›/‹4›/‹0›/‹NULL›) added 1 files and took 289.886978ms, Time Difference: 380.509619000 seconds
Timestamp: 21:46:32.679907, Span:  [/Table/114/1/‹96›/‹7›/‹-156›/‹5›/‹0›, /Table/114/1/‹97›/‹7›/‹-2264›/‹13›/‹0›/‹NULL›) added 1 files and took 305.491539ms, Time Difference: 373.950680000 seconds
Timestamp: 21:53:16.360898, Span:  [/Table/114/1/‹97›/‹7›/‹-2263›/‹1›/‹0›, /Table/114/1/‹98›/‹6›/‹-1401›/‹4›/‹0›/‹NULL›) added 1 files and took 288.232483ms, Time Difference: 403.680991000 seconds
Timestamp: 21:59:06.637450, Span:  [/Table/114/1/‹98›/‹6›/‹-1401›/‹5›/‹0›, /Table/114/1/‹99›/‹5›/‹-590›/‹1›/‹0›/‹NULL›) added 1 files and took 294.095981ms, Time Difference: 350.276552000 seconds
Timestamp: 22:02:31.093073, Span:  [/Table/114/1/‹99›/‹5›/‹-590›/‹2›/‹0›, /Table/114/1/‹99›/‹10›/‹-2›/‹9›/‹0›/‹NULL›) added 1 files and took 175.684469ms, Time Difference: 204.455623000 seconds

After those slow requests ingest, the last 50 or so requests are sent quickly. I suspect the decoding or rekeying logic gets slowed down by something in the backup data for this span.
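
For reference, here is a minimal sketch of this kind of log parsing: it reads pre-filtered AddSSTable log lines on stdin and prints the gap between consecutive request timestamps. The regexp, time layout, and grep filter are illustrative assumptions, not the actual script used.

    // gap.go: print the time gap between consecutive AddSSTable log lines.
    // Usage (hypothetical): grep AddSSTable cockroach.log | go run gap.go
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
        "time"
    )

    func main() {
        // Matches the HH:MM:SS.micros timestamp in crdb log lines such as
        // "I230817 21:19:33.886123 1486 ccl/backupccl/restore_data_processor.go:319 ...".
        tsRE := regexp.MustCompile(`^[IWEF]\d{6} (\d{2}:\d{2}:\d{2}\.\d{6})`)
        sc := bufio.NewScanner(os.Stdin)
        sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // log lines can be long
        var prev time.Time
        for sc.Scan() {
            m := tsRE.FindStringSubmatch(sc.Text())
            if m == nil {
                continue
            }
            ts, err := time.Parse("15:04:05.000000", m[1])
            if err != nil {
                continue
            }
            if !prev.IsZero() {
                fmt.Printf("%s  gap since previous request: %s\n", m[1], ts.Sub(prev))
            }
            prev = ts
        }
    }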

Some next steps:

msbutler commented 1 year ago

I can reproduce this by restoring just the full backup. The same key spans above take forever to load. I think the next step is to inspect these keys in the backup.

msbutler commented 1 year ago

Removing the release blocker, as we deemed that the high-level issue (a 22.2 restore of a mixed-version backup is slow) should not block the release.

msbutler commented 1 year ago

Relabeling with the branch-release-22 label, as the slowness cannot be reproduced on 23.1, even when the exact same restore span entries are used and simpleImportSpans is set to true as well.

There is also no reason to believe the mixed version state of the backup had any effect here.

msbutler commented 1 year ago

After a bunch more experimentation, we're passing this to Storage to diagnose why the ExternalSSTReader iterator is so slow in 22.2 in this pathological case. To copy and paste from an internal Slack thread:

TL;DR: iterating the ~550 MB in these two sstables using ExternalSSTReader takes 11s on 23.1 and who-knows-how-many minutes on 22.2, which is causing a RESTORE to time out. We have a test in which we open two sstables hosted on GCS using ExternalSSTReader and iterate the returned iterator. The SSTs are 425 MB and 141 MB. We initially see an iteration rate of roughly 1-4M keys/s for the first second, until ~3.5M keys have been iterated; then we hit a wall and start iterating at 6 or 7 keys per second, with the test timing out after a minute. The exact same test on 23.1 (rebased with s/MakePrettyScannerForNamedTables(/MakePrettyScannerForNamedTables(roachpb.SystemTenantID/) doesn't do this; it passes (on my garbage home internet) in 11s. The reproduction is currently in https://github.com/cockroachdb/cockroach/pull/109475/files and can be run with ./dev test pkg/ccl/backupccl -f TestPathelogicalIngest -v --show-logs -- --test_output=streamed.

Also worth noting that no rangekeys were present in the ssts.
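
For context, here is a minimal sketch of a progress-reporting scan loop like the one used in the repro, generalized over any forward-only iterator. The kvIterator interface, package name, and reportEvery constant are illustrative assumptions; the actual test iterates the iterator returned by ExternalSSTReader.

    // Package scanprogress: step an iterator to exhaustion, reporting observed
    // throughput every reportEvery keys, mirroring the per-interval log lines below.
    package scanprogress

    import (
        "fmt"
        "time"
    )

    // kvIterator is a minimal stand-in for the iterator under test.
    type kvIterator interface {
        Valid() bool
        Next()
        Key() string
    }

    // ScanAndReport iterates until exhaustion, printing the observed rate
    // every reportEvery keys.
    func ScanAndReport(it kvIterator) {
        const reportEvery = 10000
        count := 0
        last := time.Now()
        for ; it.Valid(); it.Next() {
            count++
            if count%reportEvery == 0 {
                elapsed := time.Since(last).Seconds()
                fmt.Printf("%.2fs: read %d keys\treading at %.2f keys/s\tlast read %s\n",
                    elapsed, count, float64(reportEvery)/elapsed, it.Key())
                last = time.Now()
            }
        }
    }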

sumeerbhola commented 1 year ago

I can reproduce this, though when the slowdown happens varies. I think this is some artifact of slow IO from the external storage. Details below, and code changes in https://github.com/sumeerbhola/cockroach/commits/slow_external_reads

Running with only one of the sst files (either one) is fast when reading from external storage. I suspected there might be some RANGEDEL or similar data-dependent artifact, so I copied them locally. They contain only point keys [1], specifically SETs.

Running locally is fast (see TestIterateBadFilesLocal) -- finishes in 3s.

I added stats to compare iteration stats between the local and remote runs, and they are comparable. Remote run (note that in this case the slowness started after 20+ seconds, at 3.5M+ keys; the rate at which the iterator stats increase slows down, as expected):

restore_data_processor_test.go:260: 0.06s: read 3570000 keys     reading at 163673.71 keys/s     last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1

restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))

restore_data_processor_test.go:260: 1.18s: read 3570253 keys     reading at 214.30 keys/s    last read /Table/183/1/93/7/-2127/5/0/1692125641.385105387,0

restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29867, 3570252), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29867, 3570252), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))

restore_data_processor_test.go:260: 1.15s: read 3570257 keys     reading at 3.48 keys/s      last read /Table/183/1/93/7/-2126/3/0/1692125582.145177455,0

restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29871, 3570256), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29871, 3570256), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 79 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))

restore_data_processor_test.go:260: 1.17s: read 3570261 keys     reading at 3.42 keys/s      last read /Table/183/1/93/7/-2126/7/0/1692125582.145177455,0

local run (the stats at "read 3570000 keys" match exactly):

restore_data_processor_test.go:183: 0.00s: read 3570000 keys     reading at 6111692.40 keys/s    last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1

restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))

restore_data_processor_test.go:183: 0.03s: read 3580000 keys     reading at 312886.22 keys/s     last read /Table/183/1/93/7/-1144/13/0/1692124099.559740678,1

restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 114 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))

restore_data_processor_test.go:183: 0.00s: read 3590000 keys     reading at 6228752.17 keys/s    last read /Table/183/1/93/7/-147/13/0/1692124099.559740678,1

[1]

pebble sstable properties ~/slow_read/891470709612773378.sst
/Users/sumeer/slow_read/891470709612773378.sst
size                    
  file                  425MB
  data                  425MB
    blocks              14162
  index                 442KB
    blocks              3
    top-level           61B
  filter                0B
  raw-key               490MB
  raw-value             1.3GB
  pinned-key            0
  pinned-val            0
  point-del-key-size    0
  point-del-value-size  0
records                 15325010
  set                   15325010
  delete                0
  delete-sized          0
  range-delete          0
  range-key-set         0
  range-key-unset       0
  range-key-delete      0
  merge                 0
  global-seq-num        0
  pinned                0
index                   
  key                     value  comparer  cockroach_comparator
merger                  -
filter                  -
  prefix                false
  whole-key             false
compression             Snappy
  options               window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; 
user properties         
  collectors            []
pebble sstable properties ~/slow_read/891470891329912835.sst 
/Users/sumeer/slow_read/891470891329912835.sst
size                    
  file                  141MB
  data                  141MB
    blocks              3714
  index                 116KB
    blocks              1
    top-level           0B
  filter                0B
  raw-key               180MB
  raw-value             309MB
  pinned-key            0
  pinned-val            0
  point-del-key-size    0
  point-del-value-size  0
records                 6535373
  set                   6535373
  delete                0
  delete-sized          0
  range-delete          0
  range-key-set         0
  range-key-unset       0
  range-key-delete      0
  merge                 0
  global-seq-num        0
  pinned                0
index                   
  key                     value  comparer  cockroach_comparator
merger                  -
filter                  -
  prefix                false
  whole-key             false
compression             Snappy
  options               window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; 
user properties         
  collectors            []
jbowens commented 1 year ago

Is it possible once we exhaust one of the files, we're suffering index block seeks on all subsequent keys? IIRC, this is what the externalOnly flag was supposed to help avoid in 22.2. We wouldn't see this behavior at all on 23.1 because it uses NextPrefix, avoiding any SeekGEs during NextKey.
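
Here is a minimal sketch of the two access patterns being contrasted, assuming a *pebble.Iterator and a caller-supplied successor-key function; both names are illustrative, and in CockroachDB the real logic lives in the MVCC layer's NextKey rather than in loops like these.

    // Package seekpatterns: illustrative only, not the actual CockroachDB code path.
    package seekpatterns

    import "github.com/cockroachdb/pebble"

    // scanWithSeekGE approximates the 22.2-era pattern: advance to the next key
    // prefix by re-seeking to a computed successor key (nextKey must return a key
    // strictly greater than its argument). Once one constituent sstable of an
    // external iterator is exhausted, each such SeekGE can fall back to an
    // index-block seek in the remaining sstable.
    func scanWithSeekGE(it *pebble.Iterator, nextKey func(key []byte) []byte) int {
        n := 0
        for valid := it.First(); valid; valid = it.SeekGE(nextKey(it.Key())) {
            n++
        }
        return n
    }

    // scanWithNextPrefix approximates the 23.1 pattern: NextPrefix steps directly
    // to the next distinct user-key prefix without issuing a top-level seek.
    func scanWithNextPrefix(it *pebble.Iterator) int {
        n := 0
        for valid := it.First(); valid; valid = it.NextPrefix() {
            n++
        }
        return n
    }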

sumeerbhola commented 1 year ago

Something odd about the stats in the completed local run: block-bytes, which count compressed bytes, are much higher than the bytes in the file.

    restore_data_processor_test.go:183: 0.00s: read 5490000 keys     reading at 6316453.03 keys/s    last read /Table/183/1/99/10/-785/12/0/1692124106.331779007,1
    restore_data_processor_test.go:189: stats: (interface (dir, seek, step): (fwd, 46325, 5489999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 46325, 5489999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 2.4 G, cached 0 B)), (points: (count 5.5 M, key-bytes 115 M, value-bytes 275 M, tombstoned: 0))

I take back what I said earlier about slow IO. The IO isn't getting slower, but we need to read significantly more block-bytes, so the same IO speed as before is no longer sufficient to make progress. Normally we would explain this by key spans with huge values, which may be the case here, but the aggregate of 2.4 G above is odd.
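
(For scale: the two files are about 425 MB and 141 MB, roughly 566 MB combined, so 2.4 G of block-bytes means blocks are being read more than four times over on average.)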

    restore_data_processor_test.go:260: 0.11s: read 3550000 keys     reading at 95161.47 keys/s      last read /Table/183/1/93/6/-1126/6/0/1692124098.042275933,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3549999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3549999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 77 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 178 M, tombstoned: 0))
    restore_data_processor_test.go:260: 0.08s: read 3560000 keys     reading at 119299.89 keys/s     last read /Table/183/1/93/6/-120/1/0/1692124098.042275933,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3559999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3559999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 178 M, tombstoned: 0))
    restore_data_processor_test.go:260: 0.07s: read 3570000 keys     reading at 144537.56 keys/s     last read /Table/183/1/93/7/-2153/10/0/1692124099.559740678,1
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29863, 3569999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 78 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.08s: read 3570287 keys    reading at 28.48 keys/s     last read /Table/183/1/93/7/-2123/2/0/1692125454.143734924,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29901, 3570286), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29901, 3570286), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 83 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.22s: read 3570326 keys    reading at 3.82 keys/s      last read /Table/183/1/93/7/-2120/11/0/1692125250.630582324,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29940, 3570325), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29940, 3570325), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 89 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.02s: read 3570364 keys    reading at 3.79 keys/s      last read /Table/183/1/93/7/-2116/6/0/1692125146.357016976,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 29978, 3570363), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 29978, 3570363), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 94 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.25s: read 3570402 keys    reading at 3.71 keys/s      last read /Table/183/1/93/7/-2113/10/0/1692125043.149070644,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30016, 3570401), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30016, 3570401), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 99 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.20s: read 3570441 keys    reading at 3.82 keys/s      last read /Table/183/1/93/7/-2109/12/0/1692124772.855039354,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30055, 3570440), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30055, 3570440), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 105 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 10.05s: read 3570479 keys    reading at 3.78 keys/s      last read /Table/183/1/93/7/-2104/3/0/1692124526.230587730,0
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30093, 3570478), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30093, 3570478), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 110 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 8.17s: read 3580000 keys     reading at 1165.84 keys/s   last read /Table/183/1/93/7/-1144/13/0/1692124099.559740678,1
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3579999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 114 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 179 M, tombstoned: 0))
    restore_data_processor_test.go:260: 0.01s: read 3590000 keys     reading at 1860162.30 keys/s    last read /Table/183/1/93/7/-147/13/0/1692124099.559740678,1
    restore_data_processor_test.go:266: stats: (interface (dir, seek, step): (fwd, 30123, 3589999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 30123, 3589999), (rev, 0, 0)),
        (internal-stats: (block-bytes: (total 115 M, cached 0 B)), (points: (count 3.6 M, key-bytes 74 M, value-bytes 180 M, tombstoned: 0))
sumeerbhola commented 1 year ago

IIRC, this is what the externalOnly flag was supposed to help avoid in 22.2

Do you mean forwardOnly? I see the following code in 22.2, but it's unclear why it is talking about already-exhausted sstables, since the mergingIter doesn't look at the flag and simply seeks the underlying sstable iterator.

    // Check for another TrySeekUsingNext optimization opportunity, currently
    // specifically tailored to external iterators. This case is intended to
    // trigger in instances of Seek-ing with monotonically increasing keys with
    // Nexts interspersed. At the time of writing, this is the case for
    // CockroachDB scans. This optimization is important for external iterators
    // to avoid re-seeking within an already-exhausted sstable. It is not always
    // a performance win more generally, so we restrict it to external iterators
    // that are configured to only use forward positioning operations.
    //
    // TODO(jackson): This optimization should be obsolete once we introduce and
    // use the NextPrefix iterator positioning operation.
    if seekInternalIter && i.forwardOnly && lastPositioningOp != invalidatedLastPositionOp &&
        i.pos == iterPosCurForward && !hasPrefix && i.iterValidityState == IterValid &&
        i.cmp(key, i.iterKey.UserKey) > 0 {
        flags = flags.EnableTrySeekUsingNext()
        if invariants.Enabled && flags.TrySeekUsingNext() && !i.forceEnableSeekOpt && disableSeekOpt(key, uintptr(unsafe.Pointer(i))) {
            flags = flags.DisableTrySeekUsingNext()
        }
    }

Is it possible once we exhaust one of the files, we're suffering index block seeks on all subsequent keys?

Your suspicion was correct. This is indeed what is happening: The larger sstable has the following index blocks:

 445734675  index (153981)
 445888661  index (114723)
 446003389  top-index (61)

During the slow periods where we are reading ~40 keys in 10s, we are doing ~40 seeks. 40*100KB = 4MB, which is roughly what is being read during that interval.

The first and last key in the larger sstable are:

/Table/177/1/73/1/54/0/1692124094.525391784,0#0,SET
/Table/183/1/93/7/-2546/12/0/1692124098.042275933,0#0,SET

and in the smaller sstable:

/Table/183/1/33/5/-2281/8/0/1692124082.478596135,0#0,SET
/Table/183/1/99/10/-2/9/0/1692124106.331779007,1#0,SET

Is this problem adequately addressed with the pebble.Iterator.NextPrefix implementation for 23.1, or are there other use cases in CockroachDB that will explicitly call SeekGE repeatedly?

jbowens commented 1 year ago

Sorry, yeah, that's the one I meant. I think the intention was to hit this case in the single-level iterator: https://github.com/cockroachdb/pebble/blob/2c3ac2dcf94b64460b9ff29b996d7cca1b2fe910/sstable/reader.go#L677-L684C4

But I don't think it works whatsoever for two-level iterators: https://github.com/cockroachdb/pebble/blob/2c3ac2dcf94b64460b9ff29b996d7cca1b2fe910/sstable/reader.go#L1598-L1605

sumeerbhola commented 1 year ago

But I don't think it works whatsoever for two-level iterators:

I'll try to fix this in a backportable way next week.

renatolabs commented 1 year ago

Fixed by #109862.