cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

perf: support user-defined "guards" for sstable boundaries #517

Open petermattis opened 4 years ago

petermattis commented 4 years ago

When flushing and compacting sstables, Pebble needs to decide how to break up the resulting output into sstables. We currently use two signals: sstable size, and the boundaries of sstables in the "grandparent" level (i.e. the level below the output level). These signals are reasonable for randomly distributed writes, but can cause problems for skewed writes. Consider the following LSM structure:

L0 a------------------------z
L1 ab de gh jk mn pq st vw yz

L0 contains a single sstable spanning [a,z]. L1 contains 9 sstables. A compaction from L0 to L1 will need to include all of the L1 sstables. It is possible that L0 only contains 2 keys: a and z. This problem is easy to construct if a and z are written to the memtable near each other in time and then flushed, as flushing currently always produces a single sstable.

This situation appears to arise in practice within CRDB due to the presence of "local" vs "global" keys. Most operations involve only "global" keys. When a "local" key is operated on, it will end up generating an L0 sstable that spans a large swath of the global key space.

The idea behind "guards" is to allow the user (i.e. CockroachDB) some control over sstable boundaries. CockroachDB would define guards at the boundary between the local and global keyspace. It may also define guards at the boundary between SQL tables. Such guards would ensure that we're segregating sstables along the guard boundaries so that an L0 sstable can't cover a huge fraction of the key space.

Note that "guards" would almost certainly be specified via a callback and not upfront. That is, we'd want to allow the user to specify a callback which can return true if there should be an sstable boundary between two user keys.
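
For illustration, here is a minimal Go sketch of what a callback-based guard option could look like. GuardPolicy, localPrefix, and crdbGuards are hypothetical names invented for this sketch; nothing like them exists in Pebble today.

```go
package guards

import "bytes"

// GuardPolicy is a hypothetical callback: given two adjacent user keys that a
// flush or compaction is about to place in the same output sstable, it returns
// true if an sstable boundary should be forced between them.
type GuardPolicy func(prevUserKey, nextUserKey []byte) bool

// localPrefix stands in for CockroachDB's "local" keyspace prefix in this
// illustration; the real local/global split is more involved.
var localPrefix = []byte{0x01}

// crdbGuards forces a split whenever adjacent output keys straddle the
// local/global boundary.
func crdbGuards(prev, next []byte) bool {
	prevLocal := bytes.HasPrefix(prev, localPrefix)
	nextLocal := bytes.HasPrefix(next, localPrefix)
	return prevLocal != nextLocal
}
```

In such a scheme, flushes and compactions would consult the callback alongside the existing size and grandparent-boundary signals when deciding where to end an output sstable.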

tbg commented 4 years ago

This situation appears to arise in practice within CRDB due to the presence of "local" vs "global" keys. Most operations involve only "global" keys. When a "local" key is operated on, it will end up generating an L0 sstable that spans a large swath of the global key space.

Most operations involve a local and a global key, right? Every raft command updates LeaseAppliedState. (This doesn't change anything about the problem or your suggested solution).

petermattis commented 4 years ago

Most operations involve a local and a global key, right? Every raft command updates LeaseAppliedState. (This doesn't change anything about the problem or your suggested solution).

Yes. How much a change here would affect normal operation is a different question. Sstables below L0 are partitioned, and I would expect that partitioning to reduce the impact of wide-spanning sstables. Partitioning L0 at the local/global boundary would allow concurrent L0->Lbase compactions in situations where they are currently blocked. This is something @sumeerbhola has mentioned recently. I think this is a fruitful area for exploration, but I'm tempering my hopes.

jbowens commented 3 years ago

Would it make sense to define guards at range boundaries, or are those too ephemeral? The problem of excessively wide sstables also applies to lower levels. In the large bank import I just did, one node had much higher disk write I/O, which was sustained well after the IMPORT finished.

[Screenshot, 2020-08-13: disk write I/O graph showing sustained elevated write I/O on one node]

Here's the LSM of a 'normal' node:


__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    22 M       -    19 G       -       -       -       -    20 G       -       -       -     1.0
      0         0     0 B    0.00    20 G    60 M      15     0 B       0   371 M     987     0 B       0     0.0
      1        21    62 M    0.97   347 M     0 B       0   1.1 M       1    18 G   5.2 K    18 G       1    53.1
      2        31   153 M    0.23   164 M     0 B       0   896 K       4   516 M     112   528 M       1     3.1
      3         7   420 M    0.06   957 K   2.0 G      35   810 K       1   1.5 M       1   1.7 M       1     1.6
      4       197   3.2 G    0.05   1.6 G   1.6 G      35   142 K       1   2.9 G     198   2.9 G       1     1.8
      5     11819   691 G    1.00    16 M   4.3 T    75 K    91 M       6    16 M       1    16 M       1     1.0
      6     81616   6.9 T       -   3.6 T   3.4 T    75 K    91 M       7   6.4 T    70 K   6.4 T       1     1.8
  total     93691   7.6 T       -   7.7 T   7.7 T   150 K   186 M      20    14 T    76 K   6.5 T       6     1.8
  flush       373
compact     63999    89 G          (size == estimated-debt)
 memtbl         1    64 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     483 K   7.5 G   41.0%  (score == hit-rate)
 tcache      68 K    40 M   99.9%  (score == hit-rate)
 titers         4
 filter         -       -   88.7%  (score == utility)

A little less than half of the data was ingested directly into L6.

Here's the LSM of the node with higher write I/O:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    84 K       -   8.0 G       -       -       -       -   8.1 G       -       -       -     1.0
      0         3   100 K    2.05   8.1 G   504 M     856     0 B       0   222 M   1.8 K     0 B       2     0.0
      1        13    36 M    0.98   208 M     0 B       0   1.7 M       1   8.8 G   2.4 K   8.9 G       1    43.1
      2         8    52 M    0.09    58 M     0 B       0   1.6 M       1   138 M      28   143 M       1     2.4
      3         3    65 M    0.03     0 B    62 M       2   2.3 M       1     0 B       0     0 B       1     0.0
      4     12405   642 G    3.54    60 M   4.1 T    73 K   120 M       4   108 M      29   108 M       1     1.8
      5     44371   1.8 T    3.54   2.7 T   3.6 T    79 K   828 G    14 K   4.8 T    95 K   4.8 T       1     1.8
      6     49079   5.3 T       -   5.3 T    25 G     556    56 M      15   9.6 T   107 K   9.6 T       1     1.8
  total    105882   7.7 T       -   7.8 T   7.8 T   153 K   828 G    14 K    22 T   206 K    14 T       8     2.8
  flush       703
compact    119148   7.6 T          (size == estimated-debt)
 memtbl         1    64 M
zmemtbl         2   128 M
   ztbl        23   1.1 G
 bcache     478 K   7.4 G   36.9%  (score == hit-rate)
 tcache      65 K    38 M   99.8%  (score == hit-rate)
 titers         9
 filter         -       -   91.4%  (score == utility)

Only 25 G were ingested into L6, forcing the rest of the ingestions into L5 and L4. More than 12 hours after the IMPORT finished, this node is still compacting to reshape the LSM.

jbowens commented 3 years ago

Maybe user-defined guards aren't a solution for that scenario.

We could try to dynamically detect these low-level sstables that are very broad relative to the distribution of keys in higher levels and split them up, anticipating that doing so will allow cheap move compactions.
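
For concreteness, here is a rough sketch of the kind of detection heuristic that suggests, using simplified, illustrative file metadata (fileMeta, tooBroad, and maxOverlaps are not Pebble types or knobs):

```go
package broadness

import "bytes"

// fileMeta is a simplified stand-in for an sstable's metadata.
type fileMeta struct {
	Smallest, Largest []byte // user-key bounds of the file
}

// overlaps reports whether two user-key ranges intersect.
func overlaps(a, b fileMeta) bool {
	return bytes.Compare(a.Smallest, b.Largest) <= 0 &&
		bytes.Compare(b.Smallest, a.Largest) <= 0
}

// tooBroad flags a low-level file whose key range overlaps more than
// maxOverlaps files in the level above it. Such a file could be marked for a
// splitting compaction so that future ingestions or cheap move compactions
// can slot around the resulting narrower pieces.
func tooBroad(f fileMeta, levelAbove []fileMeta, maxOverlaps int) bool {
	n := 0
	for _, g := range levelAbove {
		if overlaps(f, g) {
			n++
			if n > maxOverlaps {
				return true
			}
		}
	}
	return false
}
```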

petermattis commented 3 years ago

Only 25 G were ingested into L6, forcing the rest of the ingestions into L5 and L4. More than 12 hours after the IMPORT finished, this node is still compacting to reshape the LSM.

The last time I saw something similar happen, the problem was an sstable containing an overly wide range tombstone ingested into L6 that was preventing future ingestions into L6. See https://github.com/cockroachdb/cockroach/issues/44048.

jbowens commented 3 years ago

I confirmed that there are range tombstones in L6, although I'm not sure how to confirm their breadth.

                  L0          L1          L2          L3         L4          L5          L6           TOTAL
count             7           15          16          3          20547       41998       57654        120240
seq num
  smallest        19611449    1097131     274394      13619      141546      8254        5899         5899
  largest         19613557    19611446    19352573    2644259    19613558    19613559    19611448     19613559
size
  data            192 K       27 M        120 M       64 M       482 G       1.7 T       6.1 T        8.3 T
    blocks        20          1966        8628        2531       18984467    68036424    246584115    333618151
  index           674 B       74 K        329 K       77 K       560 M       2.0 G       7.1 G        9.6 G
    blocks        7           15          16          3          20547       41998       57654        120240
    top-level     0 B         0 B         0 B         0 B        0 B         0 B         0 B          0 B
  filter          1.7 K       1.7 M       4.4 M       802 K      5.4 G       20 G        638 K        25 G
  raw-key         34 K        43 M        122 M       17 M       117 G       421 G       1.5 T        2.0 T
  raw-value       503 K       35 M        192 M       64 M       474 G       1.7 T       6.0 T        8.1 T
records
  set             706         280 K       1.9 M       621 K      4.7 G       17 G        61 G         82 G
  delete          632         1.5 M       2.8 M       53 K       12 K        1.0 K       5.2 K        4.4 M
  range-delete    3           1.4 K       7.2 K       3          12 K        8           5.2 K        26 K
  merge           0           0           0           0          0           0           0            0

petermattis commented 3 years ago

I confirmed that there are range tombstones in L6, although I'm not sure how to confirm their breadth.

You can use debug pebble sstable layout -v to output the range tombstones and see if they extend to /Max on the upper bound.

sumeerbhola commented 3 years ago

Stepping back for a moment to the original motivation for this issue, now that we split flushes based on both L0 FlushSplitKeys and Lbase (grandparent) limits, I wonder why we need explicit guards.

We control width well inside Pebble, at least in theory, but ingested sstables are outside our control. Should we provide split points to the ingestion logic to ensure it does not produce wide sstables? Though if a wide sstable ends up being ingested to L6, we probably don't have enough info within Pebble to make it narrower in the first place. Also, if there is an sstable with a range tombstone being written to L6, isn't that tombstone now known to be useless? If so, should we mark it for immediate compaction?

petermattis commented 3 years ago

Stepping back for a moment to the original motivation for this issue, now that we split flushes based on both L0 FlushSplitKeys and Lbase (grandparent) limits, I wonder why we need explicit guards.

I'm not at all sure explicit guards are necessary.

We control width well inside Pebble, at least in theory, but ingested sstables are outside our control. Should we provide split points to the ingestion logic to ensure it does not produce wide sstables?

CRDB does split ingested sstables along the local/global key space. That is hardcoded, though. It is almost like we want an abstraction on top of sstable.Writer that incorporates the compaction splitting logic that Bilal is refactoring right now.
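
A hedged sketch of what such an abstraction might look like: a thin wrapper that rolls over to a new output whenever a caller-supplied split predicate fires between adjacent keys. The kvWriter interface, newOutput factory, and SplitWriter type below are assumptions for illustration, not Pebble APIs.

```go
package splitwriter

// kvWriter abstracts the subset of an sstable writer needed here. Pebble's
// sstable.Writer exposes Set and Close with compatible shapes, but this
// interface is an assumption for the sketch, not a Pebble API.
type kvWriter interface {
	Set(key, value []byte) error
	Close() error
}

// SplitWriter writes ordered point keys and starts a new output file whenever
// shouldSplit returns true for a pair of adjacent user keys (for example,
// crossing the local/global boundary, a SQL table boundary, or a size limit).
type SplitWriter struct {
	newOutput   func() (kvWriter, error)     // factory for the next output file
	shouldSplit func(prev, next []byte) bool // caller-supplied split predicate
	cur         kvWriter
	prevKey     []byte
}

// Set writes a key/value pair, rolling over to a new output if the split
// predicate fires between the previous key and this one.
func (w *SplitWriter) Set(key, value []byte) error {
	if w.cur != nil && w.prevKey != nil && w.shouldSplit(w.prevKey, key) {
		if err := w.cur.Close(); err != nil {
			return err
		}
		w.cur = nil
	}
	if w.cur == nil {
		var err error
		if w.cur, err = w.newOutput(); err != nil {
			return err
		}
	}
	w.prevKey = append(w.prevKey[:0], key...)
	return w.cur.Set(key, value)
}

// Close finishes the current output, if any.
func (w *SplitWriter) Close() error {
	if w.cur == nil {
		return nil
	}
	return w.cur.Close()
}
```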

Though if a wide sstable ends up being ingested to L6, we probably don't have enough info within Pebble to make it narrower in the first place. Also, if there is an sstable with a range tombstone being written to L6, isn't that tombstone now known to be useless? If so, should we mark it for immediate compaction?

This is a good point. @jbowens did add logic recently to perform elision compactions to recompact L6 files that contain tombstones so I think the mark for compaction bit is essentially done. It is possible we should also be prioritizing compactions of L6 files containing tombstones because normally they would be lower priority than other compactions and thus get starved.

jbowens commented 3 years ago

This is a good point. @jbowens did add logic recently to perform elision compactions to recompact L6 files that contain tombstones so I think the mark for compaction bit is essentially done. It is possible we should also be prioritizing compactions of L6 files containing tombstones because normally they would be lower priority than other compactions and thus get starved.

Good point. I dumped the manifest on that problematic node; L6 has a lot of range tombstones interspersed with live data.

  added:         L6 413069:134146543<#0-#18602260>[/Table/53/1/73257433095/0/1597185668.403736843,0#0,SET-/Table/53/1/73263648701/0/1597185668.403736843,0#72057594037927935,RANGEDEL] (2020-08-13T20:10:06Z)
  added:         L6 413070:88438794<#0-#18602260>[/Table/53/1/73263648701/0/1597185668.403736843,0#18602260,RANGEDEL-/Table/53/1/73264444674/0/1597185668.403736843,0#0,SET] (2020-08-13T20:10:07Z)
  added:         L6 426841:134146900<#0-#19489728>[/Table/53/1/73264444675/0/1597185668.403736843,0#0,SET-/Table/53/1/73267655394/0/1597185668.403736843,0#72057594037927935,RANGEDEL] (2020-08-13T21:17:50Z)
  added:         L6 426849:88417360<#0-#19489728>[/Table/53/1/73267655394/0/1597185668.403736843,0#19489728,RANGEDEL-/Table/53/1/73269452800/0/1597185668.403736843,0#0,SET] (2020-08-13T21:17:52Z)
  added:         L6 393674:134146858<#0-#17394540>[/Table/53/1/73269452801/0/1597185668.403736843,0#0,SET-/Table/53/1/73272663522/0/1597185668.403736843,0#72057594037927935,RANGEDEL] (2020-08-13T18:39:02Z)
  added:         L6 393675:88415810<#0-#17394540>[/Table/53/1/73272663522/0/1597185668.403736843,0#17394540,RANGEDEL-/Table/53/1/73274460906/0/1597185668.403736843,0#0,SET] (2020-08-13T18:39:04Z)

When a range rebalances away, we take a snapshot for sending and write a range tombstone, right? Maybe we're too eager in compacting these range tombstones that are incapable of reclaiming disk space because of open snapshots. The 'compensated sizes' for these sstables are weighted as if they'll drop everything, but in reality we'll just rewrite the sstables and possibly prevent cheaper delete-only compactions.

If we disincentivized these ineffectual compactions, we could cheaply drop some of the L6 sstables with delete-only compactions, but we'd still be left with the sstable containing the range tombstone. If it no longer overlaps with L6, it might be move compacted into L6 with the broad range tombstone that blocks future ingestion into L6. Like @petermattis said, the L6 elision-only compactions are low priority right now and wouldn't trigger during an IMPORT where there's lots of automatic compaction.

Maybe this isn't a huge problem, though, because if the bulk of the data was already dropped from L6, the min-overlapping-ratio heuristic will prioritize compacting into the smaller L6 range tombstone sstable as soon as there is any significant amount of overlapping data in L5. When that happens, the broad tombstone would be elided.
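
To make the "compensated sizes" point above concrete, here is a rough sketch of the idea under assumed, illustrative names (tableStats and compensatedSize are not Pebble's internals): a file's size is inflated by an estimate of the bytes its tombstones would drop, which is what makes these sstables look like attractive compaction inputs even when open snapshots prevent the tombstones from reclaiming anything.

```go
package heuristics

// tableStats is an illustrative stand-in for per-sstable statistics; the
// field names are not Pebble's internal names.
type tableStats struct {
	Size                        uint64 // physical size of the sstable
	RangeDeletionsBytesEstimate uint64 // estimated bytes beneath its range tombstones
}

// compensatedSize inflates a file's size by the bytes its tombstones are
// expected to reclaim, making the file a more attractive compaction input.
// If open snapshots keep the deleted data alive, the estimate overstates the
// benefit and the compaction merely rewrites the file.
func compensatedSize(s tableStats) uint64 {
	return s.Size + s.RangeDeletionsBytesEstimate
}
```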

petermattis commented 3 years ago

When a range rebalances away, we take a snapshot for sending and write a range tombstone, right? Maybe we're too eager in compacting these range tombstones that are incapable of reclaiming disk space because of open snapshots. The 'compensated sizes' for these sstables are weighted as if they'll drop everything, but in reality we'll just rewrite the sstables and possibly prevent cheaper delete-only compactions.

When a range rebalances away, the leaseholder replica creates a Pebble snapshot and sends the data to the new replica using the snapshot. Only after the data has been sent is one of the existing replicas considered for deletion. I don't think the scenario you described here would happen as described. But because Pebble snapshots are global, we could be seeing something like it due to snapshots being sent for other ranges on the same node.

jbowens commented 3 years ago

CockroachDB would define guards at the boundary between the local and global keyspace. It may also define guards at the boundary between SQL tables.

I wonder if guards at SQL table boundaries might help with concurrent pre-ordered IMPORT performance. If there are no overlapping sstables in higher levels, ingested sstables slot directly into L6. I think this is typical behavior for the bank workload import, because bank produces ordered rows and only IMPORTs into a single table.

As I understand it, during an ordered IMPORT, most of the data ends up being ingested. But sometimes chunks of imported data are written through normal writes if the produced sstable is smaller than kv.bulk_io_write.small_write_size. If you're importing into multiple tables, these normal writes will likely end up in L0 sstables spanning SQL table boundaries, which will force future ingested sstables into higher levels.

In this scenario, I think SQL table boundary guards would allow most ingested sstables to slot directly into L6. If the source data for a table is split into multiple files, you can still have normal writes that end up overlapping ingested sstables, but maybe in that world it's better to drop the kv.bulk_io_write.small_write_size optimization or push it into the Pebble ingest codepath.

Here's a code comment regarding that kv.bulk_io_write.small_write_size optimization:

                // If this SST is "too small", the fixed costs associated with adding an
                // SST – in terms of triggering flushes, extra compactions, etc – would
                // exceed the savings we get from skipping regular, key-by-key writes,
                // and we're better off just putting its contents in a regular batch.
                // This isn't perfect: We're still incurring extra overhead constructing
                // SSTables just for use as a wire-format, but the rest of the
                // This isn't perfect: We're still incurring extra overhead constructing
                // SSTables just for use as a wire-format, but the rest of the
                // implementation of bulk-ingestion assumes certain semantics of the
                // AddSSTable API - like ingest at arbitrary timestamps or collision
                // detection - making it simpler to just always use the same API
                // and just switch how it writes its result.
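
A simplified sketch of that size-based decision, using a hypothetical applier interface (only the kv.bulk_io_write.small_write_size threshold comes from the discussion above; everything else is illustrative, not real CockroachDB or Pebble APIs):

```go
package bulk

// kv is a single key/value pair produced by the import.
type kv struct {
	key, value []byte
}

// applier abstracts the two ways produced data can be applied; both methods
// are assumptions for this sketch.
type applier interface {
	AddSSTable(sstData []byte) error // ingest the SST file directly
	PutBatch(kvs []kv) error         // replay its contents as ordinary writes
}

// applySST mirrors the kv.bulk_io_write.small_write_size decision described
// above: an SST below the threshold is applied as regular key-by-key writes,
// since ingestion carries fixed costs (flushes, extra compactions) that
// outweigh the savings for tiny files.
func applySST(a applier, sstData []byte, kvs []kv, smallWriteSize int) error {
	if len(sstData) < smallWriteSize {
		return a.PutBatch(kvs)
	}
	return a.AddSSTable(sstData)
}
```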

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

github-actions[bot] commented 5 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

petermattis commented 5 months ago

@jbowens Any interest in keeping this issue open? Or are approaches like #2366 more likely to be implemented?

jbowens commented 5 months ago

I think it's worth keeping open for now. I can imagine a few specific uses that might benefit from explicit guards but be challenging to handle with a generalized approach like #2366.