etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License

Support customizing the rebalance threshold #422

Open cenkalti opened 1 year ago

cenkalti commented 1 year ago

https://github.com/etcd-io/bbolt/blob/838586ae65ec4abd280c073af31bfdf4d1c5b454/node.go#L420-L424

How was this number chosen?

What happens if we make it configurable and increase or decrease it?
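For reference, the linked check in node.go is roughly the following. This is a standalone sketch; the function name and parameters are illustrative, not bbolt's actual internals:

```go
package main

import "fmt"

// needsRebalance sketches the linked node.go check: a node is a candidate
// for rebalancing only when its serialized size falls to a quarter of the
// page size or below, or it no longer has the minimum number of keys.
func needsRebalance(nodeSize, pageSize, keyCount, minKeys int) bool {
	threshold := pageSize / 4 // the hard-coded 25% this issue asks about
	return nodeSize <= threshold || keyCount <= minKeys
}

func main() {
	fmt.Println(needsRebalance(1000, 4096, 10, 1)) // below 1024 bytes: true
	fmt.Println(needsRebalance(2048, 4096, 10, 1)) // above threshold: false
}
```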

ahrtr commented 1 year ago

The intention might have been to set it to 50% * DefaultFillPercent (50% * 0.5 = 25%)? cc @benbjohnson

ahrtr commented 1 year ago

I think we can add a FillPercent field (defaulting to 0.5) to both Options and DB. It would be the default value for Bucket.FillPercent; users can still set a different value after opening a bucket if they want.

// FillPercent sets the threshold for filling nodes when they split. By default,
// the bucket will fill to 50%, but it can be useful to increase this
// amount if you know that your write workloads are mostly append-only.
FillPercent float64

The rebalance threshold should then be 50% * FillPercent:

https://github.com/etcd-io/bbolt/blob/bd7d6e9f18bc139836feccb576339fbb8254fb41/node.go#L375
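A minimal sketch of that proposal, assuming the threshold becomes 0.5 * FillPercent of the page size (names are illustrative, not bbolt's API):

```go
package main

import "fmt"

// rebalanceThreshold sketches the proposed derivation: instead of a
// hard-coded pageSize/4, the threshold is computed from the bucket's
// FillPercent. With the default FillPercent of 0.5 this reproduces
// today's behavior exactly.
func rebalanceThreshold(fillPercent float64, pageSize int) int {
	return int(0.5 * fillPercent * float64(pageSize))
}

func main() {
	fmt.Println(rebalanceThreshold(0.5, 4096)) // 1024, same as pageSize/4
	fmt.Println(rebalanceThreshold(1.0, 4096)) // 2048 for a fully packed bucket
}
```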

cenkalti commented 1 year ago

Can we make rebalance threshold configurable that defaults to 25% (to keep it backwards compatible)?

It can be set on the Bucket similar to FillPercent field.

type Bucket struct {

    // Sets the threshold for filling nodes when they split. By default,
    // the bucket will fill to 50% but it can be useful to increase this
    // amount if you know that your write workloads are mostly append-only.
    //
    // This is non-persisted across transactions so it must be set in every Tx.
    FillPercent float64

    // new field
    RebalancePercent float64
}

I think this is a simpler change to make.
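A sketch of how the proposed field might be used, modeled with a local struct since RebalancePercent does not exist in bbolt today (all names here are hypothetical). Like FillPercent, it would not be persisted, so callers would set it in every transaction:

```go
package main

import "fmt"

// Bucket is a local stand-in for the proposed shape; RebalancePercent is
// the hypothetical new field, defaulting to 0.25 for backward compatibility.
type Bucket struct {
	FillPercent      float64
	RebalancePercent float64
}

func newBucket() *Bucket {
	return &Bucket{FillPercent: 0.5, RebalancePercent: 0.25}
}

func main() {
	b := newBucket()
	// An append-only caller such as etcd could opt into aggressive packing
	// and eager merging at the start of each transaction.
	b.FillPercent = 1.0
	b.RebalancePercent = 0.9
	fmt.Println(b.RebalancePercent * 4096) // byte threshold for a 4 KiB page
}
```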

ahrtr commented 1 year ago

> Can we make rebalance threshold configurable that defaults to 25% (to keep it backwards compatible)?

More flexibility doesn't always mean better. I think we'd better not expose a RebalancePercent parameter. Usually a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the d-key minimum for non-root nodes, where d is half the maximum number of keys.

Similarly, I think it makes sense to set the rebalancePercent to 0.5 * FillPercent.

cenkalti commented 1 year ago

I am trying to maximize page utilization for etcd's use case.

Compaction deletes revisions from the "key" bucket, but page nodes do not get rebalanced until the amount of data in a node goes below 25%.

Compaction deletes revisions up to a given revision number, but it skips a revision if it is the only revision of a key.

Since revision numbers are incremental, there is no point in having free space in any page other than the last page at the end of the B+ tree.

If a page becomes empty, it gets removed from the tree and added to the freelist. However, half-filled pages can stay around forever, wasting disk and memory space.

If my understanding is correct and we can solve this issue, we can eliminate the need for defrag completely.

cenkalti commented 1 year ago

Write keys to etcd to fill roughly 3.5 pages. For the KV size below, one page holds up to ~60 items.

import etcd3

client = etcd3.client()

key = "mykey"
val = "." * 10

# put a few unique keys
for i in range(30):
    client.put(key=key + str(i), value=val)

# overwrite the same key; each put creates a new revision
for i in range(180):
    client.put(key=key, value=val)

After writing data, pages are:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        leaf       62
3        leaf       61
4        leaf       63
5        leaf       24
6        branch     4
7        leaf       10
8        free
9        free
10       free

Pages of “key” bucket:

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=7>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     12
Checksum:   6e26f9e5b4162263

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=10>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     11
Checksum:   03270b449e68cf59

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 7
Page ID:    7
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 368
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=6,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 4

00000000000000025f0000000000000000: <pgid=2>
00000000000000405f0000000000000000: <pgid=4>
000000000000007f5f0000000000000000: <pgid=3>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 2, 4, 3, 5 have 62, 63, 61, 24 items respectively. The first 30 keys are unique; the remaining 180 writes are revisions of the same key.

Get latest revision:

% bin/etcdctl get '/' -w json | jq .header.revision
211

Compact to the middle of the 3rd page:

% etcdctl compact 150
compacted revision 150

After compaction “key” bucket contains these pages:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        free
3        free
4        free
5        leaf       24
6        leaf       10
7        leaf       30
8        free
9        leaf       38
10       branch     3
11       free
% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=11>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     18
Checksum:   13f6eecf472396c6

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=6>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     19
Checksum:   1ef4a4bd5af5d6fa

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 369
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=10,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 10
Page ID:    10
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 3

00000000000000025f0000000000000000: <pgid=7>
00000000000000965f0000000000000000: <pgid=9>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 7, 9, 5 have 30, 38, 24 items respectively.

         1st page  2nd page  3rd page  4th page
Before   62        63        61        24
After    30        deleted   38        24

The 30 unique keys in the 1st page will never get deleted if they don’t get a new revision because etcd never deletes the latest revision of the key.

Since revision numbers are incremental, the items in the first page will occupy that page forever unless its utilization goes below 25%, which is not a configurable value.

ahrtr commented 1 year ago

The compacted/removed keys (revisions in etcd) may be scattered across the db pages, but the key point is that etcd always adds new K/Vs at the end of the last page. So for etcd, we should set a high value for both FillPercent (e.g. 1.0) and RebalancePercent (e.g. 0.9); in this way, we can eventually get rid of defragmentation in etcd 5.0+. Note that we still need defragmentation during the transition stage, because some middle pages will never be rebalanced if nothing is deleted from them.

The downside is that there will be more rebalance operations, and accordingly more page writes & syncs, in each etcd compaction. So we need to evaluate the performance impact.

The other concern is that there are other buckets (e.g. lease, auth*) which are not append-only. Since K8s doesn't use auth, that isn't a problem. I am not sure about the exact usage of lease in K8s, e.g. how often leases are created & deleted. Please get this clarified as well.

Note: it only makes sense to set high values for FillPercent and RebalancePercent for append-only use cases (specifically, only adding K/Vs at the tail), e.g. etcd.
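To illustrate why a high FillPercent suits tail-append workloads, here is a toy model (not bbolt's actual split logic, and it ignores per-page headers and element overhead): when a node overflows, bbolt fills each split node up to FillPercent * pageSize, so a higher FillPercent packs appended data into fewer, fuller pages.

```go
package main

import "fmt"

// pagesAfterSplits estimates how many pages a purely appended key range
// occupies if each split node is filled to FillPercent * pageSize.
func pagesAfterSplits(totalBytes, pageSize int, fillPercent float64) int {
	fill := int(fillPercent * float64(pageSize))
	return (totalBytes + fill - 1) / fill // ceiling division
}

func main() {
	fmt.Println(pagesAfterSplits(40960, 4096, 0.5)) // 20 half-empty pages
	fmt.Println(pagesAfterSplits(40960, 4096, 1.0)) // 10 fully packed pages
}
```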

ahrtr commented 1 year ago

Note that it still makes sense to set the rebalance percent to 0.5 * FillPercent by default, just as my previous comment https://github.com/etcd-io/bbolt/issues/422#issuecomment-1522977076 mentioned. We just need to support customizing it so that users (e.g. etcd) can set a different value based on their scenario.

cenkalti commented 1 year ago

After #518 is merged, users are able to set rebalance percent to values between 0.0 and 0.5.

However, it's still not possible to set it to values larger than 0.5.

Can we consider again introducing a new field that is set to 0.25 by default?

type Bucket struct {
    FillPercent float64

    // new field
    RebalancePercent float64
}
ahrtr commented 1 year ago

If you read my comments above and the new title of this issue, "Support customizing the rebalance threshold", you will see that I am not against the proposal. But please think about the concerns raised in the comments above.

cenkalti commented 1 year ago

Cool. I'll try to come up with a way to evaluate the performance impact.

ahrtr commented 6 months ago

@cenkalti are you still working on this? Please feel free to let me know if you don't have bandwidth, then I am happy to take over.

cenkalti commented 6 months ago

No, I am not working on this anymore.

ahrtr commented 5 months ago

It seems that it isn't that simple. Setting a bigger RebalancePercent triggers rebalancing more easily, and accordingly a node may be merged with its next or previous sibling more often. But the merged node may be split again if we don't choose reasonable RebalancePercent and FillPercent values: https://github.com/etcd-io/bbolt/blob/7eb39a611dd95d7108124037c7f346b87e1662d3/node.go#L230-L246

The default values for FillPercent and RebalancePercent are 50% and 25% respectively, so a merged node/page is at most 75% of pageSize, and a merged node will never be split again within the current transaction by default. But if we set a bigger RebalancePercent, a merged node may be split again.
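The bound can be made concrete (a sketch of the arithmetic above, under the assumption that a merge candidate is below RebalancePercent * pageSize and the sibling it merges with holds at most FillPercent * pageSize):

```go
package main

import "fmt"

// mergedUpperBound returns the worst-case size of a merged node under the
// assumption stated above: merge candidate < RebalancePercent * pageSize,
// sibling <= FillPercent * pageSize. If the result exceeds pageSize, the
// merged node may immediately split again.
func mergedUpperBound(rebalancePercent, fillPercent float64, pageSize int) int {
	return int((rebalancePercent + fillPercent) * float64(pageSize))
}

func main() {
	fmt.Println(mergedUpperBound(0.25, 0.5, 4096))        // 3072: 75% of a page, safe
	fmt.Println(mergedUpperBound(0.9, 1.0, 4096) > 4096)  // true: may re-split
}
```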

tjungblu commented 5 months ago

what do you guys want to optimize for? tighter packing of pages? fewer memory allocations?

ahrtr commented 5 months ago

> eventually get rid of defragmentation

The goal was to eventually get rid of defragmentation, see https://github.com/etcd-io/bbolt/issues/422#issuecomment-1555528594.

tjungblu commented 5 months ago

Ah, I see. So we have different thresholds, we wanted to make them configurable, and now you've found a new parameter that needs tweaking alongside them :)