cenkalti opened 1 year ago
The intention might have been to set it to 50% * `DefaultFillPercent` (50% * 0.5 = 25%)? cc @benbjohnson
I think we can add a `FillPercent` field (defaulting to 0.5) to both `Options` and `DB`. It would serve as the default value for `Bucket.FillPercent`; users can still set a different value after opening a bucket if they want.
```go
// FillPercent sets the threshold for filling nodes when they split. By default,
// the bucket will fill to 50%, but it can be useful to increase this
// amount if you know that your write workloads are mostly append-only.
FillPercent float64
```
The rebalance threshold would then be 50% * `FillPercent`:

https://github.com/etcd-io/bbolt/blob/bd7d6e9f18bc139836feccb576339fbb8254fb41/node.go#L375
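For reference, the check at the linked line looks like this (paraphrased from `node.rebalance()`; the 25% is hard-coded as a quarter of the page size):

```go
// Current behavior in node.rebalance(): skip rebalancing while the node
// still holds more than a quarter of a page and has enough keys.
var threshold = n.bucket.tx.db.pageSize / 4
if n.size() > threshold && len(n.inodes) > n.minKeys() {
	return
}
```

Under this proposal the threshold would instead be computed as something like `int(0.5 * fillPercent * float64(n.bucket.tx.db.pageSize))`.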
Can we make the rebalance threshold configurable, defaulting to 25% (to keep it backwards compatible)? It could be set on the `Bucket`, similar to the `FillPercent` field.
```go
type Bucket struct {
	// Sets the threshold for filling nodes when they split. By default,
	// the bucket will fill to 50%, but it can be useful to increase this
	// amount if you know that your write workloads are mostly append-only.
	//
	// This is non-persisted across transactions so it must be set in every Tx.
	FillPercent float64

	// new field
	RebalancePercent float64
}
```
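To make the proposal concrete, a hypothetical usage sketch (the `RebalancePercent` field does not exist in bbolt today; the values are illustrative for an append-mostly workload):

```go
err := db.Update(func(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("key"))
	// Both fields are per-Tx state, so they must be set in every transaction.
	b.FillPercent = 1.0      // pack pages fully: writes are append-only
	b.RebalancePercent = 0.9 // hypothetical: rebalance aggressively on deletes
	return b.Delete([]byte("some-old-revision"))
})
```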
I think this is a simpler change to make.
> Can we make the rebalance threshold configurable, defaulting to 25% (to keep it backwards compatible)?
More flexibility doesn't necessarily mean better. I think we'd better not expose a `RebalancePercent` parameter. Usually a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the d-key minimum for non-root nodes, where d is half the maximum number of keys. Similarly, I think it makes sense to set the `rebalancePercent` to 0.5 * `FillPercent`.
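Making that analogy explicit: in a classic B-tree of order $2d$, non-root nodes must keep at least $d$ keys, i.e. half the maximum. Carrying that ratio over to bbolt's byte-based thresholds gives

$$\text{rebalanceThreshold} = 0.5 \times \text{FillPercent} \times \text{pageSize},$$

which with the default `FillPercent` of 0.5 recovers the existing hard-coded threshold of $0.25 \times \text{pageSize}$.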
I am trying to maximize page utilization for etcd's use case.
Compaction deletes revisions from the "key" bucket, but page nodes do not get rebalanced until the amount of data in a page falls below 25% of the page size.
Compaction deletes revisions up to a given revision number, but it skips a revision if it is the only revision of a key.
Because revision numbers are monotonically increasing, new keys are only ever inserted at the end of the B+ tree, so free space in any page other than the last one is never reused.
If a page is empty, it gets removed from the tree and added to the freelist. However, these half-filled pages could stay forever and waste disk and memory space.
If my understanding is correct and we can solve this issue, we can eliminate the need to defrag completely.
Write keys to etcd to fill roughly 3.5 pages. For the KV size below, one page holds up to ~60 items.
```python
import etcd3

client = etcd3.client()
key = "mykey"
val = "." * 10

# put a few unique keys
for i in range(30):
    client.put(key=key + str(i), value=val)

# overwrite the same key
for i in range(180):
    client.put(key=key, value=val)
```
After writing data, pages are:
```
% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        leaf       62
3        leaf       61
4        leaf       63
5        leaf       24
6        branch     4
7        leaf       10
8        free
9        free
10       free
```
Pages of “key” bucket:
```
% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=7>
Freelist: <pgid=18446744073709551615>
HWM: <pgid=11>
Txn ID: 12
Checksum: 6e26f9e5b4162263

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=10>
Freelist: <pgid=18446744073709551615>
HWM: <pgid=11>
Txn ID: 11
Checksum: 03270b449e68cf59

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 7
Page ID: 7
Page Type: leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 368
Item Count: 10
"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=6,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID: 6
Page Type: branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 4
00000000000000025f0000000000000000: <pgid=2>
00000000000000405f0000000000000000: <pgid=4>
000000000000007f5f0000000000000000: <pgid=3>
00000000000000bc5f0000000000000000: <pgid=5>
```
Pages 2, 4, 3, 5 have 62, 63, 61, and 24 items, respectively (210 items in total, matching the 30 + 180 puts). The first 30 keys are unique; the rest (180 keys) are revisions of the same key.
Get latest revision:
```
% bin/etcdctl get '/' -w json | jq .header.revision
211
```
Compact to the middle of the 3rd page:
```
% etcdctl compact 150
compacted revision 150
```
After compaction, the “key” bucket contains these pages:
```
% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        free
3        free
4        free
5        leaf       24
6        leaf       10
7        leaf       30
8        free
9        leaf       38
10       branch     3
11       free
```
```
% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=11>
Freelist: <pgid=18446744073709551615>
HWM: <pgid=12>
Txn ID: 18
Checksum: 13f6eecf472396c6

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=6>
Freelist: <pgid=18446744073709551615>
HWM: <pgid=12>
Txn ID: 19
Checksum: 1ef4a4bd5af5d6fa

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID: 6
Page Type: leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 369
Item Count: 10
"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=10,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 10
Page ID: 10
Page Type: branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 3
00000000000000025f0000000000000000: <pgid=7>
00000000000000965f0000000000000000: <pgid=9>
00000000000000bc5f0000000000000000: <pgid=5>
```
Pages 7, 9, 5 have 30, 38, and 24 items, respectively.
| 1st page | 2nd page | 3rd page | 4th page |
|---|---|---|---|
| 62 | 63 | 61 | 24 |
| 30 | deleted | 38 | 24 |
The 30 unique keys in the 1st page will never get deleted if they don't get a new revision, because etcd never deletes the latest revision of a key.
Since revision numbers are incremental, the items in the first page will occupy that page's space forever unless its utilization drops below 25%, which is not a configurable value.
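A rough back-of-the-envelope check of why the 1st page above is stuck (the item size is an estimate from the dumps: ~62 items filled a 4096-byte page, so ~65 bytes per item including bbolt's 16-byte leaf element header):

```go
package main

import "fmt"

func main() {
	pageSize := 4096
	threshold := pageSize / 4 // bbolt's hard-coded rebalance threshold: 1024 bytes
	itemSize := 65            // estimated bytes per item, from ~62 items/page above
	items := 30               // surviving unique keys in the 1st page

	used := items * itemSize // ≈ 1950 bytes
	fmt.Printf("used=%d threshold=%d rebalance=%v\n",
		used, threshold, used < threshold) // rebalance=false, forever
}
```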
The compacted/removed keys (revisions in etcd) may be scattered anywhere in the db pages, but the key point is that etcd always adds new K/V pairs at the end of the last page. So for etcd, we should set a high value for both `FillPercent` (e.g. 1.0) and `RebalancePercent` (e.g. 0.9); this way, we can eventually get rid of defragmentation in etcd 5.0+. Note we still need defragmentation in the transition stage, because some middle pages will never be rebalanced if there are no deletions in them.
The downside is that there will be more rebalance operations and accordingly more page writing & syncing in each etcd compaction operation, so we need to evaluate the performance impact.
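One rough way to measure it, as a sketch: fill a bucket append-only, then time a large deletion transaction, and repeat under a patched build with different rebalance thresholds. This uses the real bbolt API, except for the hypothetical `RebalancePercent` noted in the comment:

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("bench.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Fill phase: keys in ascending order, pages packed fully.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("key"))
		if err != nil {
			return err
		}
		b.FillPercent = 1.0
		for i := 0; i < 100000; i++ {
			if err := b.Put([]byte(fmt.Sprintf("%016d", i)), []byte("value")); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		panic(err)
	}

	// Compaction-like phase: delete a prefix of old keys and time the commit.
	// On a patched build, set the hypothetical b.RebalancePercent here and
	// compare commit latencies across different values.
	start := time.Now()
	if err := db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		for i := 0; i < 50000; i++ {
			if err := b.Delete([]byte(fmt.Sprintf("%016d", i))); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		panic(err)
	}
	fmt.Println("delete tx took", time.Since(start))
}
```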
The other concern is that there are other buckets (e.g. lease, auth*) which are not append-only. Since K8s doesn't use auth, that isn't a problem. I am not sure about the exact usage of lease in K8s, e.g. how often leases are created & deleted. Please get this clarified as well.
Note: it only makes sense to set a high value for `FillPercent` and `RebalancePercent` for append-only use cases (specifically, where K/V pairs are only added at the tail), e.g. etcd.

Note it still makes sense to set the `rebalancePercent` to 0.5 * `FillPercent` by default, just as my previous comment https://github.com/etcd-io/bbolt/issues/422#issuecomment-1522977076 mentioned. We just need to support customizing it so that users (e.g. etcd) can set a different value based on their scenarios.
After #518 is merged, users are able to set the rebalance percent to values between 0.0 and 0.5.
However, it's still not possible to set it to values larger than 0.5.
Can we consider again introducing a new field that defaults to 0.25?
```go
type Bucket struct {
	FillPercent float64

	// new field
	RebalancePercent float64
}
```
If you read my comment above and the new title of this issue, "Support customizing the rebalance threshold", you will see that I am not against the proposal. But please think about the concern in the comment above.
Cool. I'll try to come up with a way to evaluate the performance impact.
@cenkalti are you still working on this? Please feel free to let me know if you don't have the bandwidth; I'd be happy to take over.
No, I am not working on this anymore.
It seems that it isn't that simple. Setting a bigger `RebalancePercent` triggers rebalancing more easily, and accordingly a node may be merged with the next or previous node more easily. But the merged node may be split again if we don't choose reasonable values for `RebalancePercent` and `FillPercent`:

https://github.com/etcd-io/bbolt/blob/7eb39a611dd95d7108124037c7f346b87e1662d3/node.go#L230-L246

The default values for `FillPercent` and `RebalancePercent` are 50% and 25% respectively, so a merged node/page is at most 75% of pageSize, and a merged node will never be split again in the current transaction by default. But if we set a bigger `RebalancePercent`, it may cause a merged node to be split again.
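Spelling that arithmetic out, with node sizes expressed as fractions of pageSize (and ignoring page headers):

$$\text{mergedSize} \le \text{RebalancePercent} + \text{FillPercent}$$

With the defaults this is $0.25 + 0.5 = 0.75 < 1$, so a merged node never needs an immediate re-split; with the aggressive settings suggested above it is $0.9 + 1.0 = 1.9 > 1$, so a merge can trigger a split within the same transaction.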
What do you guys want to optimize for? Tighter packing of pages? Fewer memory allocations?
> eventually get rid of defragmentation

The goal was to eventually get rid of defragmentation; see https://github.com/etcd-io/bbolt/issues/422#issuecomment-1555528594.
Ah I see, so we have different thresholds and then we wanted to configure them and now you've found a new parameter that needs tweaking alongside it :)
https://github.com/etcd-io/bbolt/blob/838586ae65ec4abd280c073af31bfdf4d1c5b454/node.go#L420-L424

How was this number chosen? What happens if we make it configurable and increase or decrease it?