etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Compaction does not take effect when snapshotting one big DB. #18509

Closed qsyqian closed 2 months ago

qsyqian commented 2 months ago

Bug report criteria

What happened?

While taking a snapshot of an etcd cluster, I found that compaction does not take effect.

Metrics for db and db_in_use: (screenshot attachment A36854D4-BE43-41D3-A986-41BC155E5A70)

I took a snapshot from the etcd0 member, and we can see that etcd0's compaction takes effect with a delay.

Logs for etcd0: (Note: add 8 hours to the log time to get the metrics time.)

{"level":"info","ts":"2024-08-27T06:29:00.037023Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":20931088384,"size":"21 GB"}
{"level":"info","ts":"2024-08-27T06:29:07.801208Z","caller":"etcdserver/server.go:1443","msg":"triggering snapshot","local-member-id":"236527662c6439af","local-member-applied-index":1700017,"local-member-snapshot-index":1600016,"local-member-snapshot-count":100000}
{"level":"info","ts":"2024-08-27T06:29:07.819435Z","caller":"etcdserver/server.go:2462","msg":"saved snapshot","snapshot-index":1700017}
{"level":"info","ts":"2024-08-27T06:29:07.81969Z","caller":"etcdserver/server.go:2492","msg":"compacted Raft logs","compact-index":1695017}
{"level":"info","ts":"2024-08-27T06:29:13.800068Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1597520}
{"level":"info","ts":"2024-08-27T06:29:17.435074Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1597520,"took":"3.634133646s","hash":89323745,"current-db-size-bytes":20931088384,"current-db-size":"21 GB","current-db-size-in-use-bytes":19744907264,"current-db-size-in-use":"20 GB"}
{"level":"info","ts":"2024-08-27T06:29:17.435115Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":89323745,"revision":1597520,"compact-revision":1507949}
{"level":"info","ts":"2024-08-27T06:31:00.710983Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":20931088384,"checksum-size":32}
{"level":"info","ts":"2024-08-27T06:31:00.711025Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":20931088384,"size":"21 GB","took":"2 minutes ago"}

We can see that after the compaction at 29:13, etcd0's db_inuse is still about "current-db-size-in-use-bytes":19744907264,"current-db-size-in-use":"20 GB".

Logs for etcd1:

{"level":"info","ts":"2024-08-27T06:29:13.791312Z","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":1597520,"compact-period":"5m0s"}
{"level":"info","ts":"2024-08-27T06:29:13.796022Z","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":1597520,"compact-period":"5m0s","took":"4.678284ms"}
{"level":"info","ts":"2024-08-27T06:29:13.796026Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1597520}
{"level":"info","ts":"2024-08-27T06:29:17.843893Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1597520,"took":"4.047412712s","hash":89323745,"current-db-size-bytes":21044314112,"current-db-size":"21 GB","current-db-size-in-use-bytes":10459881472,"current-db-size-in-use":"10 GB"}

Logs for etcd2:

{"level":"info","ts":"2024-08-27T06:29:13.796522Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1597520}
{"level":"info","ts":"2024-08-27T06:29:18.119962Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1597520,"took":"4.322978561s","hash":89323745,"current-db-size-bytes":20952608768,"current-db-size":"21 GB","current-db-size-in-use-bytes":10467917824,"current-db-size-in-use":"10 GB"}

For etcd1 and etcd2, after the compaction at 29:13, db_inuse drops to about "current-db-size-in-use-bytes":10467917824,"current-db-size-in-use":"10 GB".
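
The gap between total size and in-use size is the share of the file made of reclaimable free pages. A quick back-of-the-envelope sketch in Go, using the byte counts from the "finished scheduled compaction" log lines above:

```go
package main

import "fmt"

// freeFraction returns the fraction of the bbolt file that is free pages:
// (current-db-size-bytes - current-db-size-in-use-bytes) / current-db-size-bytes.
func freeFraction(dbSize, inUse int64) float64 {
	return float64(dbSize-inUse) / float64(dbSize)
}

func main() {
	// Values copied from the compaction logs above.
	fmt.Printf("etcd0: %.0f%% free\n", 100*freeFraction(20931088384, 19744907264)) // ~6% free
	fmt.Printf("etcd1: %.0f%% free\n", 100*freeFraction(21044314112, 10459881472)) // ~50% free
	fmt.Printf("etcd2: %.0f%% free\n", 100*freeFraction(20952608768, 10467917824)) // ~50% free
}
```

So after the same compaction, etcd1 and etcd2 freed about half of their files, while etcd0 freed almost nothing.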

More importantly, after this the total db size for etcd0 keeps growing. (Please ignore the db_inuse size below; it is small only because the client stopped updating and putting keys.)

endpoint status -w table
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                           ENDPOINT                            |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd0.etcd.svc:2379 | 236527662c6439af |  3.5.13 |   24 GB |     false |      false |         8 |    2187946 |            2187946 |        |
| https://etcd1.etcd.svc:2379 |  fdd1098ba1bb471 |  3.5.13 |   21 GB |      true |      false |         8 |    2187946 |            2187946 |        |
| https://etcd2.etcd.svc:2379 | 25b4ff1e9b1f0b1b |  3.5.13 |   21 GB |     false |      false |         8 |    2187946 |            2187946 |        |
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
endpoint status -w json
[{"Endpoint":"https://etcd0.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":2550488083760167343,"revision":2182271,"raft_term":8},"version":"3.5.13","dbSize":24013713408,"leader":1143088128565425265,"raftIndex":2187948,"raftTerm":8,"raftAppliedIndex":2187948,"dbSizeInUse":1077248}},{"Endpoint":"https://etcd1.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":1143088128565425265,"revision":2182271,"raft_term":8},"version":"3.5.13","dbSize":21044314112,"leader":1143088128565425265,"raftIndex":2187948,"raftTerm":8,"raftAppliedIndex":2187948,"dbSizeInUse":1077248}},{"Endpoint":"https://etcd2.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":2717076982127856411,"revision":2182271,"raft_term":8},"version":"3.5.13","dbSize":20952608768,"leader":1143088128565425265,"raftIndex":2187948,"raftTerm":8,"raftAppliedIndex":2187948,"dbSizeInUse":1077248}}]
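
The size gap is easier to see if you pull `dbSize` and `dbSizeInUse` out of the `endpoint status -w json` output. A minimal Go sketch; the struct models only the two fields used here (the real output has many more), and the sample is trimmed from the JSON above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpointStatus models just the fields of `etcdctl endpoint status -w json`
// output that matter here.
type endpointStatus struct {
	Endpoint string
	Status   struct {
		DbSize      int64 `json:"dbSize"`
		DbSizeInUse int64 `json:"dbSizeInUse"`
	}
}

func parseStatuses(raw []byte) ([]endpointStatus, error) {
	var out []endpointStatus
	err := json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	// Trimmed sample with the dbSize/dbSizeInUse values reported above.
	raw := []byte(`[
	  {"Endpoint":"https://etcd0.etcd.svc:2379","Status":{"dbSize":24013713408,"dbSizeInUse":1077248}},
	  {"Endpoint":"https://etcd1.etcd.svc:2379","Status":{"dbSize":21044314112,"dbSizeInUse":1077248}},
	  {"Endpoint":"https://etcd2.etcd.svc:2379","Status":{"dbSize":20952608768,"dbSizeInUse":1077248}}
	]`)
	statuses, err := parseStatuses(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range statuses {
		fmt.Printf("%s: dbSize=%d dbSizeInUse=%d reclaimable=%d\n",
			s.Endpoint, s.Status.DbSize, s.Status.DbSizeInUse,
			s.Status.DbSize-s.Status.DbSizeInUse)
	}
}
```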

What did you expect to happen?

Compaction should take effect during a snapshot.

How can we reproduce it (as minimally and precisely as possible)?

  1. Have one etcd cluster with a db size of about 20 GiB and a db in-use size of about 10 GiB. (The exact db/in-use sizes are not critical.)
  2. Calculate the next auto-compaction time and take a snapshot of one etcd member shortly before it.

Anything else we need to know?

auto compaction params:

--auto-compaction-mode=periodic
--auto-compaction-retention=5m

Etcd version (please run commands below)

v3.5.13

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

qsyqian commented 2 months ago

The second test.

Before this test, I compacted and defragmented to return the cluster to its original state.

Then I sent the same requests (same QPS and KV size) and took a snapshot from one etcd member, this time making sure the snapshot window does not overlap any compaction. This is the only difference.

Then check the logs. (They include the next compaction after the snapshot operation.)

Logs for etcd0:

{"level":"info","ts":"2024-08-27T09:16:00.041086Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":20992471040,"size":"21 GB"}
{"level":"info","ts":"2024-08-27T09:17:51.627652Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":20992471040,"checksum-size":32}
{"level":"info","ts":"2024-08-27T09:17:51.627695Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":20992471040,"size":"21 GB","took":"1 minute ago"}
{"level":"info","ts":"2024-08-27T09:18:18.985499Z","caller":"etcdserver/server.go:1443","msg":"triggering snapshot","local-member-id":"236527662c6439af","local-member-applied-index":2600026,"local-member-snapshot-index":2500025,"local-member-snapshot-count":100000}
{"level":"info","ts":"2024-08-27T09:18:18.998421Z","caller":"etcdserver/server.go:2462","msg":"saved snapshot","snapshot-index":2600026}
{"level":"info","ts":"2024-08-27T09:18:18.998523Z","caller":"etcdserver/server.go:2492","msg":"compacted Raft logs","compact-index":2595026}
{"level":"info","ts":"2024-08-27T09:19:15.782916Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2512762}
{"level":"info","ts":"2024-08-27T09:19:19.905582Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2512762,"took":"4.122138217s","hash":2429156478,"current-db-size-bytes":20992471040,"current-db-size":"21 GB","current-db-size-in-use-bytes":10456432640,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-08-27T09:19:19.905617Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2429156478,"revision":2512762,"compact-revision":2423174}

Logs for etcd1:

{"level":"info","ts":"2024-08-27T09:19:15.7752Z","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":2512762,"compact-period":"5m0s"}
{"level":"info","ts":"2024-08-27T09:19:15.781344Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2512762}
{"level":"info","ts":"2024-08-27T09:19:15.781408Z","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":2512762,"compact-period":"5m0s","took":"6.179178ms"}
{"level":"info","ts":"2024-08-27T09:19:19.785217Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2512762,"took":"4.003381407s","hash":2429156478,"current-db-size-bytes":21074321408,"current-db-size":"21 GB","current-db-size-in-use-bytes":10453090304,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-08-27T09:19:19.785252Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2429156478,"revision":2512762,"compact-revision":2423174}

Logs for etcd2:

{"level":"info","ts":"2024-08-27T09:19:15.786751Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2512762}
{"level":"info","ts":"2024-08-27T09:19:20.210784Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2512762,"took":"4.423434677s","hash":2429156478,"current-db-size-bytes":20979462144,"current-db-size":"21 GB","current-db-size-in-use-bytes":10466041856,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-08-27T09:19:20.210829Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2429156478,"revision":2512762,"compact-revision":2423174}

As we can see, nothing is unusual, and the db sizes are all the same.

endpoint status  -w table
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                           ENDPOINT                            |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd0.etcd.svc:2379 | 236527662c6439af |  3.5.13 |   21 GB |     false |      false |         8 |    2798807 |            2798806 |        |
| https://etcd1.etcd.svc:2379 |  fdd1098ba1bb471 |  3.5.13 |   21 GB |      true |      false |         8 |    2798812 |            2798812 |        |
| https://etcd2.etcd.svc:2379 | 25b4ff1e9b1f0b1b |  3.5.13 |   21 GB |     false |      false |         8 |    2798815 |            2798814 |        |
+---------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
endpoint status  -w json
[{"Endpoint":"https://etcd0.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":2550488083760167343,"revision":2803148,"raft_term":8},"version":"3.5.13","dbSize":20992471040,"leader":1143088128565425265,"raftIndex":2808853,"raftTerm":8,"raftAppliedIndex":2808853,"dbSizeInUse":11611930624}},{"Endpoint":"https://etcd1.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":1143088128565425265,"revision":2803151,"raft_term":8},"version":"3.5.13","dbSize":21074321408,"leader":1143088128565425265,"raftIndex":2808856,"raftTerm":8,"raftAppliedIndex":2808856,"dbSizeInUse":11611930624}},{"Endpoint":"https://etcd2.etcd.svc:2379","Status":{"header":{"cluster_id":8776373056027032411,"member_id":2717076982127856411,"revision":2803153,"raft_term":8},"version":"3.5.13","dbSize":20979462144,"leader":1143088128565425265,"raftIndex":2808861,"raftTerm":8,"raftAppliedIndex":2808858,"dbSizeInUse":11612975104}}]

It seems that a member's db only grows when a compaction happens within the duration of a snapshot of that member.

olderTaoist commented 2 months ago

/assign

ahrtr commented 2 months ago

Are you able to reproduce this issue? @ArkaSaha30 @ivanvc @henrybear327

qsyqian commented 2 months ago

After adding the log in #18516, the logs are:

Logs for etcd0:

{"level":"info","ts":"2024-08-30T10:02:00.039084Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":21776793600,"size":"22 GB"}
{"level":"info","ts":"2024-08-30T10:02:04.432765Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":3819633}
{"level":"info","ts":"2024-08-30T10:02:08.041394Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":3819633,"took":"3.608062921s","key-compactions":85268,"hash":2662401697,"current-db-size-bytes":21776793600,"current-db-size":"22 GB","current-db-size-in-use-bytes":19293433856,"current-db-size-in-use":"19 GB"}
{"level":"info","ts":"2024-08-30T10:02:08.041441Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2662401697,"revision":3819633,"compact-revision":3734365}
{"level":"info","ts":"2024-08-30T10:03:58.659047Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":21776793600,"checksum-size":32}
{"level":"info","ts":"2024-08-30T10:03:58.659129Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":21776793600,"size":"22 GB","took":"1 minute ago"}

Logs for etcd1:

{"level":"info","ts":"2024-08-30T10:02:04.423119Z","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":3819633,"compact-period":"5m0s"}
{"level":"info","ts":"2024-08-30T10:02:04.431864Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":3819633}
{"level":"info","ts":"2024-08-30T10:02:04.431923Z","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":3819633,"compact-period":"5m0s","took":"8.7761ms"}
{"level":"info","ts":"2024-08-30T10:02:08.319213Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":3819633,"took":"3.886921185s","key-compactions":85268,"hash":2662401697,"current-db-size-bytes":20865847296,"current-db-size":"21 GB","current-db-size-in-use-bytes":10421878784,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-08-30T10:02:08.319249Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2662401697,"revision":3819633,"compact-revision":3734365}

Logs for etcd2:

{"level":"info","ts":"2024-08-30T10:02:04.429785Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":3819633}
{"level":"info","ts":"2024-08-30T10:02:08.37252Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":3819633,"took":"3.942233381s","key-compactions":85268,"hash":2662401697,"current-db-size-bytes":20844969984,"current-db-size":"21 GB","current-db-size-in-use-bytes":10423861248,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-08-30T10:02:08.372556Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2662401697,"revision":3819633,"compact-revision":3734365}

The key-compactions count is the same for all three etcd members, but the db size in use for etcd0 differs greatly from the others.

ahrtr commented 2 months ago

Could you run commands below? Please build the bbolt command from https://github.com/etcd-io/bbolt/tree/main/cmd/bbolt

$ ./bbolt stats ./new.db 
Aggregate statistics for 10 buckets

Page count statistics
    Number of logical branch pages: 24
    Number of physical branch overflow pages: 0
    Number of logical leaf pages: 1450
    Number of physical leaf overflow pages: 2467
Tree statistics
    Number of keys/value pairs: 5085
    Number of levels in B+tree: 3
Page size utilization
    Bytes allocated for physical branch pages: 98304
    Bytes actually used for branch data: 48960 (49%)
    Bytes allocated for physical leaf pages: 16044032
    Bytes actually used for leaf data: 13768079 (85%)
Bucket statistics
    Total number of buckets: 10
    Total number on inlined buckets: 8 (80%)
    Bytes used for inlined buckets: 582 (0%)

$ ./bbolt inspect ./new.db 
{
    "name": "root",
    "keyN": 0,
    "buckets": [
        {
            "name": "alarm",
            "keyN": 0
        },
        {
            "name": "auth",
            "keyN": 1
        },
        {
            "name": "authRoles",
            "keyN": 0
        },
        {
            "name": "authUsers",
            "keyN": 0
        },
        {
            "name": "cluster",
            "keyN": 1
        },
        {
            "name": "key",
            "keyN": 5042
        },
        {
            "name": "lease",
            "keyN": 35
        },
        {
            "name": "members",
            "keyN": 1
        },
        {
            "name": "members_removed",
            "keyN": 0
        },
        {
            "name": "meta",
            "keyN": 5
        }
    ]
}
qsyqian commented 2 months ago

Thanks for the reply. Please let me know if you want more information. @ahrtr

I executed the commands after the snapshot (taken from etcd0) was done and the etcd cluster was stopped. The results are:

etcd0:

sh-4.2# ls -al db
-rw------- 1 root root 22825078784 Sep  5 02:48 db
sh-4.2# bbolt stats ./db 
Aggregate statistics for 10 buckets

Page count statistics
  Number of logical branch pages: 614
  Number of physical branch overflow pages: 0
  Number of logical leaf pages: 67282
  Number of physical leaf overflow pages: 3363950
Tree statistics
  Number of keys/value pairs: 134576
  Number of levels in B+tree: 4
Page size utilization
  Bytes allocated for physical branch pages: 2514944
  Bytes actually used for branch data: 2250360 (89%)
  Bytes allocated for physical leaf pages: 14054326272
  Bytes actually used for leaf data: 13786544128 (98%)
Bucket statistics
  Total number of buckets: 10
  Total number on inlined buckets: 9 (90%)
  Bytes used for inlined buckets: 959 (0%)
sh-4.2# bbolt inspect ./db 
{
    "name": "root",
    "keyN": 0,
    "buckets": [
        {
            "name": "alarm",
            "keyN": 0
        },
        {
            "name": "auth",
            "keyN": 1
        },
        {
            "name": "authRoles",
            "keyN": 0
        },
        {
            "name": "authUsers",
            "keyN": 0
        },
        {
            "name": "cluster",
            "keyN": 1
        },
        {
            "name": "key",
            "keyN": 134566
        },
        {
            "name": "lease",
            "keyN": 0
        },
        {
            "name": "members",
            "keyN": 3
        },
        {
            "name": "members_removed",
            "keyN": 0
        },
        {
            "name": "meta",
            "keyN": 5
        }
    ]
}

etcd1:

sh-4.2# ls -al db
-rw------- 1 root root 21282004992 Sep  5 02:47 db
sh-4.2# bbolt stats ./db
Aggregate statistics for 10 buckets

Page count statistics
  Number of logical branch pages: 614
  Number of physical branch overflow pages: 0
  Number of logical leaf pages: 67282
  Number of physical leaf overflow pages: 3363950
Tree statistics
  Number of keys/value pairs: 134576
  Number of levels in B+tree: 4
Page size utilization
  Bytes allocated for physical branch pages: 2514944
  Bytes actually used for branch data: 2250360 (89%)
  Bytes allocated for physical leaf pages: 14054326272
  Bytes actually used for leaf data: 13786544128 (98%)
Bucket statistics
  Total number of buckets: 10
  Total number on inlined buckets: 9 (90%)
  Bytes used for inlined buckets: 959 (0%)
sh-4.2# bbolt inspect ./db
{
    "name": "root",
    "keyN": 0,
    "buckets": [
        {
            "name": "alarm",
            "keyN": 0
        },
        {
            "name": "auth",
            "keyN": 1
        },
        {
            "name": "authRoles",
            "keyN": 0
        },
        {
            "name": "authUsers",
            "keyN": 0
        },
        {
            "name": "cluster",
            "keyN": 1
        },
        {
            "name": "key",
            "keyN": 134566
        },
        {
            "name": "lease",
            "keyN": 0
        },
        {
            "name": "members",
            "keyN": 3
        },
        {
            "name": "members_removed",
            "keyN": 0
        },
        {
            "name": "meta",
            "keyN": 5
        }
    ]
}

etcd2:

sh-4.2# ls -al db
-rw------- 1 root root 21267476480 Sep  5 02:47 db
sh-4.2# bbolt stats ./db 
Aggregate statistics for 10 buckets

Page count statistics
  Number of logical branch pages: 614
  Number of physical branch overflow pages: 0
  Number of logical leaf pages: 67282
  Number of physical leaf overflow pages: 3363950
Tree statistics
  Number of keys/value pairs: 134576
  Number of levels in B+tree: 4
Page size utilization
  Bytes allocated for physical branch pages: 2514944
  Bytes actually used for branch data: 2250360 (89%)
  Bytes allocated for physical leaf pages: 14054326272
  Bytes actually used for leaf data: 13786544128 (98%)
Bucket statistics
  Total number of buckets: 10
  Total number on inlined buckets: 9 (90%)
  Bytes used for inlined buckets: 959 (0%)
sh-4.2# bbolt inspect ./db 
{
    "name": "root",
    "keyN": 0,
    "buckets": [
        {
            "name": "alarm",
            "keyN": 0
        },
        {
            "name": "auth",
            "keyN": 1
        },
        {
            "name": "authRoles",
            "keyN": 0
        },
        {
            "name": "authUsers",
            "keyN": 0
        },
        {
            "name": "cluster",
            "keyN": 1
        },
        {
            "name": "key",
            "keyN": 134566
        },
        {
            "name": "lease",
            "keyN": 0
        },
        {
            "name": "members",
            "keyN": 3
        },
        {
            "name": "members_removed",
            "keyN": 0
        },
        {
            "name": "meta",
            "keyN": 5
        }
    ]
}
ivanvc commented 2 months ago

Would you still like some help reproducing this issue? I'd like to help, but I'm not sure how to simulate or get the database to the 10GB size.

qsyqian commented 2 months ago

Thanks for your help. @ivanvc

Here are more details on reproducing this issue.

  1. Create one etcd cluster with 3 nodes. Note the compaction params: --auto-compaction-mode=periodic --auto-compaction-retention=5m.
  2. Send requests to this etcd cluster with the benchmark put command from this branch: https://github.com/qsyqian/etcd/commit/97fae23a16abcbe7f7f0c3014e77dd1eefde7e66
    ./benchmark --cacert ./ca.pem --cert cert.pem --key key.pem --endpoints=https://etcd0:2379,https://etcd1:2379,https://etcd2:2379 --rate 300 --total 1000000 --val-size 102400  --clients 6 --conns 20 --insecure-skip-verify  put
  3. Pick one follower member (etcd1 this time) and take a snapshot a few seconds before the auto compaction.
    ./etcdctl snapshot  --cacert ./ca.pem --cert cert.pem --key key.pem --endpoints=https://$etcd1_IP:2379 --insecure-skip-tls-verify   save ./db
  4. Check the logs and the endpoint status.

Logs for etcd0:

{"level":"info","ts":"2024-09-05T06:17:55.351126Z","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":1040875,"compact-period":"5m0s"}
{"level":"info","ts":"2024-09-05T06:17:55.358592Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1040875}
{"level":"info","ts":"2024-09-05T06:17:55.358641Z","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":1040875,"compact-period":"5m0s","took":"7.483919ms"}
{"level":"info","ts":"2024-09-05T06:17:59.927419Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1040875,"took":"4.568328034s","key-compactions":89635,"hash":1615407175,"current-db-size-bytes":20875321344,"current-db-size":"21 GB","current-db-size-in-use-bytes":10473357312,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-09-05T06:17:59.927459Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":1615407175,"revision":1040875,"compact-revision":951240}

Logs for etcd1:

{"level":"info","ts":"2024-09-05T06:17:29.342478Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":20874055680,"size":"21 GB"}
{"level":"info","ts":"2024-09-05T06:17:55.361292Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1040875}
{"level":"info","ts":"2024-09-05T06:17:59.09344Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1040875,"took":"3.731648292s","key-compactions":89635,"hash":1615407175,"current-db-size-bytes":20874055680,"current-db-size":"21 GB","current-db-size-in-use-bytes":19745542144,"current-db-size-in-use":"20 GB"}
{"level":"info","ts":"2024-09-05T06:17:59.093478Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":1615407175,"revision":1040875,"compact-revision":951240}
{"level":"info","ts":"2024-09-05T06:19:23.207856Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":20874055680,"checksum-size":32}
{"level":"info","ts":"2024-09-05T06:19:23.207983Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":20874055680,"size":"21 GB","took":"1 minute ago"}

Logs for etcd2:

{"level":"info","ts":"2024-09-05T06:17:55.364262Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1040875}
{"level":"info","ts":"2024-09-05T06:17:59.695769Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1040875,"took":"4.331001622s","key-compactions":89635,"hash":1615407175,"current-db-size-bytes":20951465984,"current-db-size":"21 GB","current-db-size-in-use-bytes":10465939456,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-09-05T06:17:59.695805Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":1615407175,"revision":1040875,"compact-revision":951240}
endpoint status -w table
+------------------------+------------------+---------+-----------------+---------+--------+-----------------------+-------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | STORAGE VERSION | DB SIZE | IN USE | PERCENTAGE NOT IN USE | QUOTA | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+-----------------+---------+--------+-----------------------+-------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd2:2379 | c202fbb15816e8a3 |  3.5.13 |                 |   21 GB |  15 GB |                   30% |   0 B |     false |      false |         8 |    1361716 |            1361716 |        |
|  https://etcd0:2379 |  f5062531d505f9e |  3.5.13 |                 |   21 GB |  15 GB |                   30% |   0 B |      true |      false |         8 |    1361720 |            1361719 |        |
|  https://etcd1:2379 | eb74fbe807419692 |  3.5.13 |                 |   24 GB |  15 GB |                   38% |   0 B |     false |      false |         8 |    1361724 |            1361724 |        |
+------------------------+------------------+---------+-----------------+---------+--------+-----------------------+-------+-----------+------------+-----------+------------+--------------------+--------+
endpoint status -w json
[{"Endpoint":"https://etcd2:2379","Status":{"header":{"cluster_id":15796809493248337086,"member_id":13980012932417120419,"revision":1384214,"raft_term":8},"version":"3.5.13","dbSize":20951465984,"leader":1103490017819385758,"raftIndex":1384260,"raftTerm":8,"raftAppliedIndex":1384260,"dbSizeInUse":17142501376}},{"Endpoint":"https://etcd0:2379","Status":{"header":{"cluster_id":15796809493248337086,"member_id":1103490017819385758,"revision":1384218,"raft_term":8},"version":"3.5.13","dbSize":20875321344,"leader":1103490017819385758,"raftIndex":1384264,"raftTerm":8,"raftAppliedIndex":1384264,"dbSizeInUse":17143234560}},{"Endpoint":"https://etcd1:2379","Status":{"header":{"cluster_id":15796809493248337086,"member_id":16966462670184224402,"revision":1384220,"raft_term":8},"version":"3.5.13","dbSize":23535874048,"leader":1103490017819385758,"raftIndex":1384266,"raftTerm":8,"raftAppliedIndex":1384266,"dbSizeInUse":17143652352}}]
qsyqian commented 2 months ago

This comment says:

A long running read transaction can cause the database to quickly grow.

And this comment also shows:

Long running read transactions can be problematic in Bolt because they can cause the data file to grow since their dependent pages can't be reused.

The etcd snapshot process holds one long-running read transaction until the entire db has been sent to the client. So is this a BoltDB issue that has persisted for a very, very long time? :(
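
As a toy illustration of that mechanism (explicitly not bbolt's real freelist code): in an MVCC store, pages freed by a write transaction only become reusable once no open read transaction can still see them, so steady write churn during a snapshot's long-lived read transaction forces the file to extend instead of reusing free pages. A self-contained Go sketch:

```go
package main

import "fmt"

// db is a toy model of copy-on-write page reuse, not bbolt's implementation.
type db struct {
	filePages   int  // total pages ever allocated; the file never shrinks
	freePages   int  // pages reusable right now
	pendingFree int  // pages freed while a reader is open; not yet reusable
	readerOpen  bool // a snapshot holds one long-running read transaction
}

// write simulates a write transaction that needs pagesNeeded new pages
// and frees pagesFreed old ones.
func (d *db) write(pagesNeeded, pagesFreed int) {
	// Reuse free pages first; grow the file for the rest.
	reused := pagesNeeded
	if d.freePages < reused {
		reused = d.freePages
	}
	d.freePages -= reused
	d.filePages += pagesNeeded - reused
	if d.readerOpen {
		d.pendingFree += pagesFreed // the reader may still reference them
	} else {
		d.freePages += pagesFreed
	}
}

// closeReader ends the read transaction, releasing its retained pages.
func (d *db) closeReader() {
	d.readerOpen = false
	d.freePages += d.pendingFree
	d.pendingFree = 0
}

func main() {
	d := &db{filePages: 100, freePages: 50}
	d.readerOpen = true // snapshot starts
	for i := 0; i < 10; i++ {
		d.write(10, 10) // steady churn: each txn allocates 10 pages, frees 10
	}
	fmt.Println("during snapshot: file =", d.filePages, "pages") // file = 150 pages
	d.closeReader()
	fmt.Println("after snapshot: free =", d.freePages, "pages") // free = 100 pages
}
```

With no reader open, the same churn would reuse the freed pages and the file would stay at 100 pages; with the reader open, the free pool drains and the file grows, which matches the db size growth observed on the snapshotted member.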

ahrtr commented 2 months ago

Execute the commands after snapshotting is done(with etcd0) and stop the etcd cluster. The results are:

Just to double confirm: when you executed the commands, had you already reproduced the issue again (etcd0 has a much bigger value for current-db-size-in-use-bytes)?

qsyqian commented 2 months ago

Execute the commands after snapshotting is done(with etcd0) and stop the etcd cluster. The results are:

Just to double confirm: when you executed the commands, had you already reproduced the issue again (etcd0 has a much bigger value for current-db-size-in-use-bytes)?

Yes, but unfortunately I didn't save the log. :(

If you find anything strange, please tell me. I can test it again.

ahrtr commented 2 months ago

Yes, but unfortunately I didn't save the log. :(

If you find anything strange, please tell me. I can test it again.

Thanks for the feedback. Please restart the problematic etcd instance after reproducing this "issue", afterwards please verify whether the "size-in-use" can be recovered back to normal automatically. Search the log "recovered v3 backend from snapshot" and check the value of backend-size-in-use-bytes, or just execute etcdctl endpoint status -w json and check the field dbSizeInUse.

https://github.com/etcd-io/etcd/blob/1d4372addcbcd7a2e00bd774bef6372394ed42d2/server/etcdserver/server.go#L524-L528

ahrtr commented 2 months ago

Please restart the problematic etcd instance after reproducing this "issue", afterwards please verify whether the "size-in-use" can be recovered back to normal automatically.

Actually you don't have to restart the etcd instance. Just executing etcdctl endpoint status -w json (and checking the field dbSizeInUse) after the snapshot operation is done should be fine. So the rough steps are below,

Please let me know the result.
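The check above can be scripted. A minimal sketch; the JSON here is a canned two-member sample that only mimics the shape of `etcdctl endpoint status -w json` output (the numbers are taken from this thread). Against a live cluster you would pipe the real command into the same python snippet instead of the canned file.

```shell
# Compare dbSize vs dbSizeInUse per endpoint. Canned sample for illustration:
cat <<'EOF' > /tmp/status.json
[
  {"Endpoint": "https://etcd0:2379", "Status": {"dbSize": 20805419008, "dbSizeInUse": 14477979648}},
  {"Endpoint": "https://etcd1:2379", "Status": {"dbSize": 23295725568, "dbSizeInUse": 14478397440}}
]
EOF
# Live cluster: etcdctl endpoint status -w json | python3 - <<'EOF' ...
python3 - <<'EOF'
import json
for m in json.load(open("/tmp/status.json")):
    s = m["Status"]
    print(f'{m["Endpoint"]}: dbSize={s["dbSize"]} dbSizeInUse={s["dbSizeInUse"]} '
          f'free={s["dbSize"] - s["dbSizeInUse"]}')
EOF
```

A large, persistent gap between dbSize and dbSizeInUse on one member is the symptom being checked for here.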

qsyqian commented 2 months ago

The result is below. Please note that the log time plus 8 hours is the command run time.

Do snapshot with etcd1:

 date; ./etcdctl snapshot  --cacert ./ca.pem --cert cert.pem --key key.pem --endpoints=https://etcd1:2379 --insecure-skip-tls-verify   save ./db
Fri Sep  6 08:57:59 CST 2024
{"level":"info","ts":"2024-09-06T08:57:59.266841+0800","caller":"snapshot/v3_snapshot.go:67","msg":"created temporary db file","path":"./db.part"}
{"level":"info","ts":"2024-09-06T08:57:59.273356+0800","logger":"client","caller":"v3/maintenance.go:237","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-09-06T08:57:59.283075+0800","caller":"snapshot/v3_snapshot.go:75","msg":"fetching snapshot","endpoint":"https://7.8.70.3:2379"}
{"level":"info","ts":"2024-09-06T08:59:51.635947+0800","logger":"client","caller":"v3/maintenance.go:303","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-09-06T09:00:26.803606+0800","caller":"snapshot/v3_snapshot.go:90","msg":"fetched snapshot","endpoint":"https://7.8.70.3:2379","size":"21 GB","took":"2m27.53669875s","etcd-version":""}
{"level":"info","ts":"2024-09-06T09:00:26.803711+0800","caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"./db"}
Snapshot saved at ./db

logs for etcd0:

{"level":"info","ts":"2024-09-06T00:58:29.653151Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2039307}
{"level":"info","ts":"2024-09-06T00:58:33.62825Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2039307,"took":"3.974669443s","key-compactions":89650,"hash":4293915956,"current-db-size-bytes":20805419008,"current-db-size":"21 GB","current-db-size-in-use-bytes":10428334080,"current-db-size-in-use":"10 GB"}

logs for etcd1:(the snapshot member)

{"level":"info","ts":"2024-09-06T00:57:59.278771Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":20933140480,"size":"21 GB"}
{"level":"info","ts":"2024-09-06T00:58:29.649351Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2039307}
{"level":"info","ts":"2024-09-06T00:58:32.990668Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2039307,"took":"3.340904508s","key-compactions":89650,"hash":4293915956,"current-db-size-bytes":20933140480,"current-db-size":"21 GB","current-db-size-in-use-bytes":19706777600,"current-db-size-in-use":"20 GB"}
{"level":"info","ts":"2024-09-06T00:58:32.990715Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":4293915956,"revision":2039307,"compact-revision":1949657}
{"level":"info","ts":"2024-09-06T00:59:51.636236Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":20933140480,"checksum-size":32}
{"level":"info","ts":"2024-09-06T00:59:51.636324Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":20933140480,"size":"21 GB","took":"1 minute ago"}

logs for etcd2:

{"level":"info","ts":"2024-09-06T00:58:29.643483Z","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":2039307,"compact-period":"5m0s"}
{"level":"info","ts":"2024-09-06T00:58:29.648981Z","caller":"mvcc/index.go:214","msg":"compact tree index","revision":2039307}
{"level":"info","ts":"2024-09-06T00:58:29.649036Z","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":2039307,"compact-period":"5m0s","took":"5.517805ms"}
{"level":"info","ts":"2024-09-06T00:58:34.370113Z","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":2039307,"took":"4.720662221s","key-compactions":89650,"hash":4293915956,"current-db-size-bytes":20828606464,"current-db-size":"21 GB","current-db-size-in-use-bytes":10451628032,"current-db-size-in-use":"10 GB"}
{"level":"info","ts":"2024-09-06T00:58:34.370151Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":4293915956,"revision":2039307,"compact-revision":1949657}

after snapshot done:

date; ./etcdctl endpoint status -w json | python -m json.tool
Fri Sep  6 09:00:45 CST 2024
[
    {
        "Endpoint": "https://etcd2:2379",
        "Status": {
            "dbSize": 20828606464,
            "dbSizeInUse": 14477979648,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 13980012932417120419,
                "raft_term": 9,
                "revision": 2177902
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2178172,
            "raftIndex": 2178172,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    },
    {
        "Endpoint": "https://etcd0:2379",
        "Status": {
            "dbSize": 20805419008,
            "dbSizeInUse": 14477979648,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 1103490017819385758,
                "raft_term": 9,
                "revision": 2177907
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2178177,
            "raftIndex": 2178177,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    },
    {
        "Endpoint": "https://etcd1:2379",
        "Status": {
            "dbSize": 23295725568,
            "dbSizeInUse": 14478397440,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 16966462670184224402,
                "raft_term": 9,
                "revision": 2177910
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2178180,
            "raftIndex": 2178180,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    }
]

Then restart etcd1. (Before the restart I didn't stop the benchmark client, and the restart took a long time, about 15m; etcd1 didn't start running until I stopped the benchmark client. I don't have time to dig into it.)

 date; ./etcdctl endpoint status -w json | python -m json.tool
Fri Sep  6 09:19:26 CST 2024
[
    {
        "Endpoint": "https://etcd2:2379",
        "Status": {
            "dbSize": 21817737216,
            "dbSizeInUse": 3679567872,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 13980012932417120419,
                "raft_term": 9,
                "revision": 2432069
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2432419,
            "raftIndex": 2432419,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    },
    {
        "Endpoint": "https://etcd0:2379",
        "Status": {
            "dbSize": 20805419008,
            "dbSizeInUse": 3679567872,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 1103490017819385758,
                "raft_term": 9,
                "revision": 2432069
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2432419,
            "raftIndex": 2432419,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    },
    {
        "Endpoint": "https://etcd1:2379",
        "Status": {
            "dbSize": 21817737216,
            "dbSizeInUse": 3712581632,
            "header": {
                "cluster_id": 15796809493248337086,
                "member_id": 16966462670184224402,
                "raft_term": 9,
                "revision": 2432069
            },
            "leader": 13980012932417120419,
            "raftAppliedIndex": 2432419,
            "raftIndex": 2432419,
            "raftTerm": 9,
            "version": "3.5.13"
        }
    }
]
qsyqian commented 2 months ago

I'll try to summarize this issue. Please tell me if I am wrong. cc @ahrtr

  1. The compaction during the snapshot process actually takes effect.

The logs show that the current-db-size-in-use is different from the others. From this, maybe the compaction actually executes and deletes the revisions not in keep (the log in PR #18516 also proves it), but the long-running snapshot read tx holds the pages, so they are not put on the free-page list. After the snapshot is done, the read tx releases the pages and the dbSizeInUse becomes the same as the others.

Note the dbSizeInUse calculation is below; it is closely tied to the number of free pages.

atomic.StoreInt64(&b.sizeInUse, size-(int64(stats.FreePageN)*int64(db.Info().PageSize)))
  1. After snapshot, the etcd1 member has big dbSize than others.

    For the same reason as in 1: the benchmark client continued to put and update keys and values during the snapshot; BoltDB did not have enough free pages to allocate, so it had to grow the db file.

ahrtr commented 2 months ago

Thanks for all the feedback, which confirmed that this isn't an issue at all. Instead, it's expected behavior.

Your feedback in https://github.com/etcd-io/etcd/issues/18509#issuecomment-2330519475 confirms that there is no issue with the compaction, because all etcd members had exactly the same k/v pairs after the compactions.

The root cause is as you already pointed out in https://github.com/etcd-io/etcd/issues/18509#issuecomment-2331075925. To be clearer, there are two things to clarify:

qsyqian commented 2 months ago

Thanks. @ahrtr

Could you take a look at #18545 and #18516? If they are not needed, I will close them.