etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.84k stars 9.77k forks source link

Etcd size sometimes starts growing and grows until "mvcc: database space exceeded" #8009

Closed wojtek-t closed 7 years ago

wojtek-t commented 7 years ago

We have observed already few cases, where suddenly (after days of running a GKE cluster), the size of database starts growing.

As an example, we have a cluster, that was running without any issues for ~2 weeks (size of database was ~16MB), and then its database started growing. The growth wasn't immediate - it took ~2days, before it reached 4GB limit and there were steps of growing. For the reference we have backups (snapshots) (done via etcdctl snapshot) that reflect the growth speed (the name contains the time when it was made)

  ... // all snapshots are roughly 16MB
  2017-05-24T04:57:24-07:00_snapshot.db 16,27 MB
  2017-05-24T05:57:26-07:00_snapshot.db 29,06 MB
  2017-05-24T06:57:30-07:00_snapshot.db 108,98 MB
  2017-05-24T07:57:36-07:00_snapshot.db 177,57 MB
  2017-05-24T08:57:51-07:00_snapshot.db 308,4 MB        
  2017-05-24T09:58:32-07:00_snapshot.db 534,54 MB
  2017-05-24T11:00:16-07:00_snapshot.db 655,73 MB
  2017-05-24T12:00:55-07:00_snapshot.db 764,22 MB
  ... // all snapshots of the same size
  2017-05-25T15:15:10-07:00_snapshot.db 764,22 MB
  2017-05-25T16:16:25-07:00_snapshot.db 818,14 MB
  2017-05-25T17:26:35-07:00_snapshot.db 963,93 MB
  ... // all snapshots of the same size
  2017-05-25T22:25:08-07:00_snapshot.db 963,93 MB
  2017-05-25T23:27:03-07:00_snapshot.db 1,56 GB
  2017-05-26T00:30:13-07:00_snapshot.db 1,56 GB
  2017-05-26T01:05:24-07:00_snapshot.db 1,56 GB
  2017-05-26T02:24:21-07:00_snapshot.db 2,18 GB
  ... // all snapshots of the same size
  2017-05-26T08:43:07-07:00_snapshot.db 2,18 GB
  2017-05-26T09:46:47-07:00_snapshot.db 2,19 GB
  ... // all snapshots of the same size
  2017-05-26T16:11:31-07:00_snapshot.db 2,19 GB
  2017-05-26T17:16:47-07:00_snapshot.db 2,65 GB
  2017-05-26T18:22:37-07:00_snapshot.db 3,12 GB
  2017-05-26T19:29:07-07:00_snapshot.db 3,86 GB
  2017-05-26T20:33:24-07:00_snapshot.db 4,6 GB
  <boom>

We've checked that we were doing compaction very regularly every 5m for the whole time - so it doesn't seem to be the same as: https://github.com/coreos/etcd/issues/7944 I'm attaching the interested lines from etcd logs in etcd-compaction.txt

[Note time in that logs are in UTC, and time in snapshot names is PST, so 7 hours difference]

To summarize, the compaction was always at most few thousands of transactions (so it's not that we did a lot during some 5m period), though there were some longer compactions, up to ~7s.

I started digging into individual snapshots and found some strange thing (I was using bolt)

  1. 16MB snapshot:
    
    Aggregate statistics for 10 buckets

Page count statistics Number of logical branch pages: 10 Number of physical branch overflow pages: 0 Number of logical leaf pages: 789 Number of physical leaf overflow pages: 518 Tree statistics Number of keys/value pairs: 1667 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 40960 Bytes actually used for branch data: 26494 (64%) Bytes allocated for physical leaf pages: 5353472 Bytes actually used for leaf data: 3411680 (63%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)

2. 534MB snapshot (5 hours later):

Aggregate statistics for 10 buckets

Page count statistics Number of logical branch pages: 65 Number of physical branch overflow pages: 0 Number of logical leaf pages: 5559 Number of physical leaf overflow pages: 107743 Tree statistics Number of keys/value pairs: 13073 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 266240 Bytes actually used for branch data: 186912 (70%) Bytes allocated for physical leaf pages: 464084992 Bytes actually used for leaf data: 451590110 (97%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)

3.  1.56GB snapshot (another ~36 hours later):

Aggregate statistics for 10 buckets

Page count statistics Number of logical branch pages: 70 Number of physical branch overflow pages: 0 Number of logical leaf pages: 4525 Number of physical leaf overflow pages: 115179 Tree statistics Number of keys/value pairs: 10978 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 286720 Bytes actually used for branch data: 152723 (53%) Bytes allocated for physical leaf pages: 490307584 Bytes actually used for leaf data: 478196884 (97%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)

4. 3.86GB snapshot (another ~18 hours later)

Aggregate statistics for 10 buckets

Page count statistics Number of logical branch pages: 90 Number of physical branch overflow pages: 0 Number of logical leaf pages: 6219 Number of physical leaf overflow pages: 6791 Tree statistics Number of keys/value pairs: 15478 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 368640 Bytes actually used for branch data: 209621 (56%) Bytes allocated for physical leaf pages: 53288960 Bytes actually used for leaf data: 36704465 (68%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)

5. 4.6GB snapshot (1hour later, right before exceeding space):

Aggregate statistics for 10 buckets

Page count statistics Number of logical branch pages: 89 Number of physical branch overflow pages: 0 Number of logical leaf pages: 6074 Number of physical leaf overflow pages: 6713 Tree statistics Number of keys/value pairs: 15173 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 364544 Bytes actually used for branch data: 204788 (56%) Bytes allocated for physical leaf pages: 52375552 Bytes actually used for leaf data: 36092789 (68%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 564 (0%)



What is extremely interesting to me is that both:
- Number of physical leaf overflow pages
- Bytes allocated for physical leaf pages
dropped by order of magnitude in this 3.86GB snapshot, but the total size of database didn't drop

Unfortunately I can't provide any of those snapshots due to privacy reasons, but maybe you can see anything that we can investigate (or share results of some commands) that can help with debugging?

@xiang90 @hongchaodeng @mml @lavalamp
wojtek-t commented 7 years ago

FYI, restoring from all those snapshots locally and running:

ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt
ls -la out.txt

always returns files of size in the order of at most tens of MB.

armstrongli commented 7 years ago

@wojtek-t We got the similar issue on the DB file size increasement. What we have done is to do defrag periodically. The DB file size decreases a lot(from 4.7GB to ~200MB).

xiang90 commented 7 years ago

@benbjohnson can you take a look at the boltdb data? We double checked the etcd txn, and do not feel we can have a long running transaction that can cause the db size continuous to grow.

xiang90 commented 7 years ago

@wojtek-t

2017-05-25T23:27:03-07:00_snapshot.db 1,56 GB 2017-05-26T00:30:13-07:00_snapshot.db 1,56 GB 2017-05-26T01:05:24-07:00_snapshot.db 1,56 GB

Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.

benbjohnson commented 7 years ago

@xiang90 The snapshot and batchTxBufferred seem to be the only places using bolt.Tx.Begin(false). It seems like it could get hung in a spot such as newSnapshotReaderCloser() where WriteTo() gets blocked on the output writer.

Can you add logging to print the Tx pointer (using the %p formatter) when you open and close a snapshot read transaction?

https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L173

https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L439-L443

I'm not sure how much it'll help for me to look at the DB. Also, I am pretty slammed right now and don't have the bandwidth. My suspicion is that it's the snapshot getting hung or somehow the Tx is getting dropped. There are some spots where a panic() could prevent the Tx from being cleaned up but I couldn't tell if you (or an underlying library like net/http) has a recover() function masking that issue.

wojtek-t commented 7 years ago

Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.

Good question. Maybe the underlying VM was restarted or sth? But that's just guessing, I don't really know.

xiang90 commented 7 years ago

@wojtek-t As @benbjohnson mentioned, if there is a pending snapshot for a long time, we will have troubles. so i want you to check if there is any failed or pending snapshot. if there is not, we probably can rule out the long running txn thing. we also add quite a few warning lines around long running read only txn in etcd master. if you can reproduce the issue, i would suggest you to back port the debugging patch and give it a try.

wojtek-t commented 7 years ago

@xiang90 - are there any logs that may confirm it in 3.0.17? We don't have reproduction for it (and backporting is hard on our side).

xiang90 commented 7 years ago

are there any logs that may confirm it in 3.0.17

sorry. not really. :(

We don't have reproduction for it

what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.

But, for 3.0, we are at a bad situation here. No reproduce, not enough logging. We have some guess around what happened, but they are not super convincing. The only way to move forward is probably to get more debugging logging and reproduce it.

@armstrongli if you guys can reproduce it, it would be super helpful.

wojtek-t commented 7 years ago

@xiang90 - do you know why between 1.56GB and 3.8GB the number of:

dropped by order of magnitude, but the size of DB more the doubled? That seems pretty strange to me....

wojtek-t commented 7 years ago

what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.

Ironicaly, so far those problems are happening on small clusters (less than 10 nodes). And there isn't big churn on them (at least on some of them). So we don't even need scale for this...

xiang90 commented 7 years ago

Ironicaly, so far those problems are happening on small clusters (less than 10 nodes).

Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?

wojtek-t commented 7 years ago

Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?

Can be both. What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.

wojtek-t commented 7 years ago

@xiang90 - I looked into logs of our backup tasks. Unfortunately, it seems that they were log-rotated in the meantime, but we have something. In particular I'm seeing this:

E0526 08:22:32.952657       5 main.go:209] stdout from failed etcdctl backup:
E0526 08:22:32.952698       5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
2017/05/26 08:22:31 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
Error:  grpc: timed out when dialing

Those times are UTC, so this happened during an hour when DB size grown from 1.56GB to 2.18GB. I don't see any other failures later, though I don't know what happened before.

xiang90 commented 7 years ago

dropped by order of magnitude, but the size of DB more the doubled?

boltdb neve reclaims the pages. What you see there are pages in use, there is also a freelist that tracks all freed pages. for 3.8gb one, i assume most of the pages become free?

one actually interesting point of your data is that the overflow pages are huge after 15.6MB one and becomes normal in the 3.8gb one.

in 1.56 gb one, for example, the overflow page/logic page ~ 20, which means the avg size of the key is (4000+ 115179) * 4kb/10978. is that what you expect?

xiang90 commented 7 years ago

E0526 08:22:32.952657 5 main.go:209] stdout from failed etcdctl backup:

so we do have failed snapshot?

wojtek-t commented 7 years ago

so we do have failed snapshot?

It failed on the client side - I don't know what happened on the server side (if it even reached the server or not).

xiang90 commented 7 years ago

FYI, restoring from all those snapshots locally and running:

ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt ls -la out.txt always returns files of size in the order of at most tens of MB.

This is also strange. For the 500MB one, the actual usage is 97%, which is (451590110). Probably it worth to look at the data through https://github.com/br0xen/boltbrowser to see what are in there.

I am curious how 10978 k8s keys take up to ~450MB.

xiang90 commented 7 years ago

E0526 08:22:32.952698 5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" }

do you have server side logging around the same time?

xiang90 commented 7 years ago

What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.

correct. we will try that. it would be helpful if you guys can also do some testing. i believe google probably has more computation resources than us :P.

wojtek-t commented 7 years ago

do you have server side logging around the same time?

I do - what should I look for?

xiang90 commented 7 years ago

I do - what should I look for?

any panics? any warning or error level logging?

wojtek-t commented 7 years ago

I don't see any panic and warnings. But it seems that I missed something important initially. It seems that etcd was restarted multiple times in the meantime. It was also restarted around that time. I'm trying to figure out why....

wojtek-t commented 7 years ago

So I looked deeper into what was happening with this cluster and it was very heavily overloaded. In particular we were restarting different components every now and then do to them not responding to liveness probes (etcd) or not being able to refresh master leases.

Regarding the large objects - those were huge endpoints containing O(1000) failed pods. This is k8s issue and I will file a separate issue in k8s repo for it, but it should cause etcd database space exceeded in my opinion.

wojtek-t commented 7 years ago

So I started looking into other clusters in which we faced the same issue, and it seems that there are some long-running "etcdctl snapshot save" operations there.

In particular, examples from logs from other clusters:

2017-05-26 23:07:46 UTC Starting backup.
...
2017-05-26 23:07:51 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

[The second line means backup is done and we are starting to upload it to where it should be stored.] This produced backup of size 193MB

And then it started getting longer:

2017-05-27 00:07:53 UTC Starting backup.    
...
2017-05-27 00:09:31 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

[It was still 193MB backup - and there were a bunch of backups like this, like 7 or so, each taking ~2m]

Then this one 4 minutes long:

2017-05-27 07:23:13 UTC Starting backup.
...
2017-05-27 07:27:36 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

And the one more which started before we exceeded database space, but finished after (apiserver logs below):

2017-05-27 08:28:35 UTC Starting backup.
...
2017-05-27 08:42:05 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

This one produced backup of 1.29GB size.

2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded      
2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded      
2017-05-27 08:34:52 UTC status.go:62    apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded

That said, I'm not seeing any logs about those long backups in etcd log (or I don't know what to look at there).

I also looked at the cluster for which I reported the issue (I managed to get all previous logs from doing backups) and the situation was pretty similar:

Then to 2minutes+

2017-05-26 16:44:31 UTC Starting backup.
...
2017-05-26 16:46:47 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

then to 4minutes+

2017-05-27 01:18:27 UTC Starting backup.
...
2017-05-27 01:22:37 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}

or 3 minutes+ right before we hit the "mvcc: database space exceeded":

2017-05-27 03:30:18 UTC Starting backup.
...
2017-05-27 03:33:24 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
wojtek-t commented 7 years ago

Though in the second case, the machine was overloaded, which might have also impact the time backups were taking.

xiang90 commented 7 years ago

but it should cause etcd database space exceeded in my opinion.

you mean should not, right? I tend to agree on this if that is the only bad thing happened.

A few minutes snapshot is not a unreasonably long running one. I was worrying about there is a hanging one or leaky one (never finished).

Thanks for all the information. I will try to see what can happen when etcd is taking a snapshot + receiving some large value updates.

We will keep you updated on this.

wojtek-t commented 7 years ago

you mean should not, right?

Yes, sorry.

I was worrying about there is a hang one or leaky one (never finished).

I see. I didn't see anything like that.

We will keep you updated on this.

Thanks a lot for debugging this @xiang90

xiang90 commented 7 years ago

@wojtek-t Can you check if etcd printing logs like

 etcdserver: apply entries took too long [40.772315ms for 1 entries]

when it was taking snapshot? (it should print out this much less frequent without a running snapshot).

mwf commented 7 years ago

@wojtek-t hi!

Please take a look #7472. Maybe you hit the similar problem as we did.

xiang90 commented 7 years ago

@wojtek-t

https://github.com/boltdb/bolt/pull/695 and https://github.com/boltdb/bolt/pull/694 should fix this issue.

we are still waiting for boltdb upstream to get this merged. boltdb upstream seems not to have enough bandwidth to review these.

if you can find someone from kubernetes community to help review it. boltdb is a simple b+tree, it should not take a developer more than a day or two to understand what it does.

xiang90 commented 7 years ago

should be fixed by https://github.com/coreos/etcd/pull/8210

xiang90 commented 7 years ago

fixed by #8210

wojtek-t commented 7 years ago

@xiang90 - great thanks! When can we expect this to be part of some release? Also will it be backported to 3.0. and/or 3.1. ?

wojtek-t commented 7 years ago

@xiang90 - great thanks! When can we expect this to be part of some release? Also will it be backported to 3.0. and/or 3.1. ?

@xiang90 - friendly ping ^^

sebastianwoinar commented 7 years ago

Has this been released already? Even 3.2.6 seems to still rely on the old backend (https://github.com/coreos/etcd/blob/v3.2.6/glide.yaml#L5).

heyitsanthony commented 7 years ago

@sebastianwoinar it will be in 3.3.

tatsuhiro-t commented 7 years ago

We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB. We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?

loxo33 commented 7 years ago

We are experiencing this issue with 3.2.7 and 3.2.9. Can this issue be re-opened and investigated, please?

xiang90 commented 7 years ago

@loxo33 the patch is merged in to master, not 3.2.x

xiang90 commented 7 years ago

We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB. We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?

probably a boltdb issue.

christianhuening commented 6 years ago

We might be hitting this as well with etcd 3.1. So what is the recommended action to get a working cluster again? Compacting, defragmenting and disarming does not work as the alarm gets re-raised immediately

sokoow commented 6 years ago

Where are we with this ? according to the comment above, it'd should have been in 3.3, but this is still unresolved right ?

karankh commented 6 years ago

Did you guys try to run defrag periodically, we do and it helps keep the size in check.

ghost commented 6 years ago

Is there any recommended action yet? We are running into this cluster issue aswell.

christianhuening commented 6 years ago

@ricklagerweij For me it turned out to be a configuration issue. I had to set the --quota-backend-bytes flag to something larger than 2GB for my cluster to continue working, as the cluster size was 2,3 GB. It defaults to 2GB, which is documented here: https://github.com/coreos/etcd/blob/master/Documentation/dev-guide/limit.md#storage-size-limit After that the defrag etc worked again as expected.

ghost commented 6 years ago

@christianhuening Thanks! That makes sense as 2 out of 3 nodes had a database which had a size of 2.1GB. I'm not familiar with etcd, how would I tweak this on an existing cluster? This page doesn't really give me much operational recommendations https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md

christianhuening commented 6 years ago

@ricklagerweij Well, assuming you have it running via systemd: You go to node 1, edit the systemd service file for etcd and add --quota-backend-bytes as an argument (pass the desired GBs as bytes (3GB = 3.221.225.472). Then you restart etcd on that node via systemctl restart etcd and monitor the status of the cluster on another node via etcdctl until all 3 nodes are healthy again (you can loose 1 out of 3 nodes with 3 nodes due to quorum) . Then repeat for nodes 2 and 3.

ghost commented 6 years ago

@christianhuening makes sense, thanks again.

typhoonzero commented 6 years ago

@xiang90 I've encountered the same issue now, according to the discussion, I need to update the version of my etcd right? Is there any workaround so I can get a quick fix?