Closed wojtek-t closed 7 years ago
FYI, restoring from all those snapshots locally and running:
ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt
ls -la out.txt
always returns files of size in the order of at most tens of MB.
@wojtek-t We got the similar issue on the DB file size increasement. What we have done is to do defrag periodically. The DB file size decreases a lot(from 4.7GB to ~200MB).
@benbjohnson can you take a look at the boltdb data? We double checked the etcd txn, and do not feel we can have a long running transaction that can cause the db size continuous to grow.
@wojtek-t
2017-05-25T23:27:03-07:00_snapshot.db 1,56 GB 2017-05-26T00:30:13-07:00_snapshot.db 1,56 GB 2017-05-26T01:05:24-07:00_snapshot.db 1,56 GB
Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.
@xiang90 The snapshot and batchTxBufferred
seem to be the only places using bolt.Tx.Begin(false)
. It seems like it could get hung in a spot such as newSnapshotReaderCloser()
where WriteTo()
gets blocked on the output writer.
Can you add logging to print the Tx
pointer (using the %p
formatter) when you open and close a snapshot read transaction?
https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L173
https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L439-L443
I'm not sure how much it'll help for me to look at the DB. Also, I am pretty slammed right now and don't have the bandwidth. My suspicion is that it's the snapshot getting hung or somehow the Tx is getting dropped. There are some spots where a panic()
could prevent the Tx
from being cleaned up but I couldn't tell if you (or an underlying library like net/http
) has a recover()
function masking that issue.
Do you know why the snapshot interval changed here? it was ~1hr per snapshot. but it is only 30min between the last two.
Good question. Maybe the underlying VM was restarted or sth? But that's just guessing, I don't really know.
@wojtek-t As @benbjohnson mentioned, if there is a pending snapshot for a long time, we will have troubles. so i want you to check if there is any failed or pending snapshot. if there is not, we probably can rule out the long running txn thing. we also add quite a few warning lines around long running read only txn in etcd master. if you can reproduce the issue, i would suggest you to back port the debugging patch and give it a try.
@xiang90 - are there any logs that may confirm it in 3.0.17? We don't have reproduction for it (and backporting is hard on our side).
are there any logs that may confirm it in 3.0.17
sorry. not really. :(
We don't have reproduction for it
what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.
But, for 3.0, we are at a bad situation here. No reproduce, not enough logging. We have some guess around what happened, but they are not super convincing. The only way to move forward is probably to get more debugging logging and reproduce it.
@armstrongli if you guys can reproduce it, it would be super helpful.
@xiang90 - do you know why between 1.56GB and 3.8GB the number of:
dropped by order of magnitude, but the size of DB more the doubled? That seems pretty strange to me....
what we plan to do at our side is to set up a few long running k8s stressful clusters against etcd3 while doing 10minute interval snapshot and some levels of controlled failure injection. Not sure if we can get a reproduce though.
Ironicaly, so far those problems are happening on small clusters (less than 10 nodes). And there isn't big churn on them (at least on some of them). So we don't even need scale for this...
Ironicaly, so far those problems are happening on small clusters (less than 10 nodes).
Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?
Hmm.... Is it because most of GKE users run small clusters? Or the bad percentage is indeed larger for small (low activity) clusters?
Can be both. What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.
@xiang90 - I looked into logs of our backup tasks. Unfortunately, it seems that they were log-rotated in the meantime, but we have something. In particular I'm seeing this:
E0526 08:22:32.952657 5 main.go:209] stdout from failed etcdctl backup:
E0526 08:22:32.952698 5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
2017/05/26 08:22:31 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379" <nil>}
Error: grpc: timed out when dialing
Those times are UTC, so this happened during an hour when DB size grown from 1.56GB to 2.18GB. I don't see any other failures later, though I don't know what happened before.
dropped by order of magnitude, but the size of DB more the doubled?
boltdb neve reclaims the pages. What you see there are pages in use, there is also a freelist that tracks all freed pages. for 3.8gb one, i assume most of the pages become free?
one actually interesting point of your data is that the overflow pages are huge after 15.6MB one and becomes normal in the 3.8gb one.
in 1.56 gb one, for example, the overflow page/logic page ~ 20, which means the avg size of the key is (4000+ 115179) * 4kb/10978. is that what you expect?
E0526 08:22:32.952657 5 main.go:209] stdout from failed etcdctl backup:
so we do have failed snapshot?
so we do have failed snapshot?
It failed on the client side - I don't know what happened on the server side (if it even reached the server or not).
FYI, restoring from all those snapshots locally and running:
ETCDCTL_API=3 etcdctl get /aaa /zzz > out.txt ls -la out.txt always returns files of size in the order of at most tens of MB.
This is also strange. For the 500MB one, the actual usage is 97%, which is (451590110). Probably it worth to look at the data through https://github.com/br0xen/boltbrowser to see what are in there.
I am curious how 10978 k8s keys take up to ~450MB.
E0526 08:22:32.952698 5 main.go:210] stderr from failed etcdctl backup: 2017/05/26 08:22:30 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:2379"
}
do you have server side logging around the same time?
What I wanted to say is just potentially it may not be easier to reproduce in scale than in small clusters.
correct. we will try that. it would be helpful if you guys can also do some testing. i believe google probably has more computation resources than us :P.
do you have server side logging around the same time?
I do - what should I look for?
I do - what should I look for?
any panics? any warning or error level logging?
I don't see any panic and warnings. But it seems that I missed something important initially. It seems that etcd was restarted multiple times in the meantime. It was also restarted around that time. I'm trying to figure out why....
So I looked deeper into what was happening with this cluster and it was very heavily overloaded. In particular we were restarting different components every now and then do to them not responding to liveness probes (etcd) or not being able to refresh master leases.
Regarding the large objects - those were huge endpoints containing O(1000) failed pods. This is k8s issue and I will file a separate issue in k8s repo for it, but it should cause etcd database space exceeded in my opinion.
So I started looking into other clusters in which we faced the same issue, and it seems that there are some long-running "etcdctl snapshot save" operations there.
In particular, examples from logs from other clusters:
2017-05-26 23:07:46 UTC Starting backup.
...
2017-05-26 23:07:51 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
[The second line means backup is done and we are starting to upload it to where it should be stored.] This produced backup of size 193MB
And then it started getting longer:
2017-05-27 00:07:53 UTC Starting backup.
...
2017-05-27 00:09:31 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
[It was still 193MB backup - and there were a bunch of backups like this, like 7 or so, each taking ~2m]
Then this one 4 minutes long:
2017-05-27 07:23:13 UTC Starting backup.
...
2017-05-27 07:27:36 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
And the one more which started before we exceeded database space, but finished after (apiserver logs below):
2017-05-27 08:28:35 UTC Starting backup.
...
2017-05-27 08:42:05 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
This one produced backup of 1.29GB size.
2017-05-27 08:34:52 UTC status.go:62 apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded
2017-05-27 08:34:52 UTC status.go:62 apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded
2017-05-27 08:34:52 UTC status.go:62 apiserver received an error that is not an metav1.Status: etcdserver: mvcc: database space exceeded
That said, I'm not seeing any logs about those long backups in etcd log (or I don't know what to look at there).
I also looked at the cluster for which I reported the issue (I managed to get all previous logs from doing backups) and the situation was pretty similar:
2017-05-26 00:24:57 UTC Starting backup.
...
2017-05-26 00:26:35 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
or
2017-05-26 07:28:54 UTC Starting backup
...
2017-05-26 07:30:14 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
Then to 2minutes+
2017-05-26 16:44:31 UTC Starting backup.
...
2017-05-26 16:46:47 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
then to 4minutes+
2017-05-27 01:18:27 UTC Starting backup.
...
2017-05-27 01:22:37 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
or 3 minutes+ right before we hit the "mvcc: database space exceeded":
2017-05-27 03:30:18 UTC Starting backup.
...
2017-05-27 03:33:24 UTC Backing up file {"/var/lib/etcd/backups" "snapshot.db"}
Though in the second case, the machine was overloaded, which might have also impact the time backups were taking.
but it should cause etcd database space exceeded in my opinion.
you mean should not
, right? I tend to agree on this if that is the only bad thing happened.
A few minutes snapshot is not a unreasonably long running one. I was worrying about there is a hanging one or leaky one (never finished).
Thanks for all the information. I will try to see what can happen when etcd is taking a snapshot + receiving some large value updates.
We will keep you updated on this.
you mean should not, right?
Yes, sorry.
I was worrying about there is a hang one or leaky one (never finished).
I see. I didn't see anything like that.
We will keep you updated on this.
Thanks a lot for debugging this @xiang90
@wojtek-t Can you check if etcd printing logs like
etcdserver: apply entries took too long [40.772315ms for 1 entries]
when it was taking snapshot? (it should print out this much less frequent without a running snapshot).
@wojtek-t hi!
Please take a look #7472. Maybe you hit the similar problem as we did.
@wojtek-t
https://github.com/boltdb/bolt/pull/695 and https://github.com/boltdb/bolt/pull/694 should fix this issue.
we are still waiting for boltdb upstream to get this merged. boltdb upstream seems not to have enough bandwidth to review these.
if you can find someone from kubernetes community to help review it. boltdb is a simple b+tree, it should not take a developer more than a day or two to understand what it does.
should be fixed by https://github.com/coreos/etcd/pull/8210
fixed by #8210
@xiang90 - great thanks! When can we expect this to be part of some release? Also will it be backported to 3.0. and/or 3.1. ?
@xiang90 - great thanks! When can we expect this to be part of some release? Also will it be backported to 3.0. and/or 3.1. ?
@xiang90 - friendly ping ^^
Has this been released already? Even 3.2.6 seems to still rely on the old backend (https://github.com/coreos/etcd/blob/v3.2.6/glide.yaml#L5).
@sebastianwoinar it will be in 3.3.
We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB. We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?
We are experiencing this issue with 3.2.7 and 3.2.9. Can this issue be re-opened and investigated, please?
@loxo33 the patch is merged in to master, not 3.2.x
We experienced the same issue today in our kubernetes cluster. DB size gets from 80MB to 12GB. We did very frequent snapshot (less than 5min). It looks like the root cause is boltdb, but does frequent snapshot trigger boltdb issue? Or just a completely different?
probably a boltdb issue.
We might be hitting this as well with etcd 3.1. So what is the recommended action to get a working cluster again? Compacting, defragmenting and disarming does not work as the alarm gets re-raised immediately
Where are we with this ? according to the comment above, it'd should have been in 3.3, but this is still unresolved right ?
Did you guys try to run defrag periodically, we do and it helps keep the size in check.
Is there any recommended action yet? We are running into this cluster issue aswell.
@ricklagerweij For me it turned out to be a configuration issue. I had to set the --quota-backend-bytes
flag to something larger than 2GB for my cluster to continue working, as the cluster size was 2,3 GB. It defaults to 2GB, which is documented here: https://github.com/coreos/etcd/blob/master/Documentation/dev-guide/limit.md#storage-size-limit
After that the defrag etc worked again as expected.
@christianhuening Thanks! That makes sense as 2 out of 3 nodes had a database which had a size of 2.1GB. I'm not familiar with etcd, how would I tweak this on an existing cluster? This page doesn't really give me much operational recommendations https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md
@ricklagerweij Well, assuming you have it running via systemd: You go to node 1, edit the systemd service file for etcd and add --quota-backend-bytes
as an argument (pass the desired GBs as bytes (3GB = 3.221.225.472). Then you restart etcd on that node via systemctl restart etcd
and monitor the status of the cluster on another node via etcdctl
until all 3 nodes are healthy again (you can loose 1 out of 3 nodes with 3 nodes due to quorum) . Then repeat for nodes 2 and 3.
@christianhuening makes sense, thanks again.
@xiang90 I've encountered the same issue now, according to the discussion, I need to update the version of my etcd right? Is there any workaround so I can get a quick fix?
We have observed already few cases, where suddenly (after days of running a GKE cluster), the size of database starts growing.
As an example, we have a cluster, that was running without any issues for ~2 weeks (size of database was ~16MB), and then its database started growing. The growth wasn't immediate - it took ~2days, before it reached 4GB limit and there were steps of growing. For the reference we have backups (snapshots) (done via etcdctl snapshot) that reflect the growth speed (the name contains the time when it was made)
We've checked that we were doing compaction very regularly every 5m for the whole time - so it doesn't seem to be the same as: https://github.com/coreos/etcd/issues/7944 I'm attaching the interested lines from etcd logs in etcd-compaction.txt
[Note time in that logs are in UTC, and time in snapshot names is PST, so 7 hours difference]
To summarize, the compaction was always at most few thousands of transactions (so it's not that we did a lot during some 5m period), though there were some longer compactions, up to ~7s.
I started digging into individual snapshots and found some strange thing (I was using bolt)
Page count statistics Number of logical branch pages: 10 Number of physical branch overflow pages: 0 Number of logical leaf pages: 789 Number of physical leaf overflow pages: 518 Tree statistics Number of keys/value pairs: 1667 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 40960 Bytes actually used for branch data: 26494 (64%) Bytes allocated for physical leaf pages: 5353472 Bytes actually used for leaf data: 3411680 (63%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)
Aggregate statistics for 10 buckets
Page count statistics Number of logical branch pages: 65 Number of physical branch overflow pages: 0 Number of logical leaf pages: 5559 Number of physical leaf overflow pages: 107743 Tree statistics Number of keys/value pairs: 13073 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 266240 Bytes actually used for branch data: 186912 (70%) Bytes allocated for physical leaf pages: 464084992 Bytes actually used for leaf data: 451590110 (97%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)
Aggregate statistics for 10 buckets
Page count statistics Number of logical branch pages: 70 Number of physical branch overflow pages: 0 Number of logical leaf pages: 4525 Number of physical leaf overflow pages: 115179 Tree statistics Number of keys/value pairs: 10978 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 286720 Bytes actually used for branch data: 152723 (53%) Bytes allocated for physical leaf pages: 490307584 Bytes actually used for leaf data: 478196884 (97%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)
Aggregate statistics for 10 buckets
Page count statistics Number of logical branch pages: 90 Number of physical branch overflow pages: 0 Number of logical leaf pages: 6219 Number of physical leaf overflow pages: 6791 Tree statistics Number of keys/value pairs: 15478 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 368640 Bytes actually used for branch data: 209621 (56%) Bytes allocated for physical leaf pages: 53288960 Bytes actually used for leaf data: 36704465 (68%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 536 (0%)
Aggregate statistics for 10 buckets
Page count statistics Number of logical branch pages: 89 Number of physical branch overflow pages: 0 Number of logical leaf pages: 6074 Number of physical leaf overflow pages: 6713 Tree statistics Number of keys/value pairs: 15173 Number of levels in B+tree: 3 Page size utilization Bytes allocated for physical branch pages: 364544 Bytes actually used for branch data: 204788 (56%) Bytes allocated for physical leaf pages: 52375552 Bytes actually used for leaf data: 36092789 (68%) Bucket statistics Total number of buckets: 10 Total number on inlined buckets: 9 (90%) Bytes used for inlined buckets: 564 (0%)