Closed: petermattis closed this issue 8 years ago.
More detail: the memory spike happens when the paused node resumes.
From the log of the node with the memory explosion I see:
I160602 17:23:56.852051 storage/replica_raftstorage.go:524 generated snapshot for range 9 at index 10 in 515.102351ms. encoded size=34325773, 85770 KV pairs, 0 log entries
I160602 17:24:12.619729 storage/replica_raftstorage.go:524 generated snapshot for range 9 at index 4990 in 397.115011ms. encoded size=44918169, 95677 KV pairs, 4980 log entries
So we generated snapshot data that is 33 and 43 MB respectively. Note that this is the marshaled size; before marshaling we hold at least that much data in the unmarshaled `roachpb.RaftSnapshotData` structure. We then store the marshaled data in the `raft.Snapshot.Data` field, and the `raft.Snapshot` is placed in the `Snapshot` field of a `raft.Message`. I haven't followed the code further than that, but I'm counting at least 3x the data size in memory usage. Even though some of this memory is short-lived, it is unfortunate.
Following the thread a bit further, the `raft.Message` then gets placed inside a `roachpb.RaftMessageRequest`, which is eventually marshaled and sent via gRPC.
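To make the 3x estimate concrete, here is a minimal sketch of the copy chain described above, using plain byte slices in place of the actual protobuf types (the function names are illustrative, not CockroachDB APIs):

```go
package main

import "fmt"

// kvPairs simulates the unmarshaled RaftSnapshotData: one allocation
// per KV pair (copy #1 of the payload).
func kvPairs(n, size int) [][]byte {
	pairs := make([][]byte, n)
	for i := range pairs {
		pairs[i] = make([]byte, size)
	}
	return pairs
}

// marshal simulates flattening the pairs into a single encoded buffer,
// as RaftSnapshotData.Marshal does (copy #2).
func marshal(pairs [][]byte) []byte {
	var buf []byte
	for _, p := range pairs {
		buf = append(buf, p...)
	}
	return buf
}

// wrap simulates re-encoding the snapshot bytes inside an outer message
// (raft.Message -> RaftMessageRequest -> gRPC encode), which copies the
// payload once more (copy #3).
func wrap(data []byte) []byte {
	out := make([]byte, len(data))
	copy(out, data)
	return out
}

func main() {
	const payload = 1000 * 1024 // ~1 MB of KV data
	pairs := kvPairs(1000, 1024)
	encoded := marshal(pairs)
	sent := wrap(encoded)
	// All three copies are live at once while the message is in flight.
	total := payload + len(encoded) + len(sent)
	fmt.Println(total / payload) // ~3x the payload size
}
```

Running this prints `3`: even before gRPC's own buffering, the payload exists in (at least) three forms simultaneously.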
A memory profile over a 4 second period where 2 snapshots were generated shows:
192.22MB 14.53% 14.53% 192.22MB 14.53% github.com/cockroachdb/cockroach/roachpb.(*RaftSnapshotData).Marshal
186.26MB 14.08% 28.60% 186.26MB 14.08% github.com/cockroachdb/cockroach/storage/engine.(*MVCCMetadata).Unmarshal
150.72MB 11.39% 40.00% 296.44MB 22.40% google.golang.org/grpc.encode
145.71MB 11.01% 51.01% 145.71MB 11.01% github.com/cockroachdb/cockroach/storage.(*RaftMessageRequest).Marshal
111.05MB 8.39% 59.40% 111.05MB 8.39% github.com/cockroachdb/cockroach/storage/engine.ChunkAllocator.newChunk
93.39MB 7.06% 66.46% 500.31MB 37.81% github.com/cockroachdb/cockroach/storage.snapshot
73.61MB 5.56% 72.02% 73.61MB 5.56% github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
44.86MB 3.39% 75.41% 200.58MB 15.16% github.com/cockroachdb/cockroach/storage.entries
Notice the 4th column, which shows cumulative allocations: 500 MB were allocated by `storage.snapshot`.
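For reference, a heap profile like the one above can also be captured programmatically with the standard `runtime/pprof` package; a minimal sketch (diffing two such profiles taken a few seconds apart shows which call paths allocated in between):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile writes the current heap profile into a buffer.
// Comparing two profiles taken a few seconds apart highlights which
// call paths (e.g. RaftSnapshotData.Marshal) allocated in between.
func captureHeapProfile() (*bytes.Buffer, error) {
	runtime.GC() // get up-to-date allocation statistics
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Println(buf.Len() > 0)
}
```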
This corresponds to the following snapshots:
I160603 10:45:57.666859 storage/replica_raftstorage.go:524 generated snapshot for range 7 at index 47787 in 1.167403971s. encoded size=116916211, 131833 KV pairs, 47777 log entries
I160603 10:45:58.809946 storage/replica_raftstorage.go:524 generated snapshot for range 8 at index 23468 in 1.062962439s. encoded size=84632419, 132818 KV pairs, 23458 log entries
Notice the sizes of the snapshots: 111 and 80 MB. Because we don't truncate the Raft log when one of the replicas is down (#6012), I think the log can grow excessively large.
Pausing a node for 30s isn't long enough for the repair process to start, so these snapshots must be consequences of rebalancing. The rebalancer is moving ranges onto the node that is down because it doesn't know any better yet. This will be alleviated by both preemptive snapshots and capacity reservations. I wonder if it might also be worthwhile for the store pool to have an intermediate state between "up" and "down" during which time it won't try to repair the ranges on the node, but it also won't try to assign it new load.
This is a 3 node system, so rebalancing shouldn't be having an effect. I think the snapshots are caused because ranges are still being split while the node is down and when it comes back up it finds that it has several new ranges that it needs to replicate.
@cuongdo, @BramGruneir The rebalancer shouldn't be allocating ranges onto a node that hasn't responded recently.
Ooh, splits are a good theory. That's a case of snapshots that won't be helped by sending snapshots preemptively during replication (but it would be helped by end-to-end throttling). These snapshots are being generated and won't even be applied because the recipient needs to process the split first.
It's difficult to do anything about this (besides the aforementioned throttling) while raft is in control of the snapshot process. We'd need the same kind of modifications that we'd need for streaming snapshots.
cc @tamird
Figuring out end-to-end throttling of snapshots and/or streaming snapshots seems worthwhile.
What would you consider "recently"? With the reservation system, it won't try to replicate without a proper response, so this should be a positive side effect of adding that system.
To be clear, in a 3 node system, there shouldn't be any rebalancing. Just the upreplication of the first node/nodes.
When a split happens, if the node was only down for 30 seconds during the split, shouldn't it receive the log entries from before the split, including the split command itself?
@BramGruneir Yes, the reservation RPC should prevent allocation to a non-responsive node. Yes, with a 3 node system there shouldn't be any rebalancing. I'm not precisely sure what is happening when splits occur in a 3 node system, but I can definitely see from the logs that snapshots are being generated and applied when one of the nodes is taken down for ~30s and then brought back up.
If a range is present on all nodes, one node disappears and then the range splits, the new ranges will also live on all three nodes, so when the third node comes back up it will receive snapshots for the ranges it hasn't yet split (since it typically hasn't applied the Raft log for those splits yet on their left-hand side range, especially if the new ranges also have been split). One interesting case is that of the original range (which has been split) not being able to catch up the newly restarted node any more. In that case, the original range would be updated via snapshot, and only once that snapshot has landed the other snapshots (for the right-hand sides of the splits) should be able to land. I don't think we adequately test that situation.
This issue might have benefitted from @d4l3k's recent change #7399, at least for the performance drop (which was caused by round-robin meta lookups hitting the paused node).
@tamird the remaining issue here should be fixed by #7551. Feel free to merge with that issue.
Very good, closing in favour of #7551. 🔨 (that's the closest emoji to a gavel)
It is pretty easy to replicate excessive Go memory usage in a 3 node cluster by pausing / resuming one of the nodes. Basic setup was to create a 3 node cluster using `roachdemo`, apply load using `block_writer`, pause one of the nodes for ~30 sec and then resume it. You'll see a spike in Go Sys memory at this point. Need to perform more investigation to see exactly where that memory is being used and why.

Additionally, performance is stable while all the nodes are up. But with one node paused I see periodic drops to 0 ops/sec.