Closed: petermattis closed this issue 8 years ago.
More detail: the memory spike happens when the paused node resumes.
From the log of the node with the memory explosion I see:
I160602 17:23:56.852051 storage/replica_raftstorage.go:524 generated snapshot for range 9 at index 10 in 515.102351ms. encoded size=34325773, 85770 KV pairs, 0 log entries
I160602 17:24:12.619729 storage/replica_raftstorage.go:524 generated snapshot for range 9 at index 4990 in 397.115011ms. encoded size=44918169, 95677 KV pairs, 4980 log entries
So we generated snapshot data that is 33 and 43 MB respectively. Note that this is the marshaled size; before marshaling we hold at least that much data in the unmarshaled `roachpb.RaftSnapshotData` structure. We then store the marshaled data in the `raft.Snapshot.Data` field, and the `raft.Snapshot` is placed in the `Snapshot` field of a `raft.Message`. I haven't followed the code further than that, but I'm counting at least 3x the data size in memory usage. Even though some of this memory is short-lived, it is unfortunate.
Following the thread a bit further, the `raft.Message` then gets placed inside a `roachpb.RaftMessageRequest`, which is eventually marshaled and sent via gRPC.
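To make the 3x estimate concrete, here is a minimal sketch of the copy chain described above, using plain byte slices in place of the actual protobuf types (the function names are illustrative, not CockroachDB APIs):

```go
package main

import "fmt"

// kvPairs simulates the unmarshaled RaftSnapshotData: one allocation
// per KV pair (copy #1 of the payload).
func kvPairs(n, size int) [][]byte {
	pairs := make([][]byte, n)
	for i := range pairs {
		pairs[i] = make([]byte, size)
	}
	return pairs
}

// marshal simulates flattening the pairs into a single encoded buffer,
// as RaftSnapshotData.Marshal does (copy #2).
func marshal(pairs [][]byte) []byte {
	var buf []byte
	for _, p := range pairs {
		buf = append(buf, p...)
	}
	return buf
}

// wrap simulates re-encoding the snapshot bytes inside an outer message
// (raft.Message -> RaftMessageRequest -> gRPC encode), which copies the
// payload once more (copy #3).
func wrap(data []byte) []byte {
	out := make([]byte, len(data))
	copy(out, data)
	return out
}

func main() {
	const payload = 1000 * 1024 // ~1 MB of KV data
	pairs := kvPairs(1000, 1024)
	encoded := marshal(pairs)
	sent := wrap(encoded)
	// All three copies are live at once while the message is in flight.
	total := payload + len(encoded) + len(sent)
	fmt.Println(total / payload) // ~3x the payload size
}
```

Running this prints `3`: even before gRPC's own buffering, the payload exists in (at least) three forms simultaneously.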
A memory profile over a 4 second period where 2 snapshots were generated shows:
192.22MB 14.53% 14.53% 192.22MB 14.53% github.com/cockroachdb/cockroach/roachpb.(*RaftSnapshotData).Marshal
186.26MB 14.08% 28.60% 186.26MB 14.08% github.com/cockroachdb/cockroach/storage/engine.(*MVCCMetadata).Unmarshal
150.72MB 11.39% 40.00% 296.44MB 22.40% google.golang.org/grpc.encode
145.71MB 11.01% 51.01% 145.71MB 11.01% github.com/cockroachdb/cockroach/storage.(*RaftMessageRequest).Marshal
111.05MB 8.39% 59.40% 111.05MB 8.39% github.com/cockroachdb/cockroach/storage/engine.ChunkAllocator.newChunk
93.39MB 7.06% 66.46% 500.31MB 37.81% github.com/cockroachdb/cockroach/storage.snapshot
73.61MB 5.56% 72.02% 73.61MB 5.56% github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
44.86MB 3.39% 75.41% 200.58MB 15.16% github.com/cockroachdb/cockroach/storage.entries
Notice the 4th column, which shows cumulative allocations: 500 MB were allocated by `storage.snapshot`.
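For reference, a heap profile like the one above can also be captured programmatically with the standard `runtime/pprof` package; a minimal sketch (diffing two such profiles taken a few seconds apart shows which call paths allocated in between):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile writes the current heap profile into a buffer.
// Comparing two profiles taken a few seconds apart highlights which
// call paths (e.g. RaftSnapshotData.Marshal) allocated in between.
func captureHeapProfile() (*bytes.Buffer, error) {
	runtime.GC() // get up-to-date allocation statistics
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := captureHeapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Println(buf.Len() > 0)
}
```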
This corresponds to the following snapshots:
I160603 10:45:57.666859 storage/replica_raftstorage.go:524 generated snapshot for range 7 at index 47787 in 1.167403971s. encoded size=116916211, 131833 KV pairs, 47777 log entries
I160603 10:45:58.809946 storage/replica_raftstorage.go:524 generated snapshot for range 8 at index 23468 in 1.062962439s. encoded size=84632419, 132818 KV pairs, 23458 log entries
Notice the sizes of the snapshots: 111 and 80 MB. Because we don't truncate the Raft log when one of the replicas is down (#6012), I think the log can grow excessively large.
Pausing a node for 30s isn't long enough for the repair process to start, so these snapshots must be consequences of rebalancing. The rebalancer is moving ranges onto the node that is down because it doesn't know any better yet. This will be alleviated by both preemptive snapshots and capacity reservations. I wonder if it might also be worthwhile for the store pool to have an intermediate state between "up" and "down" during which time it won't try to repair the ranges on the node, but it also won't try to assign it new load.
This is a 3 node system, so rebalancing shouldn't be having an effect. I think the snapshots are caused because ranges are still being split while the node is down and when it comes back up it finds that it has several new ranges that it needs to replicate.
@cuongdo, @BramGruneir The rebalancer shouldn't be allocating ranges onto a node that hasn't responded recently.
Ooh, splits are a good theory. That's a case of snapshots that won't be helped by sending snapshots preemptively during replication (but it would be helped by end-to-end throttling). These snapshots are being generated and won't even be applied because the recipient needs to process the split first.
It's difficult to do anything about this (besides the aforementioned throttling) while raft is in control of the snapshot process. We'd need the same kind of modifications that we'd need for streaming snapshots.
cc @tamird
Figuring out end-to-end throttling of snapshots and/or streaming snapshots seems worthwhile.
What would you consider "recently"? With the reservation system, it won't try to replicate without a proper response, so this should be a positive side effect of adding that system.
To be clear, in a 3 node system, there shouldn't be any rebalancing. Just the upreplication of the first node/nodes.
When a split happens, if the node was only down for 30 seconds during the split, shouldn't it receive the log entries from before the split, including the split command itself?
@BramGruneir Yes, the reservation RPC should prevent allocation to a non-responsive node. Yes, with a 3 node system there shouldn't be any rebalancing. I'm not precisely sure what is happening when splits occur in a 3 node system, but I can definitely see from the logs that snapshots are being generated and applied when one of the nodes is taken down for ~30s and then brought back up.
If a range is present on all nodes, one node disappears and then the range splits, the new ranges will also live on all three nodes, so when the third node comes back up it will receive snapshots for the ranges it hasn't yet split (since it typically hasn't applied the Raft log for those splits yet on their left-hand side range, especially if the new ranges also have been split). One interesting case is that of the original range (which has been split) not being able to catch up the newly restarted node any more. In that case, the original range would be updated via snapshot, and only once that snapshot has landed the other snapshots (for the right-hand sides of the splits) should be able to land. I don't think we adequately test that situation.
This issue might have benefitted from @d4l3k's recent change #7399, at least for the performance drop (which was caused by round-robin meta lookups hitting the paused node).
@tamird the remaining issue here should be fixed by #7551. Feel free to merge with that issue.
Very good, closing in favour of #7551. 🔨 (that's the closest emoji to a gavel)
It is pretty easy to replicate excessive Go memory usage in a 3 node cluster by pausing / resuming one of the nodes. Basic setup was to create a 3 node cluster using `roachdemo`, apply load using `block_writer`, pause one of the nodes for ~30 sec and then resume it. You'll see a spike in Go Sys memory at this point. Need to perform more investigation to see exactly where that memory is being used and why.

Additionally, performance is stable while all the nodes are up. But with one node paused I see periodic drops to 0 ops/sec.