cockroachdb / cockroach

stability: local storage fills up on connection problems #4977

Closed: mberhault closed this issue 8 years ago

mberhault commented 8 years ago

Build sha: c6ebc42e0878227eb6676618a2d88d5098f435a8

In a scenario similar to #4925, we have node problems at the following times:

node1: SIGABRT due to pthread_create failure at W160308 01:23:45. node1.log.parse.txt.gz

node2: kept running endlessly; stopped manually. node2.log.parse.txt.gz

node3: OOM at W160308 01:15:16. node3.log.parse.txt.gz

node4: filled up /mnt/data (all 30GB of it) at W160308 08:12:18. node4.log.parse.txt.gz

Interestingly, node4 is the only one not in the list of DNS entries for beta.cockroachlabs.com, which is the hostname used by the photos app.

I'm running cockroach debug keys on the whole data directory and doing a basic count by prefix. Will update when done.
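
For reference, a rough sketch of this kind of per-prefix count, assuming the key dump prints one pretty-printed key per line (the fixed-length cut used for grouping is arbitrary and only approximates the logical key prefixes):

    # Dump all keys from the store and count them by a fixed-length prefix.
    # The 20-character cut is an illustrative stand-in for the logical
    # prefixes ("/Local/RangeID/...", "/Table/...").
    cockroach debug keys /mnt/data \
      | cut -c1-20 \
      | sort \
      | uniq -c \
      | sort -n \
      | tail -n 20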

mberhault commented 8 years ago

For reference, the prefixes on node4:/mnt/data with the largest number of keys under them are:

    732 "/Local/RangeID/7/u
    836 "/Table/52/1/\"
    915 "/Local/RangeID/8/r
   1145 "/Table/53/1/\"
   1325 "/Local/RangeID/5/u
  15819 "/Table/53/2/\"\"
  64787 "/Table/53/1/\"\"
  70586 "/Local/RangeID/7/r
 248790 "/Local/RangeID/1/u
457249476 "/Local/RangeID/4/u
bdarnell commented 8 years ago

/Local/RangeID/*/u is almost certainly the raft log (other things appear in that keyspace, but the raft log is the only one expected to grow to many keys). The rebalancer was moving things around quite a bit; node 4 was the last one to be re-added to range 4:

$ zgrep -h _REPLICA *|sort|grep 'range 4'
I160307 20:34:22.358123 storage/replica.go:1072  raft: proposing ADD_REPLICA {4 4 2} for range 4
I160307 20:34:22.369000 storage/replica.go:1072  raft: proposing ADD_REPLICA {3 3 3} for range 4
I160307 20:53:22.535040 storage/replica.go:1072  raft: proposing ADD_REPLICA {2 2 4} for range 4
I160307 20:53:22.550613 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {4 4 2} for range 4
I160307 20:55:12.549447 storage/replica.go:1072  raft: proposing ADD_REPLICA {4 4 5} for range 4
I160307 20:55:12.575113 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {1 1 1} for range 4
I160307 21:02:00.401559 storage/replica.go:1072  raft: proposing ADD_REPLICA {1 1 6} for range 4
I160307 21:02:00.417585 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {2 2 4} for range 4
I160307 21:17:18.141883 storage/replica.go:1072  raft: proposing ADD_REPLICA {2 2 7} for range 4
I160307 21:17:18.167182 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {3 3 3} for range 4
I160307 21:25:27.730692 storage/replica.go:1072  raft: proposing ADD_REPLICA {3 3 8} for range 4
I160307 21:25:27.745389 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {1 1 6} for range 4
I160307 21:25:27.747938 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {1 1 6} for range 4
I160307 21:48:33.313587 storage/replica.go:1072  raft: proposing ADD_REPLICA {1 1 9} for range 4
I160307 21:48:33.343425 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {3 3 8} for range 4
I160307 21:49:06.507886 storage/replica.go:1072  raft: proposing ADD_REPLICA {3 3 10} for range 4
I160307 21:49:06.530137 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {4 4 5} for range 4
I160307 21:57:25.885974 storage/replica.go:1072  raft: proposing ADD_REPLICA {4 4 11} for range 4
I160307 21:57:25.904065 storage/replica.go:1072  raft: proposing REMOVE_REPLICA {2 2 7} for range 4

Due to RocksDB write amplification, node 4 could be expected to take up more space than the others, since all of its data was recently written. What was disk usage like on the other three nodes? (Especially nodes 1 and 3; node 2 was not a member of the range at the end of the logs.) That is, did node 4 contain more log entries and other data than the others, or was that data just stored less efficiently because it had just been rewritten?
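
A quick way to gather that comparison (the node hostnames and the /mnt/data path here are assumptions; adjust for the actual cluster):

    # Hypothetical hostnames node1..node4 and store path /mnt/data.
    for n in 1 2 3 4; do
      ssh node$n 'hostname; du -sh /mnt/data'
    done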

bdarnell commented 8 years ago

From offline discussion: node 4 filled up its 30GB of disk while the other three nodes were using less than 1GB. So it's not just write amplification from the last applySnapshot.

The data has been deleted, but next time this happens it would be helpful to run cockroach debug raft-log $DATA_DIR $RANGE_ID. That will produce a ton of output, but the last few KB are probably the interesting part.
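
For example, something along these lines would keep just the tail for range 4 (the range discussed above; the 64KB cutoff and output filename are arbitrary):

    # Dump the raft log for range 4 and keep only the last ~64KB,
    # which is likely where the interesting entries are.
    cockroach debug raft-log /mnt/data 4 | tail -c 65536 > raft-log-r4-tail.txt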

tamird commented 8 years ago

Should we close this since the data has been deleted? Seems there's nothing actionable here right now.

mberhault commented 8 years ago

I think we should. It hasn't happened again since, and I'll file a new issue if it does.