BrianGallew / cassandra_range_repair

Python script to repair the primary range of a node in N discrete steps
MIT License

Question regarding steps and workers #50

Closed: powellchristoph closed this issue 7 years ago

powellchristoph commented 8 years ago

Hello,

First off, thank you for keeping this tool maintained. It is appreciated.

I am confused about the number of steps versus the number of workers. While testing, I found that with the default 100 steps and a single worker, it takes a very long time to repair a small keyspace; I stopped it after 12 hours on a single node. How can I tweak these values to improve performance while still retaining the advantages of the ranged repair? What would be equivalent to Cassandra's "default" behavior? 15 steps with a single worker?
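For context, "steps" here means slicing a node's primary token range into that many contiguous subranges and repairing each one separately. The sketch below shows that slicing under stated assumptions: integer tokens, a non-wrapping range (start < end), and equal-width slices. `split_range` is a hypothetical helper for illustration, not the script's actual code.

```python
def split_range(start: int, end: int, steps: int) -> list[tuple[int, int]]:
    """Split the token range (start, end] into `steps` contiguous subranges.

    Hypothetical helper: assumes start < end (no wraparound past the
    maximum token) and integer tokens, as with Murmur3Partitioner.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    if start >= end:
        raise ValueError("this sketch does not handle wrapping ranges")
    width = end - start
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (width * i) // steps for i in range(steps + 1)]
    # Adjacent boundaries form the (start, end] pair for each step.
    return [(bounds[i], bounds[i + 1]) for i in range(steps)]


# Example: four steps over a toy range.
print(split_range(0, 1000, 4))
# [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

Each (start, end) pair could then be passed to `nodetool repair` through its `-st`/`-et` token options. More steps means smaller subranges, but also more repair sessions, each with its own fixed coordination overhead, which is one plausible reason 100 steps can be slow even on a small keyspace.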

I have 12 nodes in a single datacenter and on my largest keyspace it can take upwards of 2 hours to repair a single node. If you throw in the occasional failure, then it can take over 24 hours to repair one of my datacenters. I would like to improve on this.

powellchristoph commented 8 years ago

Also, if I am repairing multiple keyspaces on the same node with the ranged repair, are there overlapping ranges? Is there any way to optimize it?

BrianGallew commented 8 years ago

So, in theory, if you increase the number of workers by a factor of 10, you'll decrease the repair time by 90%. Unfortunately, it doesn't quite work that way, because none of this happens in a vacuum: other nodes are involved, and at some point you reach a place where, no matter how many repairs you're running, the other cluster members have no spare threads to dedicate to the repair. Even before that, the node you're working on will only run as many concurrent repairs as it has threads allocated for them.
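The scaling argument above can be illustrated with a deliberately simple model: steps run in "waves" whose width is capped by whichever is smaller, the worker count or the node's repair-thread capacity. This is a hypothetical sketch, not anything the script computes; `repair_threads` stands in for the node's concurrent repair capacity, and the model assumes every step takes the same time regardless of concurrency. That makes it optimistic, since it ignores the replica-side contention described above.

```python
import math


def estimated_repair_minutes(steps: int, minutes_per_step: int,
                             workers: int, repair_threads: int) -> int:
    """Rough wall-clock estimate for a stepped repair (hypothetical model).

    Assumes each step takes `minutes_per_step` no matter how many run at
    once, and that concurrency is capped by min(workers, repair_threads).
    """
    if min(steps, minutes_per_step, workers, repair_threads) < 1:
        raise ValueError("all arguments must be >= 1")
    # Extra workers beyond the node's repair capacity just wait.
    effective = min(workers, repair_threads)
    # Steps are processed in waves of `effective` concurrent repairs.
    waves = math.ceil(steps / effective)
    return waves * minutes_per_step


# 100 steps at 6 minutes each:
print(estimated_repair_minutes(100, 6, workers=1, repair_threads=4))    # 600
print(estimated_repair_minutes(100, 6, workers=10, repair_threads=4))   # 150
print(estimated_repair_minutes(100, 6, workers=10, repair_threads=16))  # 60
```

In this model, going from 1 to 10 workers gives the full 90% reduction only when the node can actually run 10 repairs at once; with capacity for 4, the same change yields 75%. Real clusters will do worse, because the other replicas must also find spare threads.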