BrianGallew / cassandra_range_repair

python script to repair the primary range of a node in N discrete steps
MIT License

Multi-DC repair giving errors about imprecise repairs #51

Open TvdW opened 7 years ago

TvdW commented 7 years ago

command I used: python range_repair.py -H 127.0.0.1 -s 1 --datacenter DC2

$ nodetool ring | grep -B1 $(facter ipaddress) | tail -n 2
10.2.0.1   R1          Up     Normal  15.57 GiB       ?                   9099366847329376090
10.2.0.2  R2         Up     Normal  14.53 GiB       ?                   9124888514323768492

$ nodetool repair -st 9099366847329376090 -et 9124888514323768492 -pr system_auth
[2016-09-27 20:56:31,822] Starting repair command #4071, repairing keyspace system_auth with repair options (parallelism: parallel, primary range: true, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 1)
[2016-09-27 20:56:31,884] Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
[2016-09-27 20:56:31,885] null

system_auth is: CREATE KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '8', 'DC2': '8'} AND durable_writes = true;

The two tokens (9099366847329376090, 9124888514323768492) are also the ones used by range-repair. Both tokens belong to DC2, but a DC1 token, 9108060243154565075, sits between them. When I run two separate nodetool repair commands, one for (9099366847329376090, 9108060243154565075] and one for (9108060243154565075, 9124888514323768492], both work fine. The repair fails only when the two ranges are merged into one.
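
To make the situation above concrete, here is a minimal Python sketch that checks whether a repair range contains tokens owned by another datacenter. The function names (`in_range`, `foreign_tokens_inside`) are hypothetical and not part of range_repair.py, and the ring size of 2**64 assumes the Murmur3 partitioner.

```python
RING_SIZE = 2 ** 64  # assumption: Murmur3Partitioner token space


def in_range(token, start, end):
    """Return True if token lies in the ring range (start, end]."""
    if start < end:
        return start < token <= end
    # The range wraps around the end of the ring.
    return token > start or token <= end


def foreign_tokens_inside(start, end, other_dc_tokens):
    """Return other-DC tokens strictly inside (start, end), in ring order."""
    inside = [t for t in other_dc_tokens
              if t != end and in_range(t, start, end)]
    # Sort by distance from start so that wrapped ranges stay in ring order.
    return sorted(inside, key=lambda t: (t - start) % RING_SIZE)


if __name__ == "__main__":
    # Tokens taken from the report above; 1 is an unrelated token.
    print(foreign_tokens_inside(9099366847329376090,
                                9124888514323768492,
                                [9108060243154565075, 1]))
```

A non-empty result means the requested range spans more than one local range, which appears to be the condition that nodetool rejects as an imprecise repair.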

Ironically, not passing --datacenter to the script allows repairs to complete.

TvdW commented 7 years ago

Chatted a bit on IRC, my conclusion:

k, knowing all this, I'd say the fix should be to split all ranges according to the tokens held by _other_ DCs. Keep all the current logic, but for every determined range, do one more split. That'll solve at least the problem that made me file a ticket.
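
The extra split proposed above could look roughly like the following Python sketch. It is an illustration only, not the script's actual code: `split_range` is a hypothetical helper, and the ring size of 2**64 assumes the Murmur3 partitioner.

```python
RING_SIZE = 2 ** 64  # assumption: Murmur3Partitioner token space


def split_range(start, end, other_dc_tokens):
    """Split (start, end] at every other-DC token that falls inside it.

    Returns a list of (start_token, end_token) pairs covering the same
    span, each free of intervening tokens from other datacenters.
    """
    def distance(token):
        # Clockwise distance from start; also handles wrapped ranges.
        return (token - start) % RING_SIZE

    span = distance(end)
    cuts = sorted(
        (t for t in set(other_dc_tokens) if 0 < distance(t) < span),
        key=distance,
    )
    bounds = [start] + cuts + [end]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Tokens taken from the original report; -5 lies outside the range.
    for st, et in split_range(9099366847329376090,
                              9124888514323768492,
                              [9108060243154565075, -5]):
        print(st, et)
```

Each resulting pair could then be passed to nodetool repair as -st and -et, matching the two individual commands that were reported to work.
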
subvillion commented 6 years ago

Multi-DC topology with RF3, DC1 (3 nodes) and DC2 (3 nodes). @TvdW I have a similar error. It is not clear what this means.

BrianGallew commented 6 years ago

Thanks for figuring this out. I'll look it over in a bit.

vladovav commented 5 years ago

I see the same errors on a multi-DC cluster (4 + 4 nodes) when the DC name parameter is provided. Repairs complete fine without the DC name. (C* 2.2.12, using vnodes)

floco06 commented 5 years ago

Hi there,

We are trying to implement your script in our internal scheduling tool for repairs, and to use it for our future upgrade to Cassandra 3.11.2.

I'm getting the same imprecise-repair errors with the --datacenter option, whereas I get no errors when it is not specified.

range_repair.py -v -s 10 -D b -k XX -c XX
INFO 2018-11-07 16:00:55,901 get_local_nodes line: 123 : Local nodes: X
INFO 2018-11-07 16:01:00,172 get_ring_tokens line: 166 : Found 1536 tokens
INFO 2018-11-07 16:01:00,181 repair line: 578 : [1/256] repairing range (+xxxx, -xxxxxx) in 10 steps for keyspace X
WARNING 2018-11-07 16:01:01,361 call line: 62 : Execution failed.
WARNING 2018-11-07 16:01:01,362 call line: 72 : Giving up execution. Failed too many times.
ERROR 2018-11-07 16:01:01,362 _repair_range line: 507 : FAILED: 1/256 step 0001 nodetool -h nodeX -p 7199 repair KS CF -pr -full -st +xxxx -et +xxxxx
ERROR 2018-11-07 16:01:01,362 _repair_range line: 508 : error: Repair job has failed with the error message: Repair command #204801 failed with error Requested range (xxxxxxx] intersects a local range but is not fully contained in one; this would lead to imprecise repair. keyspace: xxxxxx

Have you looked into it since this issue was opened?

If so, should we use the --datacenter option with the local datacenter of the node we are repairing, or leave it out?

Thanks in advance for your answer, Florian.