basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

riak_core_handoff_sender {badrecord,ho_acc} #1010

Open Konstantin74R opened 11 months ago

Konstantin74R commented 11 months ago

Hi, please help. I can't find the reason for the broken handoff from the old node riak@172.16.3.92 to the new node riak@172.16.3.63.

In the log I see the error "failed because of error:{badrecord,ho_acc}", and it is unclear which node has the problem.

Versions: riak_kv 2.1.1, n_val = 3, 40 nodes.

console.log

2023-09-13 10:48:55.880 [info] <0.25322.2>@riak_core_handoff_sender:start_fold:153 Starting ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880
2023-09-13 10:57:13.707 [info] <0.1490.1>@riak_kv_index_hashtree:clear_tree:842 Clearing AAE tree: 625134489405210427903529254618878869735643021312
2023-09-13 10:57:14.659 [info] <0.30711.2>@riak_kv_index_hashtree:build_or_rehash:880 Starting AAE tree build: 625134489405210427903529254618878869735643021312
2023-09-13 11:48:20.321 [error] <0.4371.0> gen_server <0.4371.0> terminated with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296
2023-09-13 11:48:20.321 [error] <0.4371.0> CRASH REPORT Process <0.4371.0> with 0 neighbours exited with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in gen_server:terminate/6 line 744
2023-09-13 11:48:20.322 [error] <0.4368.0> Supervisor {<0.4368.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.4371.0> exit with reason no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in context child_terminated
2023-09-13 11:48:20.323 [error] <0.25322.2>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}]

error.log

2023-09-13 10:48:48.097 [error] <0.6963.0>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}]
2023-09-13 11:48:20.321 [error] <0.4371.0> gen_server <0.4371.0> terminated with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296
2023-09-13 11:48:20.321 [error] <0.4371.0> CRASH REPORT Process <0.4371.0> with 0 neighbours exited with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in gen_server:terminate/6 line 744
2023-09-13 11:48:20.322 [error] <0.4368.0> Supervisor {<0.4368.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.4371.0> exit with reason no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in context child_terminated
2023-09-13 11:48:20.323 [error] <0.25322.2>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}]
2023-09-13 12:16:23.882 [error] <0.170.0> Supervisor riak_core_handoff_sender_sup had child riak_core_handoff_sender started with {riak_core_handoff_sender,start_link,undefined} at <0.2844.4> exit with reason max_concurrency in context child_terminated

crash.log

2023-09-13 11:48:20 =CRASH REPORT====
  crasher:
    initial call: riak_core_vnode_worker:init/1
    pid: <0.4371.0>
    registered_name: []
    exception exit: {{function_clause,[{riak_core_handoff_sender,visit_item,[{<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,76,129,178,206,178,15,200,127,187,40,0,0,0,0>>},<<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,2,97,1,11,48,51,46,50,48,50,50,45,...>>,...],...},...]},...}
    ancestors: [<0.4368.0>,<0.4367.0>,<0.4366.0>,<0.1819.0>,riak_core_vnode_sup,riak_core_sup,<0.161.0>]
    messages: []
    links: [<0.4367.0>,<0.4368.0>,<0.93.0>]
    dictionary: [{bitcask_file_mod,bitcask_file},{bitcask_time_fudge,no_testing},{is_visit_item_timer_set,true}]
    trap_exit: false
    status: running
    heap_size: 318187
    stack_size: 27
    reductions: 37404224106
  neighbours:
2023-09-13 11:48:20 =SUPERVISOR REPORT====
  Supervisor: {<0.4368.0>,poolboy_sup}
  Context: child_terminated
  Reason: {function_clause,[{riak_core_handoff_sender,visit_item,[{<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,76,129,178,206,178,15,200,127,187,40,0,0,0,0>>},<<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,2,97,1,110,5,51,46,50,48,50,50,45,...>>,...],...},...]}
  Offender: [{pid,<0.4371.0>},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]

martinsumner commented 11 months ago

Hard to know - the code is running a fold over the vnode and sending the data to the remote vnode at the time ... but in the midst of the logs is a reference to {error,eio}, a generic UNIX I/O error. Is this a recurring issue, or might it have been a one-off HW or OS issue?
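
For illustration only, here is a minimal Erlang sketch (not the actual riak_core source; the module, record fields and data are invented) of how an {error,eio} returned during the backend fold can first hit visit_item as a function_clause and then surface in start_fold as {badrecord,ho_acc}:

    %% handoff_sketch.erl - illustration only, not the riak_core implementation.
    -module(handoff_sketch).
    -export([run/0]).

    %% The handoff sender threads a #ho_acc{} record through a fold over the
    %% vnode backend; this sketch reduces it to a single counter field.
    -record(ho_acc, {items = 0}).

    %% Like riak_core_handoff_sender:visit_item/3, the only clause expects a
    %% #ho_acc{} accumulator, so any other term is a function_clause error.
    visit_item(_Key, _Value, Acc = #ho_acc{items = N}) ->
        Acc#ho_acc{items = N + 1}.

    run() ->
        %% Pretend backend contents: the second "read" fails with {error,eio},
        %% e.g. a bad block under the backend's data file.
        Objects = [{<<"k1">>, <<"v1">>}, {error, eio}, {<<"k2">>, <<"v2">>}],
        Result = lists:foldl(
                   fun(_, {error, _} = Err)  -> Err;  % error already replaced the acc
                      ({error, _} = Err, _)  -> Err;  % backend read error leaks into the acc
                      ({K, V}, Acc)          -> visit_item(K, V, Acc)
                   end, #ho_acc{}, Objects),
        %% start_fold-style code then assumes the fold returned a #ho_acc{};
        %% reading a record field from {error,eio} raises {badrecord,ho_acc}
        %% (older OTP) or {badrecord,{error,eio}} (newer OTP).
        Result#ho_acc.items.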

Konstantin74R commented 11 months ago

CentOS Linux release 7.9.2009 (Core) 3.10.0-1160.81.1.el7.x86_64 #1 SMP Fri Dec 16 17:29:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

No HW or OS problems were detected.

It's a recurring problem, but only for one node and only for one partition. As a temporary fix I deleted this partition from the backends on this node, so that replication could restore the data. Joining the new node to the cluster has finished. No active transfers.

martinsumner commented 11 months ago

Perhaps there was a corrupted block on the disk within that vnode backend causing an IO error when the fold tried to read a particular part of the keyspace? Deleting the partition would then allow the handoff to complete.

You should be able to run partition repair to restore data with relative efficiency - https://docs.riak.com/riak/kv/latest/using/repair-recovery/repairs/index.html#repairing-a-single-partition.
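
For reference, the single-partition repair from that page is roughly: attach to a running node's Erlang shell with riak attach and call riak_kv_vnode:repair/1 with the partition ID; the ID below is the one from the failing transfer in the logs above, and the exact steps and expected output are on the linked page.

    %% From the Erlang shell of a cluster node (riak attach), per the linked
    %% repair docs; the argument is the partition ID from the failed transfer.
    riak_kv_vnode:repair(528081646301205155991565808696313200461616250880).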

Konstantin74R commented 11 months ago

I tried to start a repair, but the console returned a "change ownership" message and it had no effect. Maybe there was a corrupted block; I did not find any problems with the disk.