derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Registry might not be cleaning up its CRDTs on termination #19

Closed: dazuma closed this issue 6 years ago

dazuma commented 6 years ago

I'm running on master right now.

I start with a horde of two nodes (A and B). I terminate B (by sending a SIGTERM to the Erlang node that hosts it). Then I start B up again (with the same Erlang node name). The restarted node begins displaying errors that look like this:

[error] Discarding message {delta,{<0.260.0>,<0.260.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[],{623221198,0}=>[],{653145801,0}=>[]},version=>2},maxima=>#{347099519=>0,623221198=>0,653145801=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>[],<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>[],<<"MQr4dFCEp2++AZuDqgLPrw==">>=>[]},version=>2},state=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{623221198,0}=>[]},version=>2}}}},<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>[]},version=>2},state=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[]},version=>2}}}},<<"MQr4dFCEp2++AZuDqgLPrw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{653145801,0}=>[]},version=>2}}}}}}},2} from <0.260.0> to <0.260.0> in an old incarnation (3) of this node (1)

My guess is that the CRDT running on A never got the memo that B disappeared, and is still trying to send it messages. The restarted B node is (rightfully) not accepting those messages.
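For illustration only, here is a minimal sketch of how a process on A could be told that B went away, using the standard `:net_kernel.monitor_nodes/1` subscription. `NodeWatcher` and `down_nodes/1` are hypothetical names made up for this example; this is not how Horde or DeltaCrdt actually tracks its neighbours.

```elixir
defmodule NodeWatcher do
  # Hypothetical helper: remembers which nodes have been reported down.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Returns the list of nodes currently believed to be down.
  def down_nodes(pid), do: GenServer.call(pid, :down_nodes)

  @impl true
  def init(:ok) do
    # Ask the runtime to send this process {:nodeup, node} and
    # {:nodedown, node} messages. The return value is ignored here
    # because it differs when the node is not running distributed.
    _ = :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new()}
  end

  @impl true
  def handle_call(:down_nodes, _from, down) do
    {:reply, MapSet.to_list(down), down}
  end

  @impl true
  def handle_info({:nodedown, node}, down) do
    # A real implementation would stop sending deltas to processes on `node`.
    {:noreply, MapSet.put(down, node)}
  end

  def handle_info({:nodeup, node}, down) do
    {:noreply, MapSet.delete(down, node)}
  end

  def handle_info(_other, down), do: {:noreply, down}
end
```

A `:nodedown` message is delivered when the connection to a node is lost, so this kind of subscription does not depend on the departing node running any cleanup of its own.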

I was able to verify that, in this case, Registry.terminate was never called, so the Registry never had a chance to initiate graceful cleanup of its CRDTs. If I make sure Registry's terminate callback gets called (see https://github.com/dazuma/horde/commit/811a351a301a5485601170cecbbaf205f7d06025), that seems to fix it. But that's probably not foolproof either; I'm sure there are ways to kill a node brutally without giving terminate a chance to execute.
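For context, a common reason a GenServer's `terminate/2` is skipped during shutdown is that the process does not trap exits: the supervisor's `:shutdown` exit signal then kills it directly. The sketch below shows that general OTP behaviour with a hypothetical `CleanupDemo` module; it is an assumption about the kind of change involved, not a description of what the linked commit does.

```elixir
defmodule CleanupDemo do
  # Hypothetical stand-in for a process that must clean up on shutdown.
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  @impl true
  def init(notify_pid) do
    # Trap exits so the supervisor's :shutdown signal arrives as a message
    # and GenServer invokes terminate/2 before the process exits.
    Process.flag(:trap_exit, true)
    {:ok, notify_pid}
  end

  @impl true
  def terminate(reason, notify_pid) do
    # Stand-in for graceful cleanup (e.g. telling peers we are leaving).
    send(notify_pid, {:cleaned_up, reason})
    :ok
  end
end

# Usage: stopping the supervisor triggers terminate/2 in the child.
{:ok, sup} = Supervisor.start_link([{CleanupDemo, self()}], strategy: :one_for_one)
:ok = Supervisor.stop(sup)
```

With the `Process.flag/2` line removed, the same shutdown would end the process without sending the `{:cleaned_up, reason}` message. A SIGKILL of the whole VM bypasses `terminate/2` either way, which is why this alone cannot be foolproof.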

dazuma commented 6 years ago

Ah, I just realized this is effectively the same issue as #16. Sorry for the noise.