I start with a horde of two nodes (A and B). I terminate B (by sending a SIGTERM to the Erlang node that hosts it). Then I start B up again (with the same Erlang node name). The restarted node begins displaying errors that look like this:
[error] Discarding message {delta,{<0.260.0>,<0.260.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[],{623221198,0}=>[],{653145801,0}=>[]},version=>2},maxima=>#{347099519=>0,623221198=>0,653145801=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>[],<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>[],<<"MQr4dFCEp2++AZuDqgLPrw==">>=>[]},version=>2},state=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{623221198,0}=>[]},version=>2}}}},<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>[]},version=>2},state=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[]},version=>2}}}},<<"MQr4dFCEp2++AZuDqgLPrw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{653145801,0}=>[]},version=>2}}}}}}},2} from <0.260.0> to <0.260.0> in an old incarnation (3) of this node (1)
My guess is that the CRDT running on A never got the memo that B disappeared, and is still trying to send it messages. The restarted B node is (rightfully) not accepting those messages.
I was able to verify that, in this case, Registry.terminate never got called, and so wasn't able to initiate graceful cleanup of its CRDTs. So if I make sure Registry's terminate callback gets called (see https://github.com/dazuma/horde/commit/811a351a301a5485601170cecbbaf205f7d06025) that seems to fix it. But that's probably not foolproof either; I'm sure there are ways to kill a node brutally and not give terminate a chance to execute.
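For anyone reading along, here is a minimal sketch of the general mechanism I believe is involved. It is not the contents of the linked commit. A GenServer that does not trap exits is killed outright when its supervisor sends it a `:shutdown` exit signal, so `terminate/2` never runs; setting `Process.flag(:trap_exit, true)` in `init/1` lets the callback run during a supervised shutdown. `CleanupDemo` is a hypothetical stand-in module, not Horde code, and the message it sends is only there to make the callback observable.

```elixir
defmodule CleanupDemo do
  # Hypothetical stand-in for a process that needs graceful cleanup.
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  @impl true
  def init(notify_pid) do
    # Without this flag, a :shutdown exit signal from the supervisor kills
    # the process immediately and terminate/2 is skipped.
    Process.flag(:trap_exit, true)
    {:ok, notify_pid}
  end

  @impl true
  def terminate(reason, notify_pid) do
    # Real cleanup would go here; the demo just reports that it ran.
    send(notify_pid, {:terminated, reason})
    :ok
  end
end

# Start the process under a supervisor, then stop the supervisor the way an
# orderly node shutdown would.
{:ok, sup} = Supervisor.start_link([{CleanupDemo, self()}], strategy: :one_for_one)
:ok = Supervisor.stop(sup)

reason =
  receive do
    {:terminated, r} -> r
  after
    1_000 -> :terminate_not_called
  end

IO.inspect(reason, label: "terminate/2 reason")
```

This only covers orderly shutdowns. A `kill -9`, a network partition, or a crashed VM still skips `terminate/2`, so the surviving node would need its own way to notice that a peer is gone.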
I'm running on master right now.