asynkron / protoactor-dotnet-contrib

Contributions for ProtoActor .NET
http://proto.actor
Apache License 2.0
3 stars 2 forks source link

Cluster member does not recover correctly in case of issues with Consul even after #488 #3

Open timurnes opened 4 years ago

timurnes commented 4 years ago

Hello. I was playing with Proto Cluster on the dev branch and I've found an issue in case of error coming from Consul on request to register service on re-registration step. In my case it was "no space left on device" and requests fail.

We have this code in ConsulClusterMonitor:

var task = context.Message switch
{
    RegisterMember cmd => Register(cmd, context),
    CheckStatus cmd => NotifyStatuses(cmd.Index, context.Self),
    DeregisterMember _ => UnregisterService(context),
    UpdateStatusValue cmd => RegisterService(cmd.StatusValue, context),
    ReregisterMember _ => RegisterService(_statusValue, context),
    Stopping _ => Stop(),
    _ => Task.CompletedTask
};
await task.ConfigureAwait(false); 

In case of unhandled exception in these methods (in switch) ConsulClusterMonitor actor can fail and will be recovered later. But we will have NullReferenceException on line 79, when ReregisterMember message is received: {"StatusValue", _statusValueSerializer.Serialize(statusValue)}, _statusValueSerializer can be null here because it is assigned in Register method. And Register method is used only once, when cluster member joins a cluster. So this actor fails again and again

Usually it is normally recovered, but sometimes I caught this exception.

Expected Behavior

ConsulClusterMonitor actor is correctly recovered after any issues during requests to Consul after successful registration in cluster

Actual Behavior

ConsulClusterMonitor actor recovered but sometimes throws an exception after ReregisterMember message on line 79

Steps to Reproduce the Problem

I really don't know how to reproduce issues with Consul, also I caught this issue only few times with these steps

  1. On the first request to update TTL Consul should answer with error "CheckID does not have associated TTL". So ConsulCheckin actor will poison itself and send ReregisterMember message to Monitor.
  2. Monitor is going to process re-registration, but on the request to RegisterService Consul returns error "no space left on device". Monitor actor fails.
  3. After Monitor actor recovers, on another re-registration step _statusValueSerializer can be null (not always, I don't know why) and NullReferenceException is thrown
alexeyzimarev commented 4 years ago

Sounds like a real edge-case :) To be honest, I never seen the status value being updated. Maybe someone from the founding fathers can explain what it is. It is also usually a singleton. It's possible to override it in cluster settings, yes, but do we expect to host like two clusters? I am not sure how to address this.