dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Stability issues on silos hosted in service fabric #3372

Closed: elqueffo closed this issue 7 years ago

elqueffo commented 7 years ago

We are having major issues with silos that we host in Service Fabric. The issues all seem to happen during restarts/rebalancing or rolling upgrades. Issues range from:

Right now we seem to be having an issue where the cluster thinks one of the nodes' ports is different from what it actually is. This happened when I scaled down the SF service instance count and some silos got relocated by SF. We are seeing a lot of exceptions about messages and reminders not being delivered because they target the wrong ip:port (10.0.1.8:20002, when the silo is actually running on 10.0.1.8:20001 right now). The MembershipOracle line in the log lists the correct port, though!

I am attaching the logs for our current situation and would appreciate some help in figuring out what's going on.

logs.zip
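
For what it's worth, here is a minimal sketch of dumping what the membership table currently records for each silo, so it can be compared against the ports each instance is actually listening on. It assumes the default Azure table-based membership (standard OrleansSiloInstances table) and the same connection string we use for Globals.DataConnectionString; the column layout is printed generically rather than guessed:

    using System;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Sketch only: dumps every row of the Orleans membership table so the registered
    // silo endpoints/generations can be compared with what Service Fabric says each
    // instance is actually listening on. Assumes the default Azure table-based
    // membership table name ("OrleansSiloInstances").
    public static class MembershipTableDump
    {
        public static async Task DumpAsync(string dataConnectionString)
        {
            var account = CloudStorageAccount.Parse(dataConnectionString);
            var table = account.CreateCloudTableClient().GetTableReference("OrleansSiloInstances");

            TableContinuationToken token = null;
            do
            {
                var segment = await table.ExecuteQuerySegmentedAsync(new TableQuery(), token);
                token = segment.ContinuationToken;

                foreach (var row in segment.Results)
                {
                    // The RowKey encodes the silo address; the remaining columns carry
                    // the silo's status, ports and generation.
                    Console.WriteLine(row.RowKey);
                    foreach (var property in row.Properties)
                    {
                        Console.WriteLine("  {0} = {1}", property.Key, property.Value.PropertyAsObject);
                    }
                }
            } while (token != null);
        }
    }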

ReubenBond commented 7 years ago

@elqueffo could you share some more information on how you're configuring the cluster? This log snippet is interesting: "An established connection was aborted by the software in your host machine." I'm not sure what would be causing that.

Would you be able to increase the verbosity for MembershipOracle and FabricServiceSiloResolver to Verbose2 for me?

What was occurring when these particular logs were captured - was it an upgrade?

Are these all of the logs? It seems like there should be more files; e.g., I cannot see where the silo named Backend_SiloHostService_1D322399D6B99BD is starting. That silo is on node 10.0.1.7 with generation 241866905. We only have the log file for the silo with the later generation, 241867604.

elqueffo commented 7 years ago

@ReubenBond Basically, we are trying to migrate our Service Fabric actor-based code to Orleans to gain a few needed features. Our configuration is more or less straight out of the sample and we only have 3 grain types. The biggest difference is that we hooked up Unity for DI, and we also had OrleansDashboard running (disabled now to eliminate it as a source of problems). Unfortunately I probably won't be able to untangle our Unity setup to test whether it is a source of problems.

    public ClusterConfiguration GetClusterConfiguration()
    {
        var config = new ClusterConfiguration();
        config.Globals.ReminderServiceType = GlobalConfiguration.ReminderServiceProviderType.AzureTable;
        #if DEBUG
        config.AddAzureBlobStorageProvider("SiloGrainState", "UseDevelopmentStorage=true");
        config.Globals.DataConnectionString = "UseDevelopmentStorage=true";
        #else
        config.AddAzureBlobStorageProvider("SiloGrainState", SecretProvider.Get("SiloGrainStateConnectionString"));
        config.Globals.DataConnectionString = SecretProvider.Get("SiloDataConnectionString");
        #endif
        config.Defaults.StartupTypeName = typeof(ClusterStartup).AssemblyQualifiedName;

        /* DASHBOARD DISABLED 
        config.Globals.RegisterDashboard(port: Context.CodePackageActivationContext.GetEndpoint("OrleansDashboardEndpoint").Port); 
        */

        // Set a lower collection age limit for inactive grains.
        config.Globals.Application.SetDefaultCollectionAgeLimit(TimeSpan.FromHours(1));
        config.Globals.Application.SetCollectionAgeLimit(typeof(BoxGrain), TimeSpan.FromMinutes(30));
        config.Globals.Application.SetCollectionAgeLimit(typeof(CustomerGrain), TimeSpan.FromMinutes(30));

        LogManager.LogConsumers.Add(new EventSourceLogger());
        return config;
    }

    protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners()
    {

        ClusterStartup.Service = this;
        return new[]
        {
            OrleansServiceListener.CreateStateless(this.GetClusterConfiguration())/* DASHBOARD DISABLED,
            new ServiceInstanceListener(context => new DummyHttpCommunicationListener("OrleansDashboardEndpoint", Context), "OrleansDashboardEndpoint")*/
        };
    }

public class ClusterStartup
{
    public static StatelessService Service { get; set; }

    public IServiceProvider ConfigureServices(IServiceCollection services)
    {
        services.AddServiceFabricSupport(Service);
        // For OrleansDashboard
        /* DASHBOARD DISABLED
        services.AddSingleton<ISiloDetailsProvider, SiloStatusOracleSiloDetailsProvider>();
        */

        var container = ((SiloHostService) Service).Container;
        container.Populate(services);

        return container.Resolve<IServiceProvider>();
    }
}

SiloHostService is a stateless service that is started from a coordinator service and scaled "manually" by the coordinator, which means the silo instance count doesn't always match the node count. We are running the silo service in ExclusiveProcess mode and we let SF allocate ports for the silos. We also tried static ports, but then we ran into the issues I mentioned with silos being marked Dead forever (basically until all the nodes restarted).
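
For reference, a minimal sketch of how the SF-assigned endpoints could be logged at silo startup so they can be lined up against what MembershipOracle reports. The endpoint names come from whatever the ServiceManifest declares, and ServiceEventSource here stands in for the template-generated event source (or any other logger):

    // Sketch only: log every endpoint resource SF assigned to this instance so the
    // ports can be compared with the silo/gateway ports in the membership logs.
    // GetEndpoints() enumerates the endpoint resources declared in ServiceManifest.xml.
    private void LogAssignedEndpoints()
    {
        foreach (var endpoint in Context.CodePackageActivationContext.GetEndpoints())
        {
            ServiceEventSource.Current.ServiceMessage(
                Context, "SF endpoint '{0}' -> port {1} ({2})",
                endpoint.Name, endpoint.Port, endpoint.Protocol);
        }
    }

Calling that from CreateServiceInstanceListeners should be enough to correlate the SF view with the membership table dump earlier in the thread.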

A third service scales the VM count up or down depending on load. We let service fabric redistribute services among the nodes as it sees fit.

I've attached more complete logs from today. You can probably just ignore the issues from earlier today that were caused by OrleansDashboard, like port collisions. logs2.zip

I added these overrides as well; I hope that's what you wanted:

        LogManager.AddTraceLevelOverride("MembershipOracle", Severity.Verbose2);
        LogManager.AddTraceLevelOverride("FabricServiceSiloResolver", Severity.Verbose2);

elqueffo commented 7 years ago

Forgot to answer your other question: it was during a scaling operation, but there was also an upgrade happening at some point in those logs because our CI deploys pretty often. It does feel like these issues tend to happen during any kind of start/move/restart of the silo process though, regardless of whether it's an upgrade or a scaling operation.

As for the connection aborted stuff, I have no idea. But it seems to me it's trying real hard to connect to the wrong port on that host. This is the current endpoint there:

{"name":"Backend_SiloHostService_1D3223B52DA780D","silo":"10.0.1.8:20001@241867635","gw":"10.0.1.8:20000@241867635"}

ReubenBond commented 7 years ago

Those are the correct overrides :) I need to sleep soon, but I'm looking into this. Feel free to PM me on Gitter: https://gitter.im/ReubenBond or via the main channel: https://gitter.im/dotnet/orleans

Are you sure those ports are incorrect? They're dynamically assigned by SF. I see some silos starting on 20001 and some on 20002 (I haven't looked thoroughly yet).

ReubenBond commented 7 years ago

@elqueffo by the way, I notice you're using ActivationCountBasedPlacement. Could you try this all using default placement (Random) and let me know if the experience differs?
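
For anyone following along, a minimal sketch of what switching one of the grain classes to random placement looks like via the per-class attribute. BoxGrain is taken from the config above; IBoxGrain and the grain body are placeholders:

    using Orleans;
    using Orleans.Placement;

    // Sketch only: replacing [ActivationCountBasedPlacement] with [RandomPlacement]
    // (or removing the placement attribute entirely, as long as the global default
    // placement hasn't been changed) puts the grain back on the default random
    // placement suggested above. IBoxGrain and the grain body are placeholders.
    [RandomPlacement]
    public class BoxGrain : Grain, IBoxGrain
    {
        // ... existing grain implementation unchanged ...
    }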

elqueffo commented 7 years ago

That paste shows the ports being reported by the SF interface on 10.0.1.8; on 10.0.1.6 it is on 20002. I will change to random placement and see what happens.

elqueffo commented 7 years ago

Most of our issues have been mitigated by:

We are still having a small problem with rolling updates and some race condition with grain handoffs, but we can close this issue and I'll open a new one when we figure out more.