dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans

Grain not fully collected on silo crash #8171

Closed: NQ-Brewir closed this issue 1 year ago

NQ-Brewir commented 1 year ago

Hello,

I'm running Orleans v3.6.2 as a cluster for a production application, and I'm experiencing some crashes (cf. https://github.com/dotnet/runtime/issues/72365, not the bug we are discussing here). I also have a singleton grain (always called with Guid.Empty) with a running timer that logs some data. In the OnDeactivateAsync of this grain, the timer is disposed.
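
For reference, the pattern looks roughly like this (a simplified sketch with made-up names, not the actual production code):

    using System;
    using System.Threading.Tasks;
    using Orleans;

    // Sketch of the setup described above (Orleans 3.x API).
    public interface IStatsGrain : IGrainWithGuidKey
    {
        Task Ping();
    }

    public class StatsGrain : Grain, IStatsGrain
    {
        private IDisposable _timer;

        public override Task OnActivateAsync()
        {
            // Periodically log some data; timer ticks do not count as grain activity.
            _timer = RegisterTimer(_ => { /* log some data */ return Task.CompletedTask; },
                null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
            return base.OnActivateAsync();
        }

        public override Task OnDeactivateAsync()
        {
            _timer?.Dispose(); // only runs on graceful deactivation, never on a crash
            return base.OnDeactivateAsync();
        }

        public Task Ping() => Task.CompletedTask;
    }

    // All callers use the same key, so there is logically a single grain:
    // var grain = grainFactory.GetGrain<IStatsGrain>(Guid.Empty);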

Bug observation: after some (but not all) silo crashes, I see logs coming from 2 instances of the grain, one of them with wrong values (not updated by calls) and the other with the correct values. Expected: logs from only one source (as it is a singleton grain always called with Guid.Empty).

I suspect that the portion of the grain directory holding the original grain's registration was on the silo that crashed; however, I find it strange that:

Regards

nkosi23 commented 1 year ago

@NQ-Brewir hello! Regarding OnDeactivate, the documentation warns that this method might not be called when a silo crashes, so we should not assume that it will always run. We are even advised not to put critical business logic there (such as persisting the grain only in OnDeactivate on the assumption that it will always be called). There is no guarantee that OnDeactivate will run: grains are orchestrated by silos, and if the silo crashes we are in undefined territory, since any component could potentially be broken. While a cluster as a whole is resilient, individual silos are not necessarily recoverable.
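
To make this concrete, here is a sketch of the safer pattern (a hypothetical grain, assuming a default storage provider is configured): persist state whenever it changes instead of only in OnDeactivate.

    using System.Threading.Tasks;
    using Orleans;

    public interface ICounterGrain : IGrainWithGuidKey
    {
        Task Increment();
    }

    public class CounterState
    {
        public int Value { get; set; }
    }

    // Sketch: write state on every change, never only in OnDeactivateAsync,
    // because OnDeactivateAsync is skipped entirely when a silo crashes.
    public class CounterGrain : Grain<CounterState>, ICounterGrain
    {
        public async Task Increment()
        {
            State.Value++;
            await WriteStateAsync(); // durable even if the silo dies right after
        }
    }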

You should also expect that there may be multiple grain activations running concurrently for brief periods of time, particularly when the cluster is not stable. The challenge is distinguishing between a silo that is merely slow to respond, one that is temporarily unavailable, and one that is permanently gone; while Orleans does a great job here, the nature of distributed systems (and of the problem itself) means the answer can only be eventually consistent. Once again, the problematic situations are crashes, since during graceful shutdowns silos can notify the cluster that they are leaving.

I would not be surprised that the grain is not collected: how could the GC collect it while a timer is still running? From the perspective of garbage collection the object is still in use. Timers are managed by grain activations, while reminders are managed by silos (reminders are transparently sharded across the cluster for scalability).
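
To illustrate the difference (a sketch with invented names; reminders require a reminder service to be configured and have a minimum period of one minute):

    using System;
    using System.Threading.Tasks;
    using Orleans;
    using Orleans.Runtime;

    public interface IJobGrain : IGrainWithGuidKey
    {
        Task Start();
    }

    public class JobGrain : Grain, IJobGrain, IRemindable
    {
        private IDisposable _timer;

        public async Task Start()
        {
            // Per-activation: the timer lives and dies with this activation.
            _timer = RegisterTimer(_ => DoWorkAsync(), null,
                TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));

            // Cluster-wide: the reminder is persisted by the silo-managed reminder
            // service and fires against whichever activation exists at that time.
            await RegisterOrUpdateReminder("job", TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(5));
        }

        public Task ReceiveReminder(string reminderName, TickStatus status) => DoWorkAsync();

        private Task DoWorkAsync() => Task.CompletedTask; // placeholder work
    }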

Finally, I do not know whether crashed silos can be recovered. Generally speaking, if the host of a runtime crashes, the safest course of action is to terminate and restart the host process as soon as possible, since you cannot know what the crash has broken. For example, if the operating system of your computer reported kernel panics, would you carry on or reboot as soon as possible? If you carry on, you risk seeing your applications break in unexpected and potentially dangerous ways.

Killing the host (the silo in this case) is particularly safe and convenient with Orleans, since the framework was designed with this scenario in mind (grain location is transparent, and silos may join and leave a cluster at any time). I would therefore simply terminate the process after logging the exception and start a new one (I use supervisord to automate restarts, but other tools such as Kubernetes have similar capabilities).
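
A minimal sketch of the fail-fast setup I mean (assuming an external supervisor such as supervisord, systemd, or Kubernetes restarts the process):

    using System;

    public static class FailFast
    {
        // Call once at startup, before building the host: crash loudly on any
        // unhandled exception and let the supervisor start a fresh process.
        public static void Install()
        {
            AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
            {
                Console.Error.WriteLine($"Fatal: {args.ExceptionObject}");
                Environment.FailFast("Unrecoverable silo fault; exiting for restart.");
            };
        }
    }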

SebastianStehle commented 1 year ago

Great answer

NQ-Brewir commented 1 year ago

@nkosi23 Thanks for your answer.

The problem is that I have multiple grain activations, and I don't consider a few days to be a brief period of time. The only way I have to know that the timer should be disabled is OnDeactivate, and it is not called on the first activation.

The silo not being recoverable is not a problem, and I don't do anything critical in OnDeactivate, only timer cleanup.

The only problem here, again, is having multiple activations of the same grain (while the cluster seems to know about only one of them). It would be great to have a way to collect all dangling grains.
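
For example, something like IManagementGrain.ForceActivationCollection may be close to what I have in mind, though I don't know whether it reaches dangling activations (a sketch, assuming an IGrainFactory in scope):

    using System;
    using System.Threading.Tasks;
    using Orleans;
    using Orleans.Runtime;

    public static class Maintenance
    {
        // Sketch: ask the cluster to collect activations idle for at least the
        // given age. Timer ticks do not count as activity, so an activation kept
        // alive only by its timer should be considered idle.
        public static async Task CollectIdleActivationsAsync(IGrainFactory grainFactory)
        {
            var mgmt = grainFactory.GetGrain<IManagementGrain>(0);
            await mgmt.ForceActivationCollection(TimeSpan.FromMinutes(10));
        }
    }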

nkosi23 commented 1 year ago

@NQ-Brewir you need to add logic to your application ensuring that when a silo crashes, the application process is terminated immediately. It seems to me that keeping your silos running after a crash is likely the actual problem that needs to be fixed. If that is indeed what you are doing, just exit the application after logging the exception received by the silo and use an external process to restart it.

Otherwise I am not sure what you expect from Orleans. Maybe you are trying to recycle the process by catching the exception and starting a new host without exiting the application? In that case it is possible that some Dispose() calls are missing in Orleans, but before that, the problem may simply be that your host is not wrapped in a using statement.

However, I am not even sure this approach can work with Orleans: a reference to the timer is needed to dispose it, yet the Orleans framework does not know about your timer and therefore cannot dispose it automatically when an exception is thrown at the silo level. For this reason alone, I'd say the only option you have is to exit the application and restart it. And I would not be surprised if timers are not the only components making this tradeoff: expecting users not to recycle a host process is a wise and reasonable assumption that can simplify things here and there, while hardly anyone really needs to reuse the same process.

Generally speaking, you would only be buying yourself trouble with this kind of recycling, because these code paths are almost never exercised (or needed) and can be quite hard to test. It is much easier and cleaner to start fresh when a host crashes, whether you are using Orleans or most other technologies (the exception would be platforms like Elixir/Erlang).

NQ-Brewir commented 1 year ago

@nkosi23 When a silo crashes, it crashes: nothing running on it continues. It is then restarted. But I don't expect to have to restart the full cluster because of a silo crash.

My understanding of the problem is that when said silo crashed, the portion of the grain directory (which tells the cluster where each grain lives) that referenced the grain I'm talking about was destroyed. The grain is still activated (as it was not on the silo that crashed), but dangling, since it is no longer referenced in the directory.

What I expect from Orleans at this point is to deactivate this dangling grain (and the others in the same situation that may not be visible). Or, even better, to prevent these grains from becoming dangling when a portion of the directory is destroyed, by adding more replication.

nkosi23 commented 1 year ago

Thanks, now I understand the problem better; it sounded like you kept using the silo that crashed. If the last activation is the only one receiving messages, I doubt the problem comes from the grain directory. As far as activation garbage collection is concerned, this page provides details on what is taken into account:

The following counts as being active:

  • Receiving a method call.
  • Receiving a reminder.
  • Receiving an event via streaming.

And the following does not count as being active:

  • Performing a call (to another grain or an Orleans client).
  • Timer events.
  • Arbitrary IO operations or external calls not involving the Orleans framework.

The page also explains that the deactivation process is managed by each individual silo (there is no central authority). Therefore, the fact that a stale entry may still be in the grain directory is not important, since only one activation is currently receiving the calls (and as such the directory is behaving correctly).

My guess is that the old activation was properly deactivated and collected from the perspective of Orleans because it was idle (otherwise it would still be receiving the messages), but that your timer code wasn't fully disposed. To confirm this, you could cause the grain to be deactivated and activated multiple times in a row (by setting a very low idle time) and see whether activations indeed fail to be collected and your logs keep getting populated by all the previous activations.
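
Concretely, I mean something like this in the silo configuration (a sketch; pick values that suit your test):

    using System;
    using Orleans.Configuration;

    // Sketch: shorten the idle collection age (the default is two hours) so
    // activations are deactivated and re-activated frequently during the test.
    hostBuilder.Configure<GrainCollectionOptions>(options =>
    {
        options.CollectionAge = TimeSpan.FromMinutes(2);     // idle time before collection
        options.CollectionQuantum = TimeSpan.FromMinutes(1); // how often the collector scans
    });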

If this is confirmed, the first suspect would be your timer code; maybe it is complex and something is not properly disposed. Would you be able to post it here? If the timer code turns out not to be the problem, this could be a bug, since you are using a fairly old version of Orleans. If you post your code, maybe other people will have more ideas.

NQ-Brewir commented 1 year ago

The timers are registered by calling this function:

        public void NQRegisterTimer(string name, Func<object?, Task> asyncCallback, object? state, TimeSpan dueTime, TimeSpan period)
        {
            // Replace any existing timer registered under the same name.
            if (_timers.TryGetValue(name, out var t))
                t.Dispose();
            _timers[name] = base.RegisterTimer(asyncCallback, state, dueTime, period);
        }

and should be disposed in:

        public override Task OnDeactivateAsync()
        {
            // Dispose every registered timer on graceful deactivation.
            foreach (var timer in _timers.Values)
                timer.Dispose();
            _timers.Clear();

            return base.OnDeactivateAsync();
        }

This is why I had the impression that some grains were not properly deactivated, even after a few days with only the timer running.
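
If it helps, I can add lifecycle logging along these lines to make duplicate activations visible (a sketch; _logger would be injected via the constructor):

    // Sketch: log both ends of the activation lifecycle, including the
    // activation identity, so duplicate activations show up in the logs.
    public override Task OnActivateAsync()
    {
        _logger.LogInformation("Activated {Identity}", IdentityString);
        return base.OnActivateAsync();
    }

    public override Task OnDeactivateAsync()
    {
        _logger.LogInformation("Deactivating {Identity}", IdentityString);
        foreach (var timer in _timers.Values)
            timer.Dispose();
        _timers.Clear();
        return base.OnDeactivateAsync();
    }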

nkosi23 commented 1 year ago

Can you please also post the code responsible for starting your silos?

NQ-Brewir commented 1 year ago

Here it is:

            return builder.UseOrleans((ctx, hostBuilder) =>
            {
                hostBuilder.AddGrainRouter();

                var orleansConfig = Config.Instance.orleans;

                if (orleansConfig.local_silo)
                {
                    hostBuilder
                        .UseDevelopmentClustering(primarySiloEndpoint: null);
                }
                else
                {
                    hostBuilder.UseAdoNetClustering(options =>
                    {
                        options.ConnectionString = pgConnString;
                        options.Invariant = "Npgsql";
                    });
                }

                if (orleansConfig.enable_reminders)
                {
                    hostBuilder.UseAdoNetReminderService(options =>
                    {
                        options.ConnectionString = pgConnString;
                        options.Invariant = "Npgsql";
                    });
                }

                var telemetry_key = orleansConfig.telemetry_key;
                if (!string.IsNullOrWhiteSpace(telemetry_key))
                    hostBuilder.AddApplicationInsightsTelemetryConsumer(telemetry_key);

                var grainassembly = typeof(BarterGrain).Assembly;
                var interfaces = orleansConfig.interfaces;
                if (interfaces.Count != 0)
                {
                    hostBuilder.Configure<GrainClassOptions>(o =>
                    {
                        foreach (Type type in grainassembly.GetTypes())
                        {
                            if (typeof(IGrainWithIntegerKey).IsAssignableFrom(type)
                                || typeof(IGrainWithStringKey).IsAssignableFrom(type))
                            {
                                Console.WriteLine($"Found grain type {type.FullName}.");
                                var keep = false;
                                foreach (Type intf in type.GetInterfaces())
                                {
                                    if (intf.FullName.StartsWith("NQ.Interfaces"))
                                        Console.WriteLine($"   implementing {intf.FullName}");
                                    if (interfaces.Contains(intf.FullName)
                                        || interfaces.Contains(intf.Name))
                                        keep = true;
                                }
                                if (!keep)
                                    o.ExcludedGrainTypes.Add(type.FullName);
                                Console.WriteLine("   " + (keep ? "KEEP" : "EXCLUDE"));
                            }
                        }
                    });
                }

                hostBuilder
                    .AddIncomingGrainCallFilter<ExceptionCallFilter>()
                    .AddIncomingGrainCallFilter<CallFilter>()
                    .AddOutgoingGrainCallFilter<StatsCallFilter>()
                    .NQConfigureSiloTelemetry();

                hostBuilder
                    .UseTransactions()
                    .Configure<EndpointOptions>(opt =>
                    {
                        if (orleansConfig.local_silo)
                        {
                            var siloAddress = orleansConfig.local_silo_address;
                            opt.AdvertisedIPAddress = IPAddress.Parse(siloAddress);
                        }
                        else
                        {
                            opt.AdvertisedIPAddress = Dns.GetHostAddresses(Dns.GetHostName())
                                .Where(addr => addr.AddressFamily == AddressFamily.InterNetwork && !IPAddress.IsLoopback(addr)).First();
                        }

                        opt.SiloPort = orleansConfig.silo_port;
                        opt.GatewayPort = orleansConfig.gateway_port;
                    })
                    .UseDashboard(options =>
                    {
                        var dashboardPort = orleansConfig.dashboard_port;
                        if (dashboardPort != default)
                        {
                            options.Port = dashboardPort;
                        }
                        options.CounterUpdateIntervalMs = (int)orleansConfig.dashboard_update_interval.TotalMilliseconds;
                    })
                    .UseLinuxEnvironmentStatistics()
                    .Configure<ClusterOptions>(options =>
                    {
                        options.ClusterId = orleansConfig.cluster_id;
                        options.ServiceId = orleansConfig.service_id;
                    })
                    .Configure<GrainCollectionOptions>(options =>
                    {
                        options.CollectionAge = orleansConfig.grain_collection_age;
                    })
                    .ConfigureApplicationParts(parts => parts.AddApplicationPart(typeof(Program).Assembly).WithReferences())
                    .AddRedisGrainStorage(InternalPubSubConfig.SessionStorage, options =>
                    {
                        var settings = Config.Instance.redis_orleans_session;
                        options.ConnectionString = $"{settings.host}:{settings.port}";
                        options.DatabaseNumber = settings.db;
                        options.UseJson = false; // no need, as the data is deleted on each restart
                        options.DeleteOnClear = true;
                    })
                    .AddAdoNetGrainStorage(InternalPubSubConfig.StorageName, options =>
                    {
                        options.Invariant = "Npgsql";
                        options.UseJsonFormat = true;
                        options.ConnectionString = pgConnString;
                    })
                    .AddAdoNetGrainStorageAsDefault(options =>
                    {
                        options.Invariant = "Npgsql";
                        options.UseJsonFormat = true;
                        options.ConnectionString = pgConnString;
                        /* Postgresql was re-ordering some JSON properties, causing issues when trying to de-serialize.
                         * The use of MetadataPropertyHandling.ReadAhead was suggested at (https://github.com/dotnet/orleans/issues/5686).
                         * It will force the metadata to be parsed before the rest. The drawback is that it will impact performance (https://www.newtonsoft.com/json/help/html/T_Newtonsoft_Json_MetadataPropertyHandling.htm).
                        */
                        options.ConfigureJsonSerializerSettings = delegate (JsonSerializerSettings jsonSettings)
                        {
                            jsonSettings.MetadataPropertyHandling = MetadataPropertyHandling.ReadAhead;
                            // TODO:
                            // If grain names or namespaces change for data that was already serialized, the serializer can no longer find the type.
                            // This line fixes the issue, but we need to ensure this is the kind of behaviour we want to keep.
                            jsonSettings.TypeNameHandling = TypeNameHandling.None;
                        };
                    })
                    .AddSimpleMessageStreamProvider(InternalPubSubConfig.ProviderName, options =>
                    {
                        options.FireAndForgetDelivery = true;
                        options.OptimizeForImmutableData = true;
                    })
                    .AddStartupTask(async (IServiceProvider services, CancellationToken cancellation) =>
                    {
                        // we need Orleans to be started before running this
                        var grainFactory = services.GetRequiredService<IGrainFactory>();
                        if (await grainFactory.GetBootstrapGrain().IsStackReady())
                        {
                            await grainFactory.GetReminderGrain().Start();
                            // wake up to avoid first client getting an error
                            await grainFactory.GetSPSDirectoryGrain().Get();
                        }
                    });
            });

nkosi23 commented 1 year ago

Can you please include the part where you call IHost.Run() so we can get a sense of the flow of the main loop and eliminate some possibilities?

NQ-Brewir commented 1 year ago

            var hostBuilder = StartupHelper.CreateHostBuilder("orleans", args)
                .ConfigureLogging(logging => logging.Setup(logWebHostInfo: false))
                .ConfigureOrleans()
                .ConfigureAspNet();

            using var host = hostBuilder.Build();
            await host.RunAsync();