Orleans Health Check - Githubissues

michaelgeorgeattard commented 6 years ago

We are running Orleans on a Kubernetes cluster.

What is the recommended way to set up readiness/liveliness (health) checks on silos?

jonathansant commented 6 years ago

any updates on this?

benjaminpetit commented 6 years ago

Sorry for the delay.

Orleans silos already monitor each other and update the membership table accordingly, so maybe you could monitor entries in this table?

Another solution would be to have some "health" grains, with a placement director that force activation placement on specific silos.

michaelgeorgeattard commented 6 years ago

What we are trying to achieve:

Currently if a silo dies, cluster keeps running with a reduced capacity. We are attempting to add a liveliness (health) check to each silo to monitor via Kubes, kill the offending silo/s and spawn new ones.

Questions:

Should we add a liveliness HTTP endpoint within the silo application? Or maybe load one as a side-car?
Is the membership table an accurate source of silo state?

jonathansant commented 6 years ago

@benjaminpetit How do you envision the custom placement director to be invoked by K8s' liveness probe (HTTP)? Does it make sense to create a normal cluster client? If so, how do you guarantee that the cluster client only connects to the silo we're trying to health check?

benjaminpetit commented 6 years ago

Is the membership table an accurate source of silo state?

Yes it should be. The only case I can see where it's not it's when no silos in the cluster are running. But the entries will not be updated, so it should be easy to check.

If a silo is running and see that in the membership table it has been voted "dead", the silo will stop its process, so K8s should start a new silo right away.

How do you envision the custom placement director to be invoked by K8s' liveness probe (HTTP)? Does it make sense to create a normal cluster client? If so, how do you guarantee that the cluster client only connects to the silo we're trying to health check?

If you really want to implement a health check like this, I would use a regular client, that would connect only to the silo to monitor, and would call a grain that is made to be activated on the target silo (I would not use a StatelessWorker but a regular Grain with a custom placement directory that inforce the placement, you could use the SiloAddress as the grain key for example?)

But I don't think you really need a health check, since Orleans is already taking care of that.

michaelgeorgeattard commented 6 years ago

@benjaminpetit Thanks for your detailed response.

Please correct me if I'm wrong, however without a proper health check, K8s has no way of knowing that a Silo is dead, and cannot start a new pod/silo?

sergeybykov commented 6 years ago

@michaelgeorgeattard Because Orleans already performs health checks, a dead silo process will be terminated. K8s will see that and will start a new pod/silo.

michaelgeorgeattard commented 6 years ago

@sergeybykov You are correct.

So, seems the issues we are trying to cover are:

Silo is completely unresponsive, considered dead in membership table, but not even able to kill self
Silo is completely unresponsive, but for some reason not considered dead in membership table (ex. https://github.com/dotnet/orleans/issues/4505)

Did I miss a scenario here? From what I understand, it is not worth it to create a custom health check for these scenarios, what do you think?

stephenlautier commented 6 years ago

In the past we had cases were the Silo wasn't active/working, but the process on the silo didn't stop either. We had this happening back when we were still on Docker Swarm (pre 2.0 stable to be fair) and they remained stale (such as this issue https://github.com/dotnet/orleans/issues/4545). Might be on the membership table was marked dead correctly but i dont remember, probably due to voting by other silos (if there are any working active) it will update the table accordingly.

In such cases, we had to restart containers manually, and from swarm you would see e.g. 4/4 containers healthy and you will be running on 1 silo.

Also if I'm not mistaken, if somehow we get to know whether the Silo healthcheck is 100% (e.g. connected to the cluster and alive), in K8's we can use readiness probes, e.g. we check whether the silo has been bootstrapped correctly and its working before it will be added to the k8's cluster and be marked as healthy (I know in Orleans this isnt that big since it handles clustering), but this can be useful for deployments to be smoother; when deploying a new version it will wait for the new version to be marked as alive before deploying the rest. I think sometimes we already get this, if it crashes very very fast (like in Program before starting silo), but if it takes a little bit more time it will be marked as healthy and the rest will also get deployed, resulting in all silos down.

Useful K8's links if you're not familiar.

So basically those are the reasons we require health checks at the K8s, and so that we can look at the health of K8's and confidently we can identify if something is actually not responding.

benjaminpetit commented 6 years ago

I think a simple healthcheck could be to try to launch a local client that connects to the local silo (using a static gateway provider that points only to localhost). If the connection is successful then the silo is alive.

ishowfun commented 5 years ago

How to make health check by Consul?

ReubenBond commented 5 years ago

Here's an example (which may not work - test it) for how to implement health checks in Orleans using ASP.NET & the Microsoft.Extensions.Diagnostics.HealthChecks package: https://gist.github.com/ReubenBond/2881c9f519910ca4a378d148346e9114

@ishowfun I imagine Consul is similar to Kubernetes, so that example may work for you. It may be worth reading the Consul docs (eg https://www.consul.io/docs/agent/checks.html) & perhaps this post https://scottsauber.com/2017/05/22/using-the-microsoft-aspnetcore-healthchecks-package/

JorgeCandeias commented 5 years ago

Here's an example (which may not work - test it) for how to implement health checks in Orleans using ASP.NET & the Microsoft.Extensions.Diagnostics.HealthChecks package: https://gist.github.com/ReubenBond/2881c9f519910ca4a378d148346e9114

This is a good target for another sample - we have ongoing requirements for this too.

DarkCow commented 5 years ago

Here is my example, I think it is based off Reuben though, that one looks better.

    public class OrleansHealthCheck : IHealthCheck
    {
        public OrleansHealthCheck( IClusterClient clusterClient )
        {
            m_ClusterClient = clusterClient;
        }

        public async Task<HealthCheckResult> CheckHealthAsync( HealthCheckContext context, CancellationToken cancellationToken = default )
        {
            if( !m_ClusterClient.IsInitialized )
                return HealthCheckResult.Healthy( "Initializing" );

            var managementGrain = m_ClusterClient.GetGrain<IManagementGrain>( 0 );

            try
            {
                var hosts = await managementGrain.GetHosts( );

                if( hosts.Values.Any( h => !h.IsTerminating( ) && h.IsUnavailable( ) ) )
                {
                    return HealthCheckResult.Degraded( );
                }
                else
                {
                    return HealthCheckResult.Healthy( );
                }
            }
            catch( Exception ex )
            {
                return HealthCheckResult.Unhealthy( exception: ex );
            }
        }

        private readonly IClusterClient m_ClusterClient;
        private readonly ILogger m_Logger = Log.ForContext<OrleansHealthCheck>( );
    }

    public static class OrleansHealthCheckBuilderExtensions
    {
        const string ms_NAME = "orleans_health_check";

        public static IHealthChecksBuilder AddOrleansHealthCheck(
            this IHealthChecksBuilder builder,
            string name = default,
            HealthStatus? failureStatus = default,
            IEnumerable<string> tags = default )
        {
            return builder.Add( new HealthCheckRegistration(
                name ?? ms_NAME,
                sp => new OrleansHealthCheck( sp.GetRequiredService<IClusterClient>( ) ),
                failureStatus,
                tags ) );
        }
    }

SebastianStehle commented 5 years ago

I just check some of my orleans based business logic in the health check.'

I would also not use the orleans health as a health check for kubernetes. Had some bad experience with it. My cluster was not health and the health check actually caused the nodes to be restarted before they could heal up the cluster.

jsukhabut commented 5 years ago

@stephenlautier I agree that readiness probes on the silo are needed on kubernetes deployments. I hacked in one by adding the Orleans Dashboard in selfhost mode on my silos and setting the readiness probe to the dashboard.

          readinessProbe:
            httpGet:              
              path: /dashboard/              
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 15

The reason it works is because the dashboard does not come up until the silo is somewhat ready.

It would be better if we can have a readiness probe that behaves the same way but a little lighter weight than the dashboard.

benjaminpetit commented 3 years ago

Closing this issue since now we have nuget made for K8s hosting

dotnet / orleans

Orleans Health Check #4843