dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Restarting silos in the cluster: how to speed up the cluster recovery process? #7160

Closed sunliusi closed 3 years ago

sunliusi commented 3 years ago

I run silos in K8s. After restarting the deployment a few times, the silos reported this error:

Unhandled exception. Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 3 of 3 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.42.7.47:11111:364100355, S10.42.5.212:11111:364100339, S10.42.9.83:11111:364100370]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)

When this error occurs, there is a slow recovery process during which the system is unavailable. How can I speed up the recovery process?

ReubenBond commented 3 years ago

Do you have the Microsoft.Orleans.Hosting.Kubernetes package installed?

If not, please install it and follow the instructions in the documentation.

sunliusi commented 3 years ago

I use a different package: Orleans.Clustering.Kubernetes. I'll try Microsoft.Orleans.Hosting.Kubernetes.

ReubenBond commented 3 years ago

Thank you, @sunliusi, please let us know how it goes. The hosting and clustering packages are unrelated.

sunliusi commented 3 years ago

@ReubenBond The Orleans cluster contains three silos. If I restart all three at the same time, the newly started silos fail with this error and restart continuously.

When I rolled out the silo Deployment in K8s, the three silos were restarted almost simultaneously, which caused this problem (it recovers automatically after a period of time).

After I added a livenessProbe and a readinessProbe so that the silo pods are deployed one by one, the problem was resolved.

I understand the need for the connectivity check when all silos exit abnormally, but the recovery takes too long, given that the entire cluster is unavailable in the meantime.

When a new silo starts, is there any way to make the cluster available immediately (at the expense of some reliability), or to shorten the time the cluster is unavailable?

ReubenBond commented 3 years ago

@sunliusi are you using the Kubernetes hosting package? You should not see this issue at all if you are using it and it's configured correctly. Please try following this sample: https://github.com/dotnet/orleans/tree/main/samples/Voting

sunliusi commented 3 years ago

Like you said, the hosting and clustering packages are unrelated. The problem is the same whether or not I use the Kubernetes hosting package.

sunliusi commented 3 years ago

.UseOrleans((context, builder) =>
{
    builder
        .Configure<ClusterOptions>(options => { })
        .UseAdoNetClustering((Action<AdoNetClusteringSiloOptions>)(options => { }))
        .ConfigureEndpoints(11111, 30000, listenOnAnyHostAddress: true)
        .Configure<GrainCollectionOptions>(options =>
        {
            options.CollectionAge = TimeSpan.FromMinutes(10);
        })
        .Configure<ClusterMembershipOptions>(options =>
        {
            options.EnableIndirectProbes = true;
        })
        .ConfigureApplicationParts(parts => parts.AddFromApplicationBaseDirectory())
        .AddStartupTask<GlobalStartupTask>()
        .AddStartupTask<GrainsLoadHandler>();
})

sunliusi commented 3 years ago

Is it a good idea to disable ValidateInitialConnectivity?

ReubenBond commented 3 years ago

Can you show me your pod definition? I don't see the Kubernetes hosting package being configured there. You are missing the .UseKubernetesHosting() call, for example. It is not a good idea to set ValidateInitialConnectivity to false. Your issue can almost certainly be fixed by configuring the silo correctly, and I can help with that. The sample configures the silo correctly.

sunliusi commented 3 years ago

UseKubernetesHosting had the same problem, so I switched to UseAdoNetClustering. Because I am using UseAdoNetClustering, I removed the related environment variables from the YAML.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orleansdemo
  namespace: default
  labels:
    app: orleansdemo
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orleansdemo
      version: v1
  template:
    metadata:
      labels:
        app: orleansdemo
        version: v1
    spec:
      containers:
      - name: orleansdemo
        image: ${CICD_IMAGE}:${CICD_GIT_BRANCH}-${CICD_GIT_COMMIT}
        lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    mkdir -p /home/mount/${HOSTNAME} /home/admin/logs 
                    && ln -s /home/mount/${HOSTNAME} /home/admin/logs/orleansdemo 
                    && echo ${HOSTNAME}
        volumeMounts:
        - name: tz-config
          mountPath: /etc/localtime
        - name: log-volume
          mountPath: /home/mount
        livenessProbe:
          httpGet:
            path: /default/orleansdemo/ClusterStats
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 15
          failureThreshold: 10
        readinessProbe:
          httpGet:
            path: /health/status
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 2
        ports:
        - containerPort: 80
        - containerPort: 11111
        - containerPort: 30000
        - containerPort: 8080
      volumes:
      - name: tz-config
        hostPath:
          path: /etc/localtime
      - name: log-volume
        hostPath:
          path: /home/admin/logs
          type: DirectoryOrCreate

Do I have to use UseKubernetesHosting to deploy in K8S?

sunliusi commented 3 years ago

It stayed in this state for 10 minutes before returning to normal.

image

ReubenBond commented 3 years ago

You don't have to use UseKubernetesHosting to deploy in K8s, but if you do, it will cause the problem you are seeing to go away.

It will fix your issue. If you don't use it, then you will continue to see the issue that you are seeing. I believe this is because your silos are not being shut down gracefully and you are likely shutting them all down between deployments instead of performing rolling upgrades against a healthy cluster.

This is the fifth time I am recommending that you use UseKubernetesHosting, please try it so you can continue to be productive.

EDIT: note that UseKubernetesHosting is not related to UseAdoNetClustering. UseKubernetesHosting adds support for your silos to talk to Kubernetes to understand what is happening, but it does not replace clustering. You should use both.
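
A minimal sketch of what "use both" could look like, based on the silo configuration posted earlier in this thread. It assumes the Microsoft.Orleans.Hosting.Kubernetes and Microsoft.Orleans.Clustering.AdoNet packages on Orleans 3.x. The `Invariant` value and the `ORLEANS_DB` configuration key are illustrative assumptions, not taken from the original configuration. It also assumes the pod spec supplies the labels and environment variables that the hosting package reads, as in the Voting sample.

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

public static class Program
{
    public static void Main(string[] args)
    {
        Host.CreateDefaultBuilder(args)
            .UseOrleans((context, builder) =>
            {
                builder
                    // Kubernetes hosting: lets the silo talk to Kubernetes to learn
                    // about pods. Assumption: it also sets the cluster identity and
                    // endpoints from pod metadata, so ClusterOptions and
                    // ConfigureEndpoints are not configured by hand here.
                    .UseKubernetesHosting()
                    // Clustering is still required; hosting does not replace it.
                    .UseAdoNetClustering(options =>
                    {
                        // Assumption: SQL Server provider; use the invariant for your database.
                        options.Invariant = "System.Data.SqlClient";
                        // Hypothetical configuration key holding the connection string.
                        options.ConnectionString = context.Configuration["ORLEANS_DB"];
                    })
                    .ConfigureApplicationParts(parts => parts.AddFromApplicationBaseDirectory());
            })
            .Build()
            .Run();
    }
}
```

The startup tasks and the GrainCollectionOptions and ClusterMembershipOptions settings from the original snippet can be chained onto the same builder unchanged.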

sunliusi commented 3 years ago

UseKubernetesHosting works fine.

The first time I used UseKubernetesHosting, I had deleted these labels, and the silo kept restarting:

labels:
  orleans/serviceId: votingapp
  orleans/clusterId: votingapp

They seem to be necessary.