Closed sunliusi closed 3 years ago
Do you have the Microsoft.Orleans.Hosting.Kubernetes package installed?
If not, please install it and follow the instructions in the documentation.
I use a different package: Orleans.Clustering.Kubernetes. I'll try Microsoft.Orleans.Hosting.Kubernetes.
Thank you, @sunliusi, please let us know how it goes. The hosting and clustering packages are unrelated.
@ReubenBond The Orleans cluster contains three silos. If I restart all three at the same time, the newly added silos fail with an error and restart continuously.
When I rolled out the silo Deployment in K8s, the three silos were restarted almost simultaneously, which caused this problem (it recovers automatically after a period of time).
After I added a livenessProbe and a readinessProbe so that the silo pods deploy one by one, the problem was resolved.
I can understand the need for a connectivity check when all silos exit abnormally, but the entire cluster stays unavailable for too long.
When a new silo starts, is there any way to make the cluster available immediately (at the expense of some reliability), or to shorten the time the cluster is unavailable?
@sunliusi are you using the Kubernetes hosting package? You should not see this issue at all if you are using it and it's configured correctly. Please try following this sample: https://github.com/dotnet/orleans/tree/main/samples/Voting
Like you said, the hosting and clustering packages are unrelated. The problem is the same whether or not I use the Kubernetes hosting package.
.UseOrleans((context, builder) =>
{
    builder
        .Configure<ClusterOptions>(options => { })
        .UseAdoNetClustering((Action<AdoNetClusteringSiloOptions>)(options => { }))
        .ConfigureEndpoints(11111, 30000, listenOnAnyHostAddress: true)
        .Configure<GrainCollectionOptions>(options =>
        {
            options.CollectionAge = TimeSpan.FromMinutes(10);
        })
        .Configure<ClusterMembershipOptions>(options =>
        {
            options.EnableIndirectProbes = true;
        })
        .ConfigureApplicationParts(parts => parts.AddFromApplicationBaseDirectory())
        .AddStartupTask<GlobalStartupTask>()
        .AddStartupTask<GrainsLoadHandler>();
})
Is it a good idea to disable ValidateInitialConnectivity?
Can you show me your pod definition? I don't see the kubernetes hosting package being configured there.
You are missing the .UseKubernetesHosting() call, for example.
It is not a good idea to set ValidateInitialConnectivity to false.
Your issue can almost certainly be fixed by configuring the silo correctly and I can help with that. The sample configures the silo correctly.
UseKubernetesHosting had the same problem, so I switched to UseAdoNetClustering. Because I am using UseAdoNetClustering, I removed the related environment variables from the YAML.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orleansdemo
  namespace: default
  labels:
    app: orleansdemo
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orleansdemo
      version: v1
  template:
    metadata:
      labels:
        app: orleansdemo
        version: v1
    spec:
      containers:
        - name: orleansdemo
          image: ${CICD_IMAGE}:${CICD_GIT_BRANCH}-${CICD_GIT_COMMIT}
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    mkdir -p /home/mount/${HOSTNAME} /home/admin/logs
                    && ln -s /home/mount/${HOSTNAME} /home/admin/logs/orleansdemo
                    && echo ${HOSTNAME}
          volumeMounts:
            - name: tz-config
              mountPath: /etc/localtime
            - name: log-volume
              mountPath: /home/mount
          livenessProbe:
            httpGet:
              path: /default/orleansdemo/ClusterStats
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 10
          readinessProbe:
            httpGet:
              path: /health/status
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 5
            successThreshold: 2
          ports:
            - containerPort: 80
            - containerPort: 11111
            - containerPort: 30000
            - containerPort: 8080
      volumes:
        - name: tz-config
          hostPath:
            path: /etc/localtime
        - name: log-volume
          hostPath:
            path: /home/admin/logs
            type: DirectoryOrCreate
Do I have to use UseKubernetesHosting to deploy in K8S?
This state lasted 10 minutes before returning to normal.
You don't have to use UseKubernetesHosting to deploy in K8s, but if you do, it will cause the problem you are seeing to go away.
It will fix your issue. If you don't use it, then you will continue to see the issue that you are seeing. I believe this is because your silos are not being shutdown gracefully and you are likely shutting them all down between deployments instead of performing rolling upgrades against a healthy cluster.
This is the fifth time I am recommending that you use UseKubernetesHosting, please try it so you can continue to be productive.
EDIT: note that UseKubernetesHosting is not related to UseAdoNetClustering. UseKubernetesHosting adds support for your silos to talk to Kubernetes to understand what is happening, but it does not replace clustering. You should use both.
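For readers following along, here is a rough sketch of what "rolling upgrades against a healthy cluster" and graceful shutdown could look like in the Deployment spec. This is only a fragment, not taken from the poster's manifest: the container name is copied from the YAML above, and the timing values (60, 180, "120") are illustrative assumptions that would need tuning for a real workload.

```yaml
# Fragment of a Deployment spec; all numeric values are illustrative assumptions.
spec:
  replicas: 3
  # Wait this long after a pod becomes ready before replacing the next one.
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Never take an old silo down before its replacement is ready.
      maxUnavailable: 0
      # Replace silos one at a time.
      maxSurge: 1
  template:
    spec:
      # Time Kubernetes waits after SIGTERM before killing the pod.
      terminationGracePeriodSeconds: 180
      containers:
        - name: orleansdemo
          env:
            # Gives the .NET generic host time to stop the silo cleanly;
            # kept lower than terminationGracePeriodSeconds.
            - name: DOTNET_SHUTDOWNTIMEOUTSECONDS
              value: "120"
```

The idea is that a silo that shuts down gracefully marks itself as no longer active in the membership table, so a newly starting silo should not try to ping it during its initial connectivity check.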
UseKubernetesHosting works fine.
The first time I used UseKubernetesHosting, I had deleted these labels and the silo kept restarting:
labels:
  orleans/serviceId: votingapp
  orleans/clusterId: votingapp
They seem to be required.
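As a hedged illustration of why the labels matter: as far as I understand the Kubernetes hosting documentation, UseKubernetesHosting reads its settings from environment variables that are usually populated from these pod labels and other pod fields through the downward API. The fragment below is a sketch under that assumption; the label values come from this thread and the container name from the Deployment posted earlier.

```yaml
# Fragment of a Deployment spec showing the pod labels and the environment
# variables derived from them. Label values are the ones quoted in this thread.
spec:
  template:
    metadata:
      labels:
        orleans/serviceId: votingapp
        orleans/clusterId: votingapp
    spec:
      containers:
        - name: orleansdemo
          env:
            - name: ORLEANS_SERVICE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['orleans/serviceId']
            - name: ORLEANS_CLUSTER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['orleans/clusterId']
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
```

If the labels are removed, the two label-based variables have nothing to resolve to, which would be consistent with the restart loop described above. The documentation also describes granting the pod's service account permission to read pods in its namespace, which this fragment does not cover.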
I run the silos in K8s. After restarting the deployment a few times, a silo reported this error:
Unhandled exception. Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 3 of 3 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.42.7.47:11111:364100355, S10.42.5.212:11111:364100339, S10.42.9.83:11111:364100370]
at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location ---
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
When this error occurs, there is a slow recovery process during which the system is unavailable. How can I speed up the recovery process?