dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Errors on kubernetes hosting with different cluster ids? #8971

Closed (cangunaydin closed this issue 4 weeks ago)

cangunaydin commented 4 weeks ago

Hello, I have two microservices in my application. I followed the docs to set up Kubernetes hosting. I use version 7.1.2:

    <PackageReference Include="Microsoft.Orleans.Server" Version="7.1.2" />
    <PackageReference Include="Microsoft.Orleans.Hosting.Kubernetes" Version="7.1.2" />

When I configure my first API with serviceId: api and clusterId: api, it works fine on Kubernetes. When I deploy my other API project to Kubernetes with a different service ID and cluster ID (serviceId: approveit-api, clusterId: approveit-api), I get an error on that pod. I couldn't make sense of the reason.

[10:13:54 INF] Connection [Local: 10.244.2.56:11111, Remote: 10.244.2.56:52864, ConnectionId: 0HN3BB6UGR7FF] established with S10.244.2.56:11111:73736033
[10:13:54 INF] Connection [Local: 10.244.2.56:52864, Remote: 10.244.2.56:11111, ConnectionId: 0HN3BB6UGR7FE] established with S10.244.2.56:11111:73736033
[10:13:54 INF] Connection [Local: 10.244.2.56:52864, Remote: 10.244.2.56:11111, ConnectionId: 0HN3BB6UGR7FE] established with S10.244.2.56:11111:73736005
[10:13:54 WRN] The target silo became unavailable for message: Request [S10.244.2.56:11111:73736033 sys.client/hosted-10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.manifest/10.244.2.56:11111@73736005] Orleans.Runtime.ISiloManifestSystemTargetOrleans.Runtime.ISiloManifestSystemTarget.GetSiloManifest() #1. See https://aka.ms/orleans-troubleshooting for troubleshooting help. About to break its promise.
[10:13:54 WRN] The target silo became unavailable for message: Request [S10.244.2.56:11111:73736033 sys.svc.dir.client/10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.dir.client/10.244.2.56:11111@73736005] Orleans.Runtime.GrainDirectory.IRemoteClientDirectoryOrleans.Runtime.GrainDirectory.IRemoteClientDirectory.OnUpdateClientRoutes(System.Collections.Immutable.ImmutableDictionary`2[Orleans.Runtime.SiloAddress,System.ValueTuple`2[System.Collections.Immutable.ImmutableHashSet`1[Orleans.Runtime.GrainId],System.Int64]]) #2. See https://aka.ms/orleans-troubleshooting for troubleshooting help. About to break its promise.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.56:11111:73736005/xF7A56C87, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.55:11111:73734733/x458BC57C, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.55:11111:73734906/x6BA728D5, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.55:11111:73735856/xD29116DD, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.55:11111:73735541/xAE73DD13, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.2.55:11111:73735224/x63772827, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73735346/xC2ABC365, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73735733/x4EF83307, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73734674/x3FD1FD3F, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73734785/x679A2277, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73734919/x9943135E, since it is a primary directory partition to these grain ids.
[10:13:54 INF] Catalog is deactivating 0 activations due to a failure of silo S10.244.3.59:11111:73735095/x5704A5DA, since it is a primary directory partition to these grain ids.
[10:13:54 ERR] Exception publishing client routing table to silo S10.244.2.56:11111:73736005
Orleans.Runtime.SiloUnavailableException: The target silo became unavailable for message: Request [S10.244.2.56:11111:73736033 sys.svc.dir.client/10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.dir.client/10.244.2.56:11111@73736005] Orleans.Runtime.GrainDirectory.IRemoteClientDirectoryOrleans.Runtime.GrainDirectory.IRemoteClientDirectory.OnUpdateClientRoutes(System.Collections.Immutable.ImmutableDictionary`2[Orleans.Runtime.SiloAddress,System.ValueTuple`2[System.Collections.Immutable.ImmutableHashSet`1[Orleans.Runtime.GrainId],System.Int64]]) #2. See https://aka.ms/orleans-troubleshooting for troubleshooting help.
   at Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 98
   at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
   at Orleans.Runtime.GrainDirectory.ClientDirectory.PublishUpdates() in /_/src/Orleans.Runtime/GrainDirectory/ClientDirectory.cs:line 499
[10:13:54 WRN] Error retrieving silo manifest for silo S10.244.2.56:11111:73736005
Orleans.Runtime.SiloUnavailableException: The target silo became unavailable for message: Request [S10.244.2.56:11111:73736033 sys.client/hosted-10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.manifest/10.244.2.56:11111@73736005] Orleans.Runtime.ISiloManifestSystemTargetOrleans.Runtime.ISiloManifestSystemTarget.GetSiloManifest() #1. See https://aka.ms/orleans-troubleshooting for troubleshooting help.
   at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 156
[10:13:54 INF] Joining
[10:13:54 WRN] Attempting to send message addressed to S10.244.2.56:11111:73736005 to connection with S10.244.2.56:11111:73736033. Message Request [S10.244.2.56:11111:73736033 sys.client/hosted-10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.manifest/10.244.2.56:11111@73736005] Orleans.Runtime.ISiloManifestSystemTargetOrleans.Runtime.ISiloManifestSystemTarget.GetSiloManifest() #1
[10:13:54 WRN] Attempting to send message addressed to S10.244.2.56:11111:73736005 to connection with S10.244.2.56:11111:73736033. Message Request [S10.244.2.56:11111:73736033 sys.svc.dir.client/10.244.2.56:11111@73736033]->[S10.244.2.56:11111:73736005 sys.svc.dir.client/10.244.2.56:11111@73736005] Orleans.Runtime.GrainDirectory.IRemoteClientDirectoryOrleans.Runtime.GrainDirectory.IRemoteClientDirectory.OnUpdateClientRoutes(System.Collections.Immutable.ImmutableDictionary`2[Orleans.Runtime.SiloAddress,System.ValueTuple`2[System.Collections.Immutable.ImmutableHashSet`1[Orleans.Runtime.GrainId],System.Int64]]) #2
[10:13:55 INF] -BecomeActive
[10:13:55 INF] Completed to save external localizations.
[10:13:55 INF] -Finished BecomeActive.
[10:13:55 INF] Orleans Silo started.
[10:13:55 WRN] Overriding HTTP_PORTS '8080' and HTTPS_PORTS ''. Binding to values defined by URLS instead 'http://+:80'.
[10:13:55 ERR] Hosting failed to start
System.Net.Sockets.SocketException (13): Permission denied
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Bind(EndPoint localEP)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketTransportOptions.CreateDefaultBoundListenSocket(EndPoint endpoint)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketConnectionListener.Bind()
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketTransportFactory.BindAsync(EndPoint endpoint, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.TransportManager.BindAsync(EndPoint endPoint, ConnectionDelegate connectionDelegate, EndpointConfig endpointConfig, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.<>c__DisplayClass28_0`1.<<StartAsync>g__OnBind|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.BindEndpointAsync(ListenOptions endpoint, AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.ListenOptions.BindAsync(AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.AnyIPListenOptions.BindAsync(AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.AddressesStrategy.BindAsync(AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.BindAsync(ListenOptions[] listenOptions, AddressBindContext context, Func`2 useHttps, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.BindAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.StartAsync[TContext](IHttpApplication`1 application, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Hosting.GenericWebHostService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
   at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)

The pod then crashes and restarts.

Here is how I configured my two microservice apps:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod
  labels:
    app: api
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
      orleans/serviceId: api
  template:
    metadata:
      labels:
        app: api
        # This label is used to identify the service to Orleans
        orleans/serviceId: api
        orleans/clusterId: api
    spec:
      volumes:
        # - name: diagnostics
        #   emptyDir: {}
        # - name: dotnet-monitor-config
        #   configMap:
        #       name: dotnet-monitor-config
        - name: adzup-secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: adzup-prod-secrets-provider
      containers:
        - name: api
          image: tribulus.azurecr.io/adzup_api:#{Build.BuildId}#
          ports:
            - containerPort: 80
            # Define the ports which Orleans uses
            - containerPort: 11111
            - containerPort: 30000
          volumeMounts:
            # - mountPath: /diagnostics
            #   name: diagnostics
            - name: adzup-secrets-store
              mountPath: "/mnt/secrets-store"
              readOnly: true
          env:
          # - name: DOTNET_DiagnosticPorts
          #   value: /diagnostics/dotnet-monitor.sock
          - name: ORLEANS_SERVICE_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/serviceId']
          - name: ORLEANS_CLUSTER_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/clusterId']
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: DOTNET_SHUTDOWNTIMEOUTSECONDS
            value: "120"
          envFrom:
            - configMapRef:
                name: api-configmap
            - secretRef:
                name: api-secrets
            - secretRef:
                name: azureblob-connstring-secrets
            - secretRef:
                name: postgres-pgbouncer-connstring-secrets
            - secretRef:
                name: redis-connstring-secrets
            - secretRef:
                name: stripe-secrets
            - secretRef:
                name: rabbitmq-secrets
          resources:
            limits:
              cpu: "3000m"
              memory: "4Gi"
            requests:
              cpu: "200m"
              memory: "512Mi"
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 180

And the second one:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: approveit-api
  namespace: prod
  labels:
    app: approveit-api
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: approveit-api
      orleans/serviceId: approveit-api
  template:
    metadata:
      labels:
        app: approveit-api
        # This label is used to identify the service to Orleans
        orleans/serviceId: approveit-api
        orleans/clusterId: api
    spec:
      volumes:
        # - name: diagnostics
        #   emptyDir: {}
        # - name: dotnet-monitor-config
        #   configMap:
        #       name: dotnet-monitor-config
        - name: adzup-secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: adzup-prod-secrets-provider
      containers:
        - name: approveit-api
          image: tribulus.azurecr.io/adzup_approveit_api:#{Build.BuildId}#
          ports:
            - containerPort: 80
            # Define the ports which Orleans uses
            - containerPort: 11111
            - containerPort: 30000
          volumeMounts:
            # - mountPath: /diagnostics
            #   name: diagnostics
            - name: adzup-secrets-store
              mountPath: "/mnt/secrets-store"
              readOnly: true
          env:
          # - name: DOTNET_DiagnosticPorts
          #   value: /diagnostics/dotnet-monitor.sock
          - name: ORLEANS_SERVICE_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/serviceId']
          - name: ORLEANS_CLUSTER_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['orleans/clusterId']
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: DOTNET_SHUTDOWNTIMEOUTSECONDS
            value: "120"
          envFrom:
            - configMapRef:
                name: approveit-api-configmap
            - secretRef:
                name: azureblob-connstring-secrets
            - secretRef:
                name: postgres-pgbouncer-connstring-secrets
            - secretRef:
                name: redis-connstring-secrets
            - secretRef:
                name: rabbitmq-secrets
          resources:
            limits:
              cpu: "3000m"
              memory: "4Gi"
            requests:
              cpu: "200m"
              memory: "512Mi"
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 180

cangunaydin commented 4 weeks ago

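I figured it out after some time. It was stupid of me: I forgot to change the Dockerfile to run as root, so the app could not bind to port 80 (the "Permission denied" SocketException at the end of the log above). That is why I was having trouble. Sorry for the inconvenience.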
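
For readers who hit the same "Permission denied" error and would rather not run the container as root, one possible alternative is to have the app listen on a non-privileged port. The fragment below is only a sketch of the container section of the second Deployment. It assumes the app takes its listen address from the ASPNETCORE_URLS environment variable and that nothing else (for example a value in approveit-api-configmap or a hard-coded URL in the app) overrides it. Port 8080 is an illustrative choice, not something taken from the original setup.

```yaml
# Sketch only: listen on a non-privileged port instead of running as root.
# Port 8080 and the ASPNETCORE_URLS entry are assumptions for illustration.
containers:
  - name: approveit-api
    ports:
      - containerPort: 8080   # was 80; by default, binding below 1024 on Linux needs elevated privileges
      - containerPort: 11111  # Orleans silo-to-silo port (unchanged)
      - containerPort: 30000  # Orleans gateway port (unchanged)
    env:
      # The log above shows Kestrel binding to URLS 'http://+:80'; this points it at 8080 instead.
      - name: ASPNETCORE_URLS
        value: "http://+:8080"
```

If a Kubernetes Service or Ingress forwards traffic to port 80 on this pod, its target port would need the same change. Entries under env take precedence over values with the same key loaded through envFrom, which is why the variable is set explicitly here.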