akkadotnet / Akka.Management

Akka.NET cluster management, bootstrapping, and more.
Other
27 stars 11 forks source link

Singleton with kubernetes lease does not recover when lease is reacquired #2490

Open SchiessMax opened 1 month ago

SchiessMax commented 1 month ago

Version Information Akka.Coordination.KubernetesApi 1.5.19 Akka.Cluster 1.5.19

Describe the bug We had a problem in our test system where a singleton could not be contacted over it's proxy any more. According to logs this might have happened because the lease was reaquired. The bug can be reproduced by simply deleting the kubernetes lease using kubectl. This leads to the service reacquiring the lease.

To Reproduce Steps to reproduce the behavior:

  1. Create a microservice that defines a singleton and a proxy for it and uses the kubernetes lease
  2. Start the service - lease should be acquired and it should be possible to interact with the singleton over it's proxy
  3. Delete the lease using kubectl
  4. The lease will be reacquired by the microservice but the singleton is not reachable over the proxy

Expected behavior The lease is reacquired and the proxy can reach the singleton again

Actual behavior The lease was reacquired but the proxy could not reach the singleton.

Environment Are you running on Linux? Windows? Docker? Which version of .NET? net8.0, linux docker images inside a kubernetes cluster

Additional context Logs after start:

[INFO][05/13/2024 07:30:40.997Z][Thread 0026][Cluster (akka://CasActorSystem)] Cluster Node [akka.tcp://CasActorSystem@10.0.4.219:35023] - Welcome from [akka.tcp://CasActorSystem@10.0.4.118:37339]
[INFO][05/13/2024 07:30:47.010Z][Thread 0013][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Trying to acquire lease before starting singleton
[INFO][05/13/2024 07:30:47.032Z][Thread 0013][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] ClusterSingletonManager state change [Start -> AcquiringLease] Akka.Cluster.Tools.Singleton.Uninitialized
[INFO][05/13/2024 07:30:47.203Z][Thread 0026][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Acquire lease result True
[INFO][05/13/2024 07:30:47.204Z][Thread 0026][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Singleton manager started singleton actor [akka://CasActorSystem/user/edge-directory-service/edge-directory-service] 
[INFO][05/13/2024 07:30:47.204Z][Thread 0026][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] ClusterSingletonManager state change [AcquiringLease -> Oldest] Akka.Cluster.Tools.Singleton.AcquiringLeaseData
[INFO][05/13/2024 07:30:47.989Z][Thread 0007][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service-proxy] Singleton identified at [akka://CasActorSystem/user/edge-directory-service/edge-directory-service]

Logs after deleting the lease:

[WARNING][05/13/2024 08:01:02.567Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/KubernetesLease2] Failure during heartbeat to lease. Lease assumed to be released.
Cause: Akka.Coordination.LeaseException: PUT for lease casactorsystem-singleton-akkacasactorsystemuseredge-directory-s returned unexpected status code $NotFound. Reason: [Not Found]Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"leases.akka.io \"casactorsystem-singleton-akkacasactorsystemuseredge-directory-s\" not found","reason":"NotFound","details":{"name":"casactorsystem-singleton-akkacasactorsystemuseredge-directory-s","group":"akka.io","kind":"leases"},"code":404}

   at Akka.Coordination.KubernetesApi.Internal.KubernetesApiImpl.UpdateLeaseResource(String leaseName, String ownerName, String version, Nullable`1 time)
[WARNING][05/13/2024 08:01:02.567Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Lease has been lost. Terminating singleton and trying to re-acquire lease
Cause: Akka.Coordination.LeaseException: PUT for lease casactorsystem-singleton-akkacasactorsystemuseredge-directory-s returned unexpected status code $NotFound. Reason: [Not Found]Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"leases.akka.io \"casactorsystem-singleton-akkacasactorsystemuseredge-directory-s\" not found","reason":"NotFound","details":{"name":"casactorsystem-singleton-akkacasactorsystemuseredge-directory-s","group":"akka.io","kind":"leases"},"code":404}

   at Akka.Coordination.KubernetesApi.Internal.KubernetesApiImpl.UpdateLeaseResource(String leaseName, String ownerName, String version, Nullable`1 time)
[INFO][05/13/2024 08:01:02.567Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] ClusterSingletonManager state change [Oldest -> AcquiringLease] Akka.Cluster.Tools.Singleton.OldestData
[INFO][05/13/2024 08:01:02.567Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Releasing lease as leaving Oldest
[INFO][05/13/2024 08:01:02.569Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/KubernetesLease2] Release request for owner CasActorSystem@10.0.4.219:35023 lease casactorsystem-singleton-akkacasactorsystemuseredge-directory-s while previous acquire/release still in progress. Current state: Akka.Coordination.KubernetesApi.LeaseActor+Idle
[WARNING][05/13/2024 08:01:02.571Z][Thread 0049][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] unhandled event Failure: System.AggregateException: One or more errors occurred. (Tried to release a lease that is not acquired)
---> Akka.Coordination.LeaseException: Tried to release a lease that is not acquired
   at Akka.Coordination.KubernetesApi.KubernetesLease.<>c.<Release>b__17_0(Task`1 t)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Akka.Cluster.Tools.Singleton.ClusterSingletonManager.<>c.<InitializeFSM>b__42_21(Task`1 r)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Akka.Actor.PipeToSupport.PipeTo[T](Task`1 taskToPipe, ICanTell recipient, Boolean useConfigureAwait, IActorRef sender, Func`2 success, Func`2 failure) in state AcquiringLease
[INFO][05/13/2024 08:01:02.578Z][Thread 0032][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Singleton actor terminated. Trying to acquire lease again before re-creating.
[INFO][05/13/2024 08:01:02.578Z][Thread 0032][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] ClusterSingletonManager state change [AcquiringLease -> AcquiringLease] Akka.Cluster.Tools.Singleton.AcquiringLeaseData
[INFO][05/13/2024 08:01:02.579Z][Thread 0032][akka://CasActorSystem/user/edge-directory-service/edge-directory-service] Message [RegisterEdgeActor] from [akka://CasActorSystem/user/EdgeActor-endurance-tests-24-92e81ee9-5480-4e76-9f29-83171afa7b14#981274621] to [akka://CasActorSystem/user/edge-directory-service/edge-directory-service#1238423219] was not delivered. [9] dead letters encountered. If this is not an expected behavior then [akka://CasActorSystem/user/edge-directory-service/edge-directory-service#1238423219] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: CODESYS.CAS.EdgeApi.Directory.DirectoryActor+RegisterEdgeActor
[INFO][05/13/2024 08:01:02.579Z][Thread 0032][akka://CasActorSystem/user/edge-directory-service/edge-directory-service] Message [RegisterEdgeActor] from [akka://CasActorSystem/user/EdgeActor-juergens-9-38e28d83-f368-4510-8cec-5fed74d6d0a7#749959556] to [akka://CasActorSystem/user/edge-directory-service/edge-directory-service#1238423219] was not delivered. [10] dead letters encountered, no more dead letters will be logged in next [00:05:00]. If this is not an expected behavior then [akka://CasActorSystem/user/edge-directory-service/edge-directory-service#1238423219] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: CODESYS.CAS.EdgeApi.Directory.DirectoryActor+RegisterEdgeActor
[INFO][05/13/2024 08:01:02.590Z][Thread 0032][KubernetesApiImpl (akka://CasActorSystem)] lease casactorsystem-singleton-akkacasactorsystemuseredge-directory-s does not exist, creating
[INFO][05/13/2024 08:01:02.634Z][Thread 0048][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Acquire lease result True
[INFO][05/13/2024 08:01:02.634Z][Thread 0048][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] Singleton manager started singleton actor [akka://CasActorSystem/user/edge-directory-service/edge-directory-service] 
[INFO][05/13/2024 08:01:02.634Z][Thread 0048][akka.tcp://CasActorSystem@10.0.4.219:35023/user/edge-directory-service] ClusterSingletonManager state change [AcquiringLease -> Oldest] Akka.Cluster.Tools.Singleton.AcquiringLeaseData
SchiessMax commented 3 weeks ago

A short update from our side: The problem occured in our productive system, so we now removed the kubernetes lease from our code. We will use it again once it's more stable.

Aaronontheweb commented 2 weeks ago

Thanks @SchiessMax for the update - we'll have to take a look at this too