Netflix / eureka

AWS Service registry for resilient mid-tier load balancing and failover.
Apache License 2.0

TimedSupervisorTask : task supervisor rejected the task (eureka-client 1.10.17) #1510

Open a-filimonov opened 1 year ago

a-filimonov commented 1 year ago

One of the pods in our cluster became unreachable because the Eureka client stopped updating its service registry cache. Initially, the pod started logging WARNs indicating that the thread pool was full:

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@3201d694[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@7ae6c02f[Wrapped task = com.netflix.discovery.DiscoveryClient$CacheRefreshThread@6c874d5e]] rejected from java.util.concurrent.ThreadPoolExecutor@60790f93[Running, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 20397]
    at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
    at com.netflix.discovery.TimedSupervisorTask.run(TimedSupervisorTask.java:66)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
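
For readers following the trace, here is a minimal, standalone sketch of how a `ThreadPoolExecutor` ends up throwing this exception. It does not use Eureka code. The pool settings (maximum of 2 threads, direct-handoff `SynchronousQueue`) are assumptions chosen to match the `pool size = 2, active threads = 2, queued tasks = 0` state in the log, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SaturatedPoolDemo {

    /** Returns true if a third submission is rejected while two workers are stuck. */
    static boolean thirdTaskIsRejected() {
        // Assumed settings: at most 2 threads and no queue capacity,
        // mirroring the pool state printed in the WARN above.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 0, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(2);

        // Stands in for a refresh call that never returns.
        Runnable stuck = () -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            pool.submit(stuck);
            pool.submit(stuck);
            started.await(); // both workers are now busy
            try {
                pool.submit(() -> { });
                return false; // accepted: the pool was not saturated
            } catch (RejectedExecutionException e) {
                return true;  // default AbortPolicy rejects the task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            release.countDown(); // let the stuck tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("third task rejected: " + thirdTaskIsRejected());
    }
}
```

With both workers blocked and nowhere to queue, every later submission is rejected until a worker is freed. That would be consistent with a pod that keeps logging this WARN and never refreshes its cache, though the sketch does not establish what blocked the workers in our case.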

Four days later, the dependent service was redeployed, which changed its hosts. The affected pod failed to fetch the updated host list and started failing every request, since the dependent service is used in every request.

I/O exception (java.net.NoRouteToHostException) caught when processing request to ->http://10.164.92.119:8080: No route to host (Host unreachable)

This caused a temporary outage in our production system until we noticed the affected pod and restarted it. All other pods of the same system were stable and did not have this issue.

After some investigation, it seems that this issue has been reproduced a couple of times on different versions of eureka-client. For reference: https://github.com/Netflix/eureka/issues/907

Current eureka-client version: 1.10.17. Spring Cloud version: 3.1.2.

We also noticed that eureka-client was updated only in the major release of Spring Cloud: https://mvnrepository.com/artifact/org.springframework.cloud/spring-cloud-starter-netflix-eureka-client/4.0.0

akislyak commented 10 months ago

We got the same error with eureka-client version 1.10.17, Spring Cloud version 3.1.7.

guerricmerleHUG commented 5 months ago

Same error in our production environment with eureka-client version 2.0.1, Spring Cloud version 2023.0.0, Spring Boot 3.2.4.

logger_name com.netflix.discovery.TimedSupervisorTask

java.util.concurrent.TimeoutException: null
    at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
    at com.netflix.discovery.TimedSupervisorTask.run(TimedSupervisorTask.java:68)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
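
The trace shows the timeout being raised from `FutureTask.get` inside a supervising task. As a rough illustration of that general pattern, and of one way a timeout could be followed by rejections, here is a simplified standalone sketch. It is not the Eureka source: the class, enum, and method names are hypothetical, and the single-thread pool with a `SynchronousQueue` is an assumption made to keep the demo small. The hung task deliberately ignores interruption, which is also an assumption about how a worker might stay occupied after `cancel(true)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SupervisedCallDemo {

    enum Outcome { COMPLETED, TIMED_OUT, REJECTED, FAILED }

    /** Submits the task, waits up to timeoutMillis, and always cancels afterwards. */
    static Outcome runSupervised(ExecutorService executor, Runnable task, long timeoutMillis) {
        Future<?> future = null;
        try {
            future = executor.submit(task);
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT;
        } catch (RejectedExecutionException e) {
            return Outcome.REJECTED;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            if (future != null) {
                // Only requests interruption; the worker is freed
                // only if the task actually responds to it.
                future.cancel(true);
            }
        }
    }

    /** Runs a hung task, then a trivial one, and reports both outcomes. */
    static String demo() {
        ExecutorService executor = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);

        // Hypothetical hung task that swallows interrupts and keeps waiting.
        Runnable hung = () -> {
            boolean done = false;
            while (!done) {
                try {
                    release.await();
                    done = true;
                } catch (InterruptedException ignored) {
                    // keep waiting: this is what keeps the worker occupied
                }
            }
        };

        try {
            Outcome first = runSupervised(executor, hung, 100);
            Outcome second = runSupervised(executor, () -> { }, 100);
            return first + "," + second;
        } finally {
            release.countDown(); // let the hung task finish
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "TIMED_OUT,REJECTED"
    }
}
```

In this sketch the first run times out and the second is rejected because the only worker is still held by the hung task. Whether the same sequence explains the traces in this thread is not confirmed here; a thread dump from an affected pod would show what the refresh threads are actually waiting on.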