Netflix / Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Hystrix ThreadPool Rejection issue #1596

Open RakeshAMore opened 7 years ago

RakeshAMore commented 7 years ago

Hi Team, we are facing a weird issue in our service when releasing a new version. The circuit trips (reason: thread pool rejection) every time we switch to the new version of the service. Our service normally handles 150-200 requests per second. Once we switch to the new version, the circuit trips with the error below, although coreSize is 20 and maximumSize is 100 for the Hystrix command. "Task java.util.concurrent.FutureTask@1ad6194b rejected from java.util.concurrent.ThreadPoolExecutor@6a085fce[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]"

It possibly happened because Hystrix could not acquire enough resources to start up its thread pool (see the exception). Once the threads were acquired, the circuit closed.

Have you seen such an issue before? Is there any workaround I can try?

RakeshAMore commented 7 years ago

Surely this is happening because of ThreadPoolExecutor behaviour. The thread pool creates threads as tasks are submitted to it: http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html.
According to the Javadoc, when a new task is submitted and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If more than corePoolSize but fewer than maximumPoolSize threads are running, a new thread is created only if the queue is full. In the case of the Hystrix thread pool, however, tasks are never queued unless the properties maxQueueSize and queueSizeRejectionThreshold are set.
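
For illustration, the JDK behaviour described above can be reproduced without Hystrix. The sketch below is not Hystrix code: it builds a plain ThreadPoolExecutor on a SynchronousQueue (the queue type Hystrix falls back to when no maxQueueSize is configured), and the class and method names are hypothetical. With no queue capacity, every task beyond maximumPoolSize concurrent tasks is rejected immediately.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, plain JDK only.
class RejectionDemo {

    /** Submits `tasks` blocking tasks to a queue-less pool and returns how many were rejected. */
    static int countRejections(int coreSize, int maximumSize, int tasks) {
        // SynchronousQueue has no capacity: a task is handed to a thread or rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreSize, maximumSize, 1, TimeUnit.MINUTES, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // keep the worker busy until released
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // default AbortPolicy throws when the pool is saturated
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 core threads + 2 extra threads are busy, so the 5th task is rejected.
        System.out.println(countRejections(2, 4, 5));
    }
}
```

The rejection count depends only on how many tasks are in flight at once, not on how many threads have been created so far.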

So, theoretically, what is happening in our case is:

Requests per second: ~200
Approximate response time: 500 milliseconds
We have set coreSize = 20 and maximumSize = 100 in the thread pool properties of the command.

When 150 to 200 rps arrive during deployment of the new version, the Hystrix thread pool behaves as follows. It keeps creating threads up to maximumSize, since the Hystrix thread pool does not queue requests. After reaching the limit of 100, it starts throwing thread pool rejection errors. (Rejections can apparently also happen while threads are still being created, as we observed weird logs such as a thread pool rejection with core size 3.)
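
A rough back-of-the-envelope check of these numbers, using Little's law (average in-flight requests = arrival rate × average latency). This is only a sketch: the class name is hypothetical, the 600 ms figure is an assumed warm-up latency rather than a measured one, and real traffic is burstier than an average suggests.

```java
// Hypothetical helper, plain JDK only.
class ConcurrencyEstimate {

    /** Little's law: average in-flight requests = requests per second * latency in seconds. */
    static double inFlight(double requestsPerSecond, double latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        // Steady state from the thread: ~200 rps at ~500 ms per call.
        System.out.println(inFlight(200, 500)); // 100.0, right at maximumSize = 100
        // Assumed slower latency during warm-up (illustrative value only).
        System.out.println(inFlight(200, 600)); // 120.0, above maximumSize = 100
    }
}
```

At 200 rps and 500 ms the pool has no headroom at all, so any extra latency just after a deployment would push the average demand past 100 threads.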

Solution: I tried two options for this.

  1. Activate the queue, i.e. set appropriate values for the properties maxQueueSize and queueSizeRejectionThreshold.
  2. Increase maximumSize to 200.

I went with option 2 as it seems more reliable, because idle threads will be terminated anyway to free resources, as per the default value of the property keepAliveTimeMinutes. Let me know your thoughts on this.
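
As a sketch, the two options could look like this as Hystrix configuration properties. The `default` scope and the queue values are assumptions for illustration; a specific pool would use its HystrixThreadPoolKey in place of `default`. One thing to check: per the Hystrix configuration documentation, maximumSize only takes effect when allowMaximumSizeToDivergeFromCoreSize is also set to true, and the thread does not say whether that was the case here.

```properties
# Option 1: enable a bounded queue (values are illustrative placeholders)
hystrix.threadpool.default.maxQueueSize=100
hystrix.threadpool.default.queueSizeRejectionThreshold=50

# Option 2: let the pool grow beyond coreSize, up to 200 threads
hystrix.threadpool.default.coreSize=20
hystrix.threadpool.default.maximumSize=200
hystrix.threadpool.default.allowMaximumSizeToDivergeFromCoreSize=true
# Idle threads above coreSize are released after this many minutes (1 is the default)
hystrix.threadpool.default.keepAliveTimeMinutes=1
```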

RakeshAMore commented 7 years ago

Any comments on this?

mattrjacobs commented 7 years ago

In general, we haven't internally observed behavior like this. JVM apps generally perform worse at startup than in steady-state, due to JIT-compilation.

A couple of ideas to see where your problem lies: 1) Print out the latency for each command. Possibly the initial requests are very latent and consuming the threadpool for longer than you think. 2) Build a version of Hystrix that starts all core threads, and try that in your application. If starting threads is the issue, this would be a solution (I think).
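
To make the second idea concrete, here is what pre-starting core threads does on a plain JDK ThreadPoolExecutor. This is a sketch only: it does not touch Hystrix, the class and method names are hypothetical, and wiring the same call into Hystrix's own pool would need a custom build or a custom concurrency strategy, which is not shown.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, plain JDK only.
class PrestartDemo {

    /** Returns {pool size before prestart, threads started, pool size after prestart}. */
    static int[] prestart(int coreSize, int maximumSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreSize, maximumSize, 1, TimeUnit.MINUTES, new SynchronousQueue<>());
        try {
            int before = pool.getPoolSize();             // no threads exist until work arrives
            int started = pool.prestartAllCoreThreads(); // eagerly create all core threads
            int after = pool.getPoolSize();
            return new int[] {before, started, after};
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] r = prestart(20, 100);
        System.out.println(r[0] + " -> " + r[2] + " (started " + r[1] + ")");
    }
}
```

Pre-starting only removes thread creation cost from the first requests. It does not raise the number of tasks the pool can run at once, so it would not help if slow initial calls are what saturates the pool.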

RakeshAMore commented 7 years ago

Yes, I agree it could be due to JIT compilation, as the initial latency of the command is 100+ milliseconds more than I expect. I also tried the prestartAllCoreThreads() solution for the Hystrix thread pool, but no luck.