Netflix / concurrency-limits


GC pauses make maxInflightRequest highly volatile when response times are lower than the pauses themselves #156

Open martinlocurcio opened 4 years ago

martinlocurcio commented 4 years ago

Hi. I've just asked this question on Stack Overflow:

https://stackoverflow.com/questions/59311752/how-to-limit-concurrency-in-a-webapp-when-gc-pauses-last-more-than-the-average-r#

I've forked this project and integrated it into an application, and I realized that the GC pauses last longer than my average response time. When tracking the value of maxInflightRequest I can see that whenever a GC (minor or major) runs, the value of maxInflightRequest goes up and reaches the threshold I've configured, stressing the application. So I'm getting rejections for requests that should have been processed.

All the details are in the stackoverflow question
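For reference, a limiter ends up wired around request handling roughly like this (just a sketch, assuming the library's SimpleLimiter/FixedLimit builders and a fixed limit for illustration): a stop-the-world pause can strike anywhere inside the handler, so every request that already passed acquire() stays counted as in flight until the pause ends.

```java
import java.util.Optional;

import com.netflix.concurrency.limits.Limiter;
import com.netflix.concurrency.limits.limit.FixedLimit;
import com.netflix.concurrency.limits.limiter.SimpleLimiter;

public class InflightExample {
    // Fixed limit of 3, purely for illustration.
    private final Limiter<Void> limiter = SimpleLimiter.newBuilder()
            .limit(FixedLimit.of(3))
            .build();

    String handle(Runnable businessLogic) {
        Optional<Limiter.Listener> listener = limiter.acquire(null);
        if (!listener.isPresent()) {
            // Limit reached: during a GC pause nothing in flight completes,
            // so new arrivals pile up here and get rejected.
            return "429 Too Many Requests";
        }
        try {
            businessLogic.run();          // a stop-the-world pause can happen anywhere in here
            listener.get().onSuccess();   // frees the in-flight slot and records the RTT sample
            return "200 OK";
        } catch (RuntimeException e) {
            listener.get().onIgnore();    // don't let errors skew the RTT-based limit
            throw e;
        }
    }
}
```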

IgorPerikov commented 4 years ago

I think you either tune the GC or change the architecture so there is some sort of sidecar proxy, which can observe latency from the outside and isn't affected by the application's GC. The second approach seems way harder 😃

martinlocurcio commented 4 years ago

Hi @IgorPerikov, thanks for your answer! I've thought about that approach, but I suspect the outcome may be the same. Let's say I put an API management gateway in front just to limit concurrency (and assume there are no GC pauses in the gateway itself). The gateway keeps its own inFlightRequest count and rejects a request once that count reaches a given threshold. The average inFlightRequest value the gateway reports will be pretty much the same as the one I'm seeing now, or slightly different due to network time shifting it to another number. When a GC is triggered in the service, that value increases in the gateway as well.

I think the only thing that could help here is if the inFlightRequest value reported by the gateway were less volatile in absolute terms. For instance, my current average value is 2 to 3, and when a GC is triggered it goes up to 45-60 (in a full GC). If I could reduce that difference and make the inFlightRequest measure less volatile, maybe it would behave better, but I think there's no way to tell in advance without trying it.
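To make that concrete: a gateway that counts in-flight requests increments on dispatch and decrements on response, so during a backend GC pause the count climbs for exactly the same reason it does in-process. A minimal sketch (all names hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical gateway-side limiter: during a backend GC pause no responses
// arrive, so the in-flight count climbs just like the in-process one does.
class GatewayInflightLimiter {
    private final AtomicInteger inflight = new AtomicInteger();
    private final int threshold;

    GatewayInflightLimiter(int threshold) {
        this.threshold = threshold;
    }

    boolean tryForward() {
        // Reject before forwarding if the backend already looks saturated.
        if (inflight.incrementAndGet() > threshold) {
            inflight.decrementAndGet();
            return false;
        }
        return true;
    }

    void onResponse() {
        // Never called while the backend is paused, so inflight keeps climbing.
        inflight.decrementAndGet();
    }
}
```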

IgorPerikov commented 4 years ago

Oh, I was answering another "question" 😅 I thought you had problems with calculating RTT correctly because of GC.

Now I see.

the value of maxInflightRequest goes up and reaches the threshold I've configured, stressing the application. So I'm getting rejections for requests that should have been processed.

It seems they shouldn't have been processed, according to your configuration. You limited the number of in-flight requests, and a long GC cycle means your application is struggling, so it should reject some requests to be able to recover from the GC impact.

If your service is latency-critical it might be fine to reject them (a retried request will likely land on a less busy server); if extra waiting is acceptable, you can queue them.
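If queueing fits the use case, the library also ships a BlockingLimiter wrapper that blocks the caller until a slot frees up instead of returning an empty permit. A rough sketch, assuming that wrapper and the default Vegas limit:

```java
import com.netflix.concurrency.limits.Limiter;
import com.netflix.concurrency.limits.limit.VegasLimit;
import com.netflix.concurrency.limits.limiter.BlockingLimiter;
import com.netflix.concurrency.limits.limiter.SimpleLimiter;

public class QueueInsteadOfReject {
    public static void main(String[] args) {
        Limiter<Void> rejecting = SimpleLimiter.newBuilder()
                .limit(VegasLimit.newDefault())
                .build();

        // Instead of returning Optional.empty() when the limit is hit,
        // acquire() now blocks until an in-flight request completes.
        Limiter<Void> queueing = BlockingLimiter.wrap(rejecting);

        queueing.acquire(null).ifPresent(listener -> {
            try {
                // ... handle the request ...
            } finally {
                listener.onSuccess();
            }
        });
    }
}
```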

martinlocurcio commented 4 years ago

RTT calculation is affected for sure, but since I'm using a windowed strategy with the 0.5 percentile instead of an average, it doesn't cause a real issue.
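To illustrate why the p50 barely moves: in a window like [5, 6, 5, 7, 400] ms, one GC-inflated sample leaves the median at 6 ms while the average jumps to roughly 85 ms. A hypothetical aggregation sketch (not the library's implementation):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: a GC-inflated sample lands in the upper half of
// the sorted window, so the p50 stays near the typical RTT while an average
// would be dragged up immediately.
class WindowedP50 {
    private final List<Long> window = new ArrayList<>();

    void addRttSample(long rttMillis) {
        window.add(rttMillis);
    }

    long p50AndReset() {
        List<Long> sorted = new ArrayList<>(window);
        Collections.sort(sorted);
        window.clear();
        return sorted.get(sorted.size() / 2);
    }
}
```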

I'm not sure that a long GC cycle implies the application is truly struggling. Before implementing the concurrency limiter there were some timeouts on the client side (400 ms, including network time). But I'd like to show one quick example where the outcome is not the best.

Let's say my maximum number of concurrent requests is 3. R1, R2, R3 arrive at my web application running in a Tomcat container, which means those 3 requests were first put into an OS queue and then polled by the container (https://medium.com/netflix-techblog/tuning-tomcat-for-a-high-throughput-fail-fast-system-e4d7b2fc163f). Then a GC pause is triggered. Meanwhile, another 3 requests arrive: R4, R5, R6 are added to the OS queue waiting for the GC pause to end, and those 3 requests cannot be rejected because the application is paused. The GC pause finishes, R1 is answered (current in-flight is now 2), R4 is polled (current in-flight is now 3), and R5 and R6 are rejected.

What if R6 arrived 1 ms before the GC pause finished? That request will be rejected even though it has spent far less time in the system than R4, which will be answered, and the rejection may trigger a timeout on the client side if the GC pause was long enough.
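A compact replay of that timeline (purely illustrative, fixed limit of 3):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Purely illustrative replay of the timeline above: R2 and R3 are still in
// flight when the pause ends, so only the slot freed by R1 is available for
// R4-R6, and their order in the OS queue decides who gets it.
class GcPauseTimeline {
    static final int LIMIT = 3;
    static final AtomicInteger inflight = new AtomicInteger();

    static void tryAccept(String name) {
        if (inflight.incrementAndGet() > LIMIT) {
            inflight.decrementAndGet();
            System.out.println(name + " rejected");
        } else {
            System.out.println(name + " accepted (inflight=" + inflight.get() + ")");
        }
    }

    public static void main(String[] args) {
        tryAccept("R1"); tryAccept("R2"); tryAccept("R3");  // all accepted, inflight = 3
        // --- stop-the-world pause: R4, R5, R6 wait in the OS queue, nothing completes ---
        inflight.decrementAndGet();                          // pause ends, R1 is answered
        tryAccept("R4");  // accepted, takes the freed slot
        tryAccept("R5");  // rejected
        tryAccept("R6");  // rejected, despite arriving just before the pause ended
    }
}
```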

This scenario supports the sidecar proxy solution you mentioned before, because the proxy would have rejected R4, R5, and R6 immediately.

IgorPerikov commented 4 years ago

I'm not sure that a long GC cycle implies the application is truly struggling.

From my point of view, of course it is (I even had a production incident because of a long GC 😄). Sure, long GC can look like a false-positive sensor. But imagine changing your API so it returns much more data (and therefore allocates more objects on the heap). In that case RTT goes up and the system spends more time in GC, which means the previous static limit is no longer valid because operations are more expensive now. If you let requests flow in at the same rate as before, the server will be flooded with work and start to slowly degrade; spending more CPU cycles on GC will hurt latency and overall health. So a long GC cycle is a signal to refuse the excess requests and probably to lower the limit.
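That's essentially the idea behind the gradient-style adaptive limits in this library: when the measured RTT drifts above the no-load RTT, the limit is scaled down instead of staying static. A simplified sketch of that idea (not the exact GradientLimit/Gradient2Limit code):

```java
// Simplified sketch of the gradient idea: when measured RTT rises relative to
// the no-load RTT - e.g. because responses got bigger and GC got more
// expensive - the limit shrinks instead of staying static.
class GradientSketch {
    private double limit = 20;          // current concurrency limit
    private final double queueSize = 4; // headroom allowed on top of the scaled limit

    void onSample(double rttNoLoadMillis, double rttActualMillis) {
        double gradient = Math.max(0.5, Math.min(1.0, rttNoLoadMillis / rttActualMillis));
        limit = limit * gradient + queueSize;
    }

    public static void main(String[] args) {
        GradientSketch sketch = new GradientSketch();
        sketch.onSample(5, 5);   // healthy: limit stays around its current value
        sketch.onSample(5, 50);  // RTT x10 (bigger payloads, more GC): limit drops sharply
        System.out.printf("limit after degradation: %.1f%n", sketch.limit);
    }
}
```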

martinlocurcio commented 4 years ago

Sure, long GC can look like a false-positive sensor.

I think that is the key to my question. One thing I definitely have to do is check whether another GC algorithm, or tuning CMS a bit further, improves the overall performance of the application.

But I also believe there has to be a way to avoid falling into those false positives. I'm pretty sure an embedded solution is not the way to proceed, so I'll look into implementing a proxy/API management gateway for my current architecture.

Thanks for sharing your ideas!