Closed: vvivekiyer closed this 3 weeks ago
Attention: Patch coverage is 50.00%, with 1 line in your changes missing coverage. Please review.
Project coverage is 62.11%. Comparing base (59551e4) to head (b2724d6). Report is 416 commits behind head on master.
Files | Patch % | Lines |
---|---|---|
...pache/pinot/core/transport/AsyncQueryResponse.java | 50.00% | 0 Missing and 1 partial :warning: |
@vvivekiyer thank you for working on this fix! This actually bit us a month ago (only time in ~2 years), but we restarted the brokers before grabbing the routing stats, so we couldn't root cause.
Contains 2 fixes:
1. Adaptive Server Selection (ADSS) - race condition:
A race condition between jetty threads and netty threads can set numInFlightRequests to a negative value for a server. A negative count makes that server look less loaded than it really is, so it can end up receiving more queries than the other servers and become overloaded.
This is difficult to reproduce, but the race condition is evident from reading the code.
The race condition is explained below.
Let's say a query is routed to 2 servers, S1 and S2, with a timeout of 1s. The timeline is as follows:
T1: The query is routed to S1 and S2. The ADSS stats will look as follows:
S1 Stats = { numInFlightRequests = 1 } S2 Stats = { numInFlightRequests = 1 }
T2: S1 responds with the results (dataTable). The netty thread that receives the response updates the ADSS stats to look as follows:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 1 }
T3: The query times out. The jetty thread decrements the ADSS stats for S2, which has not responded yet, so they look as follows:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 0 }
T4: Before the jetty thread removes the QueryResponse object for the request, S2 could respond. The corresponding netty thread would then decrement the S2 stats a second time, leaving them incorrect:
S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = -1 }
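The T1 to T4 timeline can be replayed deterministically in a small, self-contained sketch. This is not Pinot code and not necessarily how this PR fixes the issue: `InFlightRaceSketch`, `ServerStats`, `QueryTracker`, `release`, and `replay` are hypothetical names used only for illustration. The sketch assumes one possible guard, namely decrementing a server's counter only if that server can still be atomically removed from a per-query pending set, so that whichever thread gets there first (jetty on timeout, netty on response) is the only one that decrements.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightRaceSketch {

  /** Hypothetical stand-in for the per-server ADSS stats. */
  static final class ServerStats {
    final AtomicInteger numInFlightRequests = new AtomicInteger();
  }

  /** Hypothetical stand-in for the per-query response object. */
  static final class QueryTracker {
    private final Map<String, ServerStats> stats;
    // Servers whose in-flight count has not been released yet for this query.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final boolean guarded;

    QueryTracker(Map<String, ServerStats> stats, boolean guarded, String... servers) {
      this.stats = stats;
      this.guarded = guarded;
      for (String server : servers) {
        pending.add(server);
        stats.computeIfAbsent(server, s -> new ServerStats())
            .numInFlightRequests.incrementAndGet();
      }
    }

    /** Netty-thread path: a server responded. */
    void onServerResponse(String server) {
      release(server);
    }

    /** Jetty-thread path: the query timed out; release servers still pending. */
    void onTimeout() {
      for (String server : new ArrayList<>(pending)) {
        release(server);
      }
    }

    private void release(String server) {
      // Set.remove returns true only for the caller that actually removed the entry.
      boolean wasPending = pending.remove(server);
      if (!guarded || wasPending) {
        stats.get(server).numInFlightRequests.decrementAndGet();
      }
    }
  }

  /** Replays T1..T4 on a single thread and returns S2's final in-flight count. */
  static int replay(boolean guarded) {
    Map<String, ServerStats> stats = new ConcurrentHashMap<>();
    QueryTracker query = new QueryTracker(stats, guarded, "S1", "S2"); // T1
    query.onServerResponse("S1");                                      // T2
    query.onTimeout();                                                 // T3
    query.onServerResponse("S2");                                      // T4 (late response)
    return stats.get("S2").numInFlightRequests.get();
  }

  public static void main(String[] args) {
    System.out.println("unguarded S2 = " + replay(false)); // prints "unguarded S2 = -1"
    System.out.println("guarded   S2 = " + replay(true));  // prints "guarded   S2 = 0"
  }
}
```

In the unguarded replay the late response at T4 decrements S2 a second time, reproducing the -1 above. With the guard, the late response finds S2 already removed from the pending set and leaves the counter at 0.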
2. Updates client error list to add a few more exceptions.