apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.2k stars 1.21k forks source link

Fix race condition in AdaptiveServerSelection and misc fixes #13104

Closed vvivekiyer closed 3 weeks ago

vvivekiyer commented 3 weeks ago

Contains 2 fixes

1. Adaptive Server Selection - race condition:

A race condition between jetty threads and netty threads can result in setting negative values for numInFlightRequests for servers. This can result in that particular server being overloaded when compared to other.

It's difficult to reproduce this but the race-condition is obvious from code-reading.

The race condition is explained below Let's say a query is routed to 2 servers S1 and S2. Say the query has a timeout of 1s. The race condition timeline is as follows: T1: Query is routed to S1 and S2. The ADSS stats will look as follows: S1 Stats = { numInFlightRequests = 1 } S2 Stats = { numInFlightRequests = 1 }

T2: S1 responds with the results (dataTable). The ADSS stats will be updated to look as follows. Note that this update is by the netty thread that receives the response. S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 1 }

T3: Let's say the query timed out. The jetty thread will update the ADSS stats for S2 as per code to look as follows: S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = 0 }

T4: Before the jetty thread removes the QueryResponse object for the request, the server S2 could respond and the corresponding netty thread would update the ADSS stats incorrectly to look as follows S1 Stats = { numInFlightRequests = 0 } S2 Stats = { numInFlightRequests = -1 }

2. Updates client error list to add a few more exceptions.

codecov-commenter commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 50.00000% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 62.11%. Comparing base (59551e4) to head (b2724d6). Report is 416 commits behind head on master.

Files Patch % Lines
...pache/pinot/core/transport/AsyncQueryResponse.java 50.00% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #13104 +/- ## ============================================ + Coverage 61.75% 62.11% +0.36% + Complexity 207 198 -9 ============================================ Files 2436 2514 +78 Lines 133233 137786 +4553 Branches 20636 21319 +683 ============================================ + Hits 82274 85583 +3309 - Misses 44911 45787 +876 - Partials 6048 6416 +368 ``` | [Flag](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration1](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration2](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | | | [java-11](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.11% <50.00%> (+0.40%)` | :arrow_up: | | [java-21](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-61.63%)` | :arrow_down: | | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.11% <50.00%> (+0.36%)` | :arrow_up: | | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (-27.73%)` | :arrow_down: | | [temurin](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.11% <50.00%> (+0.36%)` | :arrow_up: | | [unittests](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.10% <50.00%> (+0.36%)` | :arrow_up: | | [unittests1](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `46.77% <50.00%> (-0.12%)` | :arrow_down: | | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/13104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.74% <0.00%> (+0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jadami10 commented 1 week ago

@vvivekiyer thank you for working on this fix! This actually bit us a month ago (only time in ~2 years), but we restarted the brokers before grabbing the routing stats, so we couldn't root cause.