apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Pinot Controller leader re-election not triggered #13990

Closed by cypherean 2 weeks ago

cypherean commented 1 month ago

We faced an issue in production in which the controller leader node went down but re-election was not triggered. This led to segment upload errors for tables trying to commit a segment, instances being marked as unavailable for those segments, and finally queries failing on the affected tables with a "segments unavailable" error. Pinot version: 1.0.x

Timeline for this was as follows:

  1. A GET call for a large table's segment metadata, tables/<tablename>/segments/<segmentName>/metadata?columns=*, spawned ~75k threads. This caused a huge memory spike and the heap ran out of memory (heap size was 128GB), possibly crashing the node. We suspect the trigger was the reload status button, which issues the segment metadata call. (Screenshot attached.)

  2. The node's 2 ZK sessions timed out at 17:19:56. The node tried to re-establish the connection, but it kept emitting metrics as leader until 17:29:x. (Screenshot attached.)

  3. The health check for the node started failing around 17:20:x, but the standby controller nodes kept polling and getting the failed leader's session ID as leader until 18:10, when we triggered a force replacement of the failed node. (Screenshot attached.)

Instance 123 is not leader of cluster production-cluster due to current session 702147796290178 does not match leader session 702147796290171

Ideally the re-election should've triggered around 17:20.

Jackie-Jiang commented 1 month ago

Controller leader election is managed by Helix, so you might get more help by posting this to the Apache Helix project.

On the Pinot side, we should triage why a metadata fetch call can spawn 75k threads. It doesn't look correct to me.

tibrewalpratik17 commented 1 month ago

> On the Pinot side, we should triage why a metadata fetch call can spawn 75k threads. It doesn't look correct to me.

Yes, we were able to identify the root cause. When fetching server metadata, we spawn one thread per segment. Ref: https://github.com/apache/pinot/blob/266073eee0a56ae811c65cb0828cff294212aa48/pinot-controller/src/main/java/org/apache/pinot/controller/util/ServerSegmentMetadataReader.java#L185-L194

Additionally, we use an unbounded thread pool to handle these segment-level calls, which causes up to 75k threads to be spawned: https://github.com/apache/pinot/blob/266073eee0a56ae811c65cb0828cff294212aa48/pinot-controller/src/main/java/org/apache/pinot/controller/BaseControllerStarter.java#L254-L255

Ideally, we should be making one call per server rather than one call per segment, and we need to limit the number of threads in this pool to a more reasonable number.
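To illustrate the direction described above, here is a minimal sketch, not the actual Pinot change. It groups segments by the server that hosts them, issues one task per server, and runs those tasks on a fixed-size pool. The class name `BoundedMetadataFetch`, the `MAX_THREADS` cap, and the `fetchFromServer` callback are all hypothetical and stand in for the real controller code and HTTP calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Hypothetical illustration only; none of these names exist in Pinot.
class BoundedMetadataFetch {
  // Assumed cap on concurrent fetches; a real value would need tuning.
  static final int MAX_THREADS = 16;

  // Groups segment names by hosting server so each server gets a single request.
  static Map<String, List<String>> groupByServer(Map<String, String> segmentToServer) {
    Map<String, List<String>> byServer = new TreeMap<>();
    for (Map.Entry<String, String> entry : segmentToServer.entrySet()) {
      byServer.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(entry.getKey());
    }
    return byServer;
  }

  // Fetches metadata with one task per server on a bounded pool.
  // fetchFromServer is a stand-in for the real per-server call.
  static Map<String, String> fetchAll(Map<String, String> segmentToServer,
      BiFunction<String, List<String>, Map<String, String>> fetchFromServer) {
    Map<String, List<String>> byServer = groupByServer(segmentToServer);
    int threads = Math.max(1, Math.min(MAX_THREADS, byServer.size()));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Map<String, String>>> futures = new ArrayList<>();
      for (Map.Entry<String, List<String>> entry : byServer.entrySet()) {
        Callable<Map<String, String>> task =
            () -> fetchFromServer.apply(entry.getKey(), entry.getValue());
        futures.add(pool.submit(task));
      }
      Map<String, String> metadata = new TreeMap<>();
      for (Future<Map<String, String>> future : futures) {
        metadata.putAll(future.get());
      }
      return metadata;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while fetching metadata", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Metadata fetch failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Six segments spread over two servers.
    Map<String, String> segmentToServer = new TreeMap<>();
    for (int i = 0; i < 6; i++) {
      segmentToServer.put("seg_" + i, "server_" + (i % 2));
    }
    Map<String, String> metadata = fetchAll(segmentToServer, (server, segments) -> {
      Map<String, String> out = new TreeMap<>();
      for (String segment : segments) {
        out.put(segment, "metadata-from-" + server);
      }
      return out;
    });
    System.out.println(metadata.size() + " segments fetched using "
        + groupByServer(segmentToServer).size() + " server calls");
  }
}
```

With this shape, the number of threads is bounded by the cap and the number of tasks by the server count, not the segment count. The sketch creates a pool per call for brevity; a long-lived shared bounded pool would likely be the better fit in a controller.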

Regarding the leader election, we are still investigating why it didn’t trigger when the ZooKeeper client session was lost. We plan to raise an issue in Helix and will link it here. Thank you!

cypherean commented 1 month ago

We'll address the above issues: