apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Pinot Server Graceful Shutdown Improvements #10876

Open ankitsultana opened 1 year ago

ankitsultana commented 1 year ago

The current graceful shutdown steps are listed below in order (along with the issues):

  1. PinotFSFactory is shut down first. Issue: segment uploads fail for any segment that is committing while the server is shutting down.
  2. HelixManager disconnects, which shuts down the ZkClient. Issue: this happens before the segment data managers are destroyed, so a segment that attempts a commit after the disconnect throws because the ZkClient is already closed.
  3. Server instance shutdown calls destroy on every SegmentDataManager. For realtime tables, this stops ingestion by joining the consumer thread of each consuming segment. (No issue with this step.)
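
To make the ordering problem concrete, here is a minimal sketch of the three steps. It is a toy model, not Pinot code: the class, the string constants, and the method name are hypothetical stand-ins, and the reordered sequence is only one possible fix, not necessarily the one being worked on. It assumes that a commit can still be in flight until the segment data managers are destroyed, and that such a commit needs both PinotFS and the ZkClient.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownOrderSketch {
    // Hypothetical labels for the three shutdown steps; these are not Pinot APIs.
    static final String FS = "PinotFS";
    static final String HELIX = "HelixManager";
    static final String SEGMENTS = "SegmentDataManagers";

    /**
     * Walks the shutdown steps in order and returns the dependencies that are
     * already closed when the segment data managers are destroyed, i.e. what an
     * in-flight commit would find unavailable.
     */
    static List<String> closedDependenciesAtCommit(List<String> order) {
        List<String> closed = new ArrayList<>();
        for (String step : order) {
            if (step.equals(SEGMENTS)) {
                // Consumer threads are joined here, so a commit may still be running.
                return closed;
            }
            closed.add(step);
        }
        return closed;
    }

    public static void main(String[] args) {
        List<String> current = List.of(FS, HELIX, SEGMENTS);
        // Assumed alternative: stop consumption first, then Helix, then PinotFS.
        List<String> reordered = List.of(SEGMENTS, HELIX, FS);
        System.out.println("current order, closed at commit: " + closedDependenciesAtCommit(current));
        System.out.println("reordered, closed at commit: " + closedDependenciesAtCommit(reordered));
    }
}
```

With the current order, both dependencies are closed before the last commit can finish. Destroying the segment data managers first leaves nothing closed at that point.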

All this means that a segment can end up in error state: the deep-store link for the segment is missing, and peer download doesn't work because a server restart usually takes 2 minutes or more, while the current retry logic waits only 10-15s for an Online peer. (By default the replica gets 31 seconds to catch up to the final offset; if it can't, it tries to download the segment instead.)
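
The timing gap can be checked with a few lines of arithmetic. This is a rough sketch using the figures quoted above. The timeline (catch-up first, then peer-download retries) is a simplifying assumption, and the class and method names are hypothetical.

```java
public class PeerDownloadWindowSketch {
    /**
     * Returns true if the restarting server is back before the replica stops
     * looking for an Online peer. Assumed timeline: the replica first spends
     * catchUpSeconds trying to reach the final offset, then retries peer
     * download for retryWindowSeconds.
     */
    static boolean peerBackInTime(int restartSeconds, int catchUpSeconds, int retryWindowSeconds) {
        return restartSeconds <= catchUpSeconds + retryWindowSeconds;
    }

    public static void main(String[] args) {
        // Figures from the issue: ~2 minute restart, 31s catch-up, up to 15s of retries.
        System.out.println(peerBackInTime(120, 31, 15)); // false: 46s < 120s
        // A hypothetical longer retry window that would cover the restart.
        System.out.println(peerBackInTime(120, 31, 90)); // true: 121s >= 120s
    }
}
```

Under these assumptions the replica gives up well before the restarted server comes back online, which matches the error-state scenario described above.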

If instead of a server restart it was a host failure, then we could also have data loss.

I am working offline with some stakeholders on the fixes.


Exception seen due to closed ZkClient:

2023-06-09 03:00:29.552 [some_table__242__1314__20230609T0216Z] ERROR o.a.p.c.d.m.r.LLRealtimeSegmentDataManager_some_table__242__1314__20230609T0216Z  - Exception while in work
java.lang.IllegalStateException: ZkClient already closed!
    at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1977)
    at org.apache.helix.zookeeper.zkclient.ZkClient.readData(ZkClient.java:2139)
    at org.apache.helix.zookeeper.zkclient.ZkClient.readData(ZkClient.java:2131)
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:495)
    at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:397)
    at org.apache.helix.store.zk.AutoFallbackPropertyStore.get(AutoFallbackPropertyStore.java:101)
    at org.apache.pinot.common.metadata.ZKMetadataProvider.getTableConfig(ZKMetadataProvider.java:308)
    at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.replaceLLSegment(RealtimeTableDataManager.java:665)
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.commitSegment(LLRealtimeSegmentDataManager.java:1012)
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:728)
    at java.base/java.lang.Thread.run(Thread.java:829)

cc: @Jackie-Jiang

hpvd commented 2 weeks ago

@ankitsultana this is a really important topic; I just added it to the resiliency / robustness parent issue.

What do you think of adding a tasklist and short child issues for the open parts?

For the simple syntax and its advantages, see: https://github.com/apache/pinot/issues/13436#issuecomment-2178829380