Open ankitsultana opened 1 year ago
@ankitsultana this is a really important topic, just added it to the resiliency / robustness parent issue.
What do you think of adding a tasklist and short child issues for the open parts?
For simple syntax and advantages, see: https://github.com/apache/pinot/issues/13436#issuecomment-2178829380
The current graceful shutdown steps are listed below in order (along with the issues):
1. PinotFSFactory is shut down first. Issue: this fails the segment upload for any segments that are committing while the server is shutting down.
2. SegmentDataManager. For realtime tables, this attempts to stop ingestion by joining with the consumer thread of each consuming segment. (No issue with this.)

All this means that we could run into scenarios where a segment goes into error state: the deep-store link for the segment is missing, and peer download wouldn't work because server restarts usually take at least 2 minutes while the current retry logic only waits 10-15s for an Online peer. (By default the replica will take 31 seconds to catch up to the final offset; if it can't, then we try to download the segment instead.)
If instead of a server restart it was a host failure, then we could also have data loss.
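To make the ordering problem concrete, here is a minimal sketch of one possible alternative: stop consumption first, wait (with a bound) for in-flight segment commits to finish, and only then close the filesystem. This is an illustration under assumptions, not the fix being worked on. `ShutdownOrderSketch`, `commitFinished`, and the step names are all hypothetical and are not Pinot APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names are Pinot classes or methods.
public class ShutdownOrderSketch {
    /** Records the order in which the shutdown steps ran. */
    private final List<String> log = new ArrayList<>();
    /** Counts segment commits (including deep-store uploads) still in flight. */
    private final CountDownLatch inFlightCommits;

    public ShutdownOrderSketch(int commitsInFlight) {
        this.inFlightCommits = new CountDownLatch(commitsInFlight);
    }

    /** Called by a (hypothetical) committer once its upload has finished. */
    public void commitFinished() {
        inFlightCommits.countDown();
    }

    /** Runs the steps in order and returns the recorded step names. */
    public List<String> shutdown(long timeoutMs) throws InterruptedException {
        // Step 1: stop ingestion so that no new commits can start.
        log.add("stop-consumers");
        // Step 2: give in-flight commits a bounded amount of time to upload.
        boolean drained = inFlightCommits.await(timeoutMs, TimeUnit.MILLISECONDS);
        log.add(drained ? "commits-drained" : "commit-wait-timed-out");
        // Step 3: only now close the filesystem used for deep-store uploads.
        log.add("close-filesystem");
        return log;
    }

    public static void main(String[] args) throws InterruptedException {
        ShutdownOrderSketch sketch = new ShutdownOrderSketch(1);
        // Simulate a committing segment that finishes on another thread.
        new Thread(sketch::commitFinished).start();
        System.out.println(sketch.shutdown(5_000));
    }
}
```

The bounded wait is a deliberate choice in this sketch: shutdown still completes if a commit hangs, at the cost of possibly hitting the same upload failure for that segment.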
I am working offline with some stakeholders for the fixes.
Exception seen due to closed ZkClient:
cc: @Jackie-Jiang