apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Error Masked in Peer Server Segment Finder #14276

Open ankitsultana opened 3 weeks ago

ankitsultana commented 3 weeks ago

We often get error segments due to the inability of one of the replicas to download an online segment from a peer.

Most of the time, we aren't able to see a clear error message for this. The stack trace looks something like the following:

org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
     at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
     at org.apache.pinot.core.util.PeerServerSegmentFinder.getPeerServerURIs(PeerServerSegmentFinder.java:81)
     at org.apache.pinot.core.util.PeerServerSegmentFinder.getPeerServerURIs(PeerServerSegmentFinder.java:67)
     at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.lambda$downloadSegmentFromPeer$4(RealtimeTableDataManager.java:666)
     at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.lambda$fetchSegmentToLocal$2(BaseSegmentFetcher.java:127)
     at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50)
     at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.fetchSegmentToLocal(BaseSegmentFetcher.java:126)
     at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadSegmentFromPeer(RealtimeTableDataManager.java:663)
     at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadAndReplaceSegment(RealtimeTableDataManager.java:606)
     at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.downloadSegmentAndReplace(RealtimeSegmentDataManager.java:1294)
     at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.goOnlineFromConsuming(RealtimeSegmentDataManager.java:1233)

The corresponding code at the commit from which this build was made is shown below. From an operational point of view, I think we need the following improvements here:

  1. If the predicate onlineServers.isEmpty() is still true after all attempts, the logs should clearly indicate that this was the reason the attempts were exhausted.
  2. There should be some way to log the last instance state map seen for this segment during the retries. This would help in knowing the exact external view (EV) the servers were seeing at the time of the failure.
  3. If an exception is thrown in getOnlineServersFromExternalView, that exception should be clearly logged. Maybe this is already happening?
image
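To make the three asks concrete, here is a rough, self-contained sketch of a retry loop that remembers why it failed. It does not use Pinot's actual classes: PeerFinderRetrySketch, PeerLookupException, findOnlineServers, and the Supplier standing in for the external view lookup are all hypothetical names for illustration, and the real fix would have to fit into BaseRetryPolicy and PeerServerSegmentFinder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical sketch, not Pinot code: a retry loop that reports why it gave up.
final class PeerFinderRetrySketch {

  /** Hypothetical exception carrying the failure reason and the last state map seen. */
  static final class PeerLookupException extends RuntimeException {
    PeerLookupException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Polls the supplier (a stand-in for reading the segment's instance state map
   * from the external view) up to maxAttempts times and returns the ONLINE servers.
   */
  static List<String> findOnlineServers(Supplier<Map<String, String>> stateMapSupplier,
      int maxAttempts) {
    Map<String, String> lastSeen = null;   // improvement 2: last instance state map seen
    RuntimeException lastError = null;     // improvement 3: last exception from the lookup

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        Map<String, String> stateMap = stateMapSupplier.get();
        lastError = null;
        if (stateMap == null) {
          continue; // nothing to inspect on this attempt
        }
        lastSeen = new TreeMap<>(stateMap); // sorted copy for stable log output
        List<String> online = new ArrayList<>();
        for (Map.Entry<String, String> entry : lastSeen.entrySet()) {
          if ("ONLINE".equals(entry.getValue())) {
            online.add(entry.getKey());
          }
        }
        if (!online.isEmpty()) {
          return online;
        }
      } catch (RuntimeException e) {
        lastError = e;
      }
    }

    // Improvement 1: say explicitly whether we ran out of attempts because
    // no server was ONLINE or because the lookup itself kept throwing.
    String reason = (lastError != null)
        ? "last attempt threw " + lastError
        : "no ONLINE server found";
    throw new PeerLookupException("Peer lookup failed after " + maxAttempts
        + " attempts: " + reason + "; last instance state map seen: " + lastSeen, lastError);
  }
}
```

With something along these lines, the final exception message would carry both the reason and the last state map, and any exception thrown by the lookup would be attached as the cause instead of being hidden behind a bare AttemptsExceededException.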
ankitsultana commented 2 weeks ago

@alguiguilo098 : yup go ahead.