Open ankitsultana opened 1 year ago
@ankitsultana Have you tried resetting such segments?
I haven't tested that yet, but I think reset will take the segment offline, which may cause the segment to be skipped from queries altogether or cause queries to fail (segment unavailable exception).
Usually this issue happens for us in one of the replicas of the segment so it doesn't impact in-flight queries.
Reset also allows resetting a segment on one particular server only, by setting the targetInstance parameter in the /segments/{tableNameWithType}/{segmentName}/reset API.
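For reference, a minimal sketch of building that call in Python. Only the path and the targetInstance parameter come from the comment above; the helper name, controller address, table, segment, and server instance names are placeholders made up for illustration. The request is built but not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_reset_request(controller_url, table_name_with_type, segment_name,
                        target_instance=None):
    """Build (but do not send) a POST request for the segment reset endpoint.

    build_reset_request is a hypothetical helper, not part of any Pinot client.
    """
    path = "/segments/{}/{}/reset".format(
        quote(table_name_with_type, safe=""), quote(segment_name, safe="")
    )
    url = controller_url.rstrip("/") + path
    if target_instance:
        # Restrict the reset to a single server instead of all replicas.
        url += "?" + urlencode({"targetInstance": target_instance})
    return Request(url, method="POST")


# Placeholder values; substitute your own controller, table, segment and server.
req = build_reset_request(
    "http://localhost:9000",
    "myTable_REALTIME",
    "myTable__0__12__20240101T0000Z",
    "Server_host1_8098",
)
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would actually send it; leaving out target_instance produces the same URL without the query string.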
I see. Thanks, we can try it out next time (I also need to read a bit about this).
Regardless though, I think we should try to fix the state transition as well.
Here is the difference between these similar terms: https://docs.pinot.apache.org/basics/getting-started/frequent-questions/operations-faq#whats-the-difference-to-reset-refresh-or-reload-a-segment
Reload is not performed through a state transition. We can consider adding a controller periodic task to automatically reset the error segments. IIRC, we don't always do the error -> offline reset because it might run into infinite loading for a bad segment.
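To illustrate the infinite-loading concern, here is a rough sketch of how a periodic task could cap the number of resets per segment. This is not Pinot code; the function name, the attempt counter, and the cap of 3 are all assumptions chosen for the example.

```python
from collections import defaultdict

MAX_RESET_ATTEMPTS = 3  # hypothetical cap, not a Pinot setting


def select_segments_to_reset(error_segments, attempts,
                             max_attempts=MAX_RESET_ATTEMPTS):
    """Return the error segments that still have reset attempts left.

    `attempts` maps segment name -> resets already issued and is updated
    in place, so a segment that keeps failing is eventually left alone.
    """
    to_reset = []
    for segment in error_segments:
        if attempts[segment] < max_attempts:
            attempts[segment] += 1
            to_reset.append(segment)
    return to_reset


# Simulate four runs of the periodic task with one permanently bad segment.
attempts = defaultdict(int)
history = [select_segments_to_reset(["seg_bad"], attempts) for _ in range(4)]
print(history)
```

In this simulation the bad segment is selected on the first three runs and skipped on the fourth, which is the behavior that would avoid an endless reset loop.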
@Jackie-Jiang: Any concerns in making the zk call? We actually make a zk call anyway to get the metadata in reloadSegmentWithMetadata if the segment is not a mutable one.
@ankitsultana Since reload doesn't follow the regular state transition (it is a custom message), re-downloading the segment won't bring it back to the ONLINE state. It will cause an inconsistency between the server's current state and the segment status.
I have seen this behavior quite often in our systems where a segment would go into error state in one of the servers and a reload doesn't fix the issue. However if we restart the server then the segment becomes healthy again. On checking the logs, I often see something like this:
This is the corresponding code:
https://github.com/apache/pinot/blob/3772b55dc4c35673762a182b2ee650469560aa97/pinot-server/src/main/java/org/apache/pinot/server/starter/helix/HelixInstanceDataManager.java#L277
I was wondering: if we can't find the segment metadata locally, can we fetch it from ZK? Also, is there a way for the server to auto-recover from such a situation?
One of the cases where I have seen this issue happen is when there's a server restart and an in-flight onBecomeConsumingFromOffline is killed. When the server comes back up, I only see it log that this segment is in error in ServiceStatus.