InterruptedException when acquiring partitonConsumerSemaphore

apache / pinot

Apache Pinot - A realtime distributed OLAP datastore

https://pinot.apache.org/

Apache License 2.0

5.38k stars 1.26k forks source link

InterruptedException when acquiring partitonConsumerSemaphore #12390

Open eaugene opened 6 months ago

eaugene commented 6 months ago

We are encountering the following issue when OFFLINE-> ONLINE transition of the segment

Caught exception in state transition from OFFLINE -> ONLINE for resource: <TableName>, partition:<Segment> 
java.lang.RuntimeException: InterruptedException when acquiring the partitionConsumerSemaphore for segment: rta_eater_rtp_wav_baf_prod__0__4985__20240209T0153Z
    at o.a.p.c.d.m.r.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1432)
    at o.a.p.c.d.m.r.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:380)
    at o.a.p.s.s.h.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:221)
    at o.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:161)
    at o.a.p.s.s.h.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:83)
    at j.i.r.GeneratedMethodAccessor1558.invoke(Unknown Source)
    at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at o.a.h.m.h.HelixStateTransitionHandler...

Pinot Version : 0.12

cc : @ankitsultana

ankitsultana commented 6 months ago

The exact issue is described below (all of these are confirmed via logs):

Server has a full GC which leads to ZK and Helix disconnection.
When ZK reconnects, a bunch of OFFLINE to CONSUMING messages are sent
We see the exception above.

Current Theory: When ZK disconnection happens, the PartitionConsumer thread is still alive and holding the semaphore, and so when Helix sends OFFLINE to CONSUMING transition again for that segment, the Segment Data Manager fails to init.

I am low on time right now so can't dig deeper. Wondering if anyone can hint at some potential solutions.

I had also seen this somewhat related issue from a few years ago: https://github.com/apache/pinot/issues/7874

cc: @Jackie-Jiang

ankitsultana commented 6 months ago

Took a deep-dive today and found the root-cause. Here's the sequence of events which triggers this:

We have a consuming segment on a server. On time of commit, it receives a DISCARD from the controller.
There's a Huge GC which triggers ZK disconnection and hence Helix disconnection.
After reconnect, instead of receiving a CONSUMING to ONLINE transition, the segment receives a OFFLINE to ONLINE transition.
Since the TDM already has a Segment Data Manager (SDM) of the consuming segment, we end up skipping the addSegment call and mark it succeeded anyways. RealtimeSegmentDataManager.java#L387

So the RealtimeSegmentDataManager is never destroyed and the _partitionGroupConsumerSemaphore is never released.

As more and more segments for this partition are committed on other servers, we receive many OFFLINE to CONSUMING transitions, which pile up.

When there's another big GC, and say Helix reconnects, then the HelixTaskExecutor will be shutdown, leading to interruption of all the semaphore acquire calls and the original error message above. (note onBecomeConsumingFromOffline in the stack trace).

@Jackie-Jiang : this is an additional case where RealtimeTableDataManager#addSegment can be called. Do you have any recommendations on how to handle this?

Jackie-Jiang commented 6 months ago

@ankitsultana Can you check why the server received OFFLINE to ONLINE state transition instead of CONSUMING to ONLINE state transition? Our current state model doesn't consider this scenario, so we should try to find out why Helix sent this state transition when the segment is actually in CONSUMING state. Did it receive a CONSUMING to OFFLINE state transition before OFFLINE to ONLINE?

ankitsultana commented 6 months ago

Ah my bad, I wasn't clear in the description above. Soon after the segment receives a DISCARD from the controller, there's a Helix disconnection. And a few seconds later the server reconnects and is back online.

My understanding is that after Helix reconnects, the segment will show up as OFFLINE and hence trigger the OFFLINE to ONLINE transition.

Sampled some logs which show this:

{"@timestamp":"2024-02-12T11:17:02.203+00:00","message":"Controller response {\"buildTimeSec\":-1,\"isSplitCommitType\":false,\"streamPartitionMsgOffset\":null,\"offset\":-1,\"status\":\"DISCARD\"} for http://redacted:1234/segmentConsumed?
reason=timeLimit&streamPartitionMsgOffset=368526984&instance=redacted&offset=-1&name=redacted__44__3696__20240212T1016Z&rowCount=48937&memoryUsedBytes=80637706","logger_name":"org.apache.pinot.server.realtime
.ServerSegmentCompletionProtocolHandler","thread_name":"redacted__44__3696__20240212T1016Z","level":"INFO"}
....
{"@timestamp":"2024-02-12T11:17:03.801+00:00","message":"Unable to reconnect to ZooKeeper service, session 0x500e8e168947a76 has expired","logger_name":"org.apache.zookeeper.ClientCnxn","thread_name":"main-SendThread(redacted:2181)","level":"WARN"}

Jackie-Jiang commented 6 months ago

Can you check the log and see if server started from OFFLINE state after re-connecting to Helix? I think it is very possible because when re-connecting to Helix, server will generate a new hash, and start from a new current state. If that is the case, that explains why we observed server not able to consume after re-connecting to Helix. We can probably add a test to reproduce this

ankitsultana commented 6 months ago

I am not well versed with Helix, but I am kinda sure that the server started from OFFLINE state.

There's this log coming from HelixStateMachineEngine: Resetting HelixStateMachineEngine.

This is logged in the reset method, which does this:

    loopStateModelFactories(stateModel -> {
      stateModel.reset();
      String initialState = _stateModelParser.getInitialState(stateModel.getClass());
      stateModel.updateState(initialState);
    });

The initial state is taken from the annotation of the class. Initial state for all Pinot state models is OFFLINE.

Moreover, after this is logged, there's a ton of messages with format: Operating on table_name:segment_name (one for each segment).

@Jackie-Jiang : lmk if this is sufficient.

Jackie-Jiang commented 6 months ago

This is great finding! Let me create an issue to track this