apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Commit Failures with 0.12.0 #10277

Open suddendust opened 1 year ago

suddendust commented 1 year ago

After we upgraded our OSS cluster to 0.12.0, the servers almost immediately started throwing the following errors:

2023/02/13 08:42:56.015 ERROR [LLRealtimeSegmentDataManager_service_call_view__13__536__20230213T0820Z] [service_call_view__13__536__20230213T0820Z] Holding after response from Controller: {"offset":-1,"buildTimeSec":-1,"isSplitCommitType":false,"status":"NOT_SENT","streamPartitionMsgOffset":null}
2023/02/13 08:42:56.028 ERROR [ControllerLeaderLocator] [service_call_view__29__542__20230213T0820Z] The partition size of leadControllerResource is not 24. Actual size: 8
2023/02/13 08:42:56.028 WARN [ServerSegmentCompletionProtocolHandler] [service_call_view__29__542__20230213T0820Z] No leader found while trying to send org.apache.pinot.common.protocols.SegmentCompletionProtocol$SegmentConsumedRequest@1ea7a4d0
2023/02/13 08:42:56.028 ERROR [LLRealtimeSegmentDataManager_service_call_view__29__542__20230213T0820Z] [service_call_view__29__542__20230213T0820Z] Holding after response from Controller: {"offset":-1,"buildTimeSec":-1,"isSplitCommitType":false,"status":"NOT_SENT","streamPartitionMsgOffset":null}
2023/02/13 08:42:56.075 ERROR [ControllerLeaderLocator] [service_call_view__37__534__20230213T0820Z] The partition size of leadControllerResource is not 24. Actual size: 8
2023/02/13 08:42:56.075 WARN [ServerSegmentCompletionProtocolHandler] [service_call_view__37__534__20230213T0820Z] No leader found while trying to send org.apache.pinot.common.protocols.SegmentCompletionProtocol$SegmentConsumedRequest@3242f0b7
2023/02/13 08:42:56.075 ERROR [LLRealtimeSegmentDataManager_service_call_view__37__534__20230213T0820Z] [service_call_view__37__534__20230213T0820Z] Holding after response from Controller: {"offset":-1,"buildTimeSec":-1,"isSplitCommitType":false,"status":"NOT_SENT","streamPartitionMsgOffset":null}
2023/02/13 08:42:56.448 ERROR [ControllerLeaderLocator] [service_call_view__5__531__20230213T0820Z] The partition size of leadControllerResource is not 24. Actual size: 8
2023/02/13 08:42:56.448 WARN [ServerSegmentCompletionProtocolHandler] [service_call_view__5__531__20230213T0820Z] No leader found while trying to send org.apache.pinot.common.protocols.SegmentCompletionProtocol$SegmentConsumedRequest@62b70e06
2023/02/13 08:42:56.448 ERROR [LLRealtimeSegmentDataManager_service_call_view__5__531__20230213T0820Z] [service_call_view__5__531__20230213T0820Z] Holding after response from Controller: {"offset":-1,"buildTimeSec":-1,"isSplitCommitType":false,"status":"NOT_SENT","streamPartitionMsgOffset":null}
2023/02/13 08:42:56.472 ERROR [ControllerLeaderLocator] [service_call_view__33__532__20230213T0820Z] The partition size of leadControllerResource is not 24. Actual size: 8

and

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method) ~[?:?]
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:115) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:168) ~[?:?]
    at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
    at org.apache.pinot.shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.common.utils.http.HttpClient.sendRequest(HttpClient.java:276) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.common.utils.FileUploadDownloadClient.sendSegmentCompletionProtocolRequest(FileUploadDownloadClient.java:1040) ~[pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.server.realtime.ServerSegmentCompletionProtocolHandler.sendRequest(ServerSegmentCompletionProtocolHandler.java:217) [pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.server.realtime.ServerSegmentCompletionProtocolHandler.segmentConsumed(ServerSegmentCompletionProtocolHandler.java:184) [pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.postSegmentConsumedMsg(LLRealtimeSegmentDataManager.java:1110) [pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:650) [pinot-all-0.12.0-jar-with-dependencies.jar:0.12.0-118f5e065cb258c171d97a586183759fbc61e2bf]
    at java.lang.Thread.run(Thread.java:829) [?:?]
2023/02/13 08:54:15.150 INFO [ControllerLeaderLocator] [span_event_view_1__72__631__20230213T0822Z] Millis since last controller cache value invalidate 27643 is less than allowed frequency 30000. Skipping invalidate.
2023/02/13 08:54:15.150 ERROR [LLRealtimeSegmentDataManager_span_event_view_1__72__631__20230213T0822Z] [span_event_view_1__72__631__20230213T0822Z] Holding after response from Controller: {"offset":-1,"buildTimeSec":-1,"isSplitCommitType":false,"status":"NOT_SENT","streamPartitionMsgOffset":null}

Reverting to 0.11.0 fixes the problem.

Jackie-Jiang commented 1 year ago

The error indicates that the controller leader election is failing. That is managed by Helix, but there is no Helix upgrade in 0.12.0. Can you try upgrading the controllers again and check the leadControllerResource external view to see whether it has 24 partitions? If not, try restarting the controller.
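
For anyone running the same check, here is a minimal sketch of how the partition count could be verified once the external view has been exported as JSON (for example, copied from a ZooKeeper browser). It assumes the record uses the usual Helix layout, with one `mapFields` entry per partition mapping instance names to states, and that the leading controller of a partition is in the `MASTER` state. The function name, the sample instance name, and the sample data are hypothetical and are not taken from this cluster.

```python
import json

EXPECTED_PARTITIONS = 24  # the count the server log says it expects


def check_lead_controller_view(external_view_json, expected=EXPECTED_PARTITIONS):
    """Return (ok, partition_count, partitions_without_master)."""
    record = json.loads(external_view_json)
    # Assumption: partition -> {instance: state} lives under "mapFields".
    partitions = record.get("mapFields", {})
    # Assumption: a partition has a leader only if some instance is MASTER.
    no_master = sorted(
        name for name, states in partitions.items()
        if "MASTER" not in states.values()
    )
    ok = len(partitions) == expected and not no_master
    return ok, len(partitions), no_master


if __name__ == "__main__":
    # Hypothetical external view with only 8 partitions, mirroring the
    # "Actual size: 8" message in the log above.
    sample = json.dumps({
        "id": "leadControllerResource",
        "simpleFields": {},
        "listFields": {},
        "mapFields": {
            f"leadControllerResource_{i}": {"Controller_host-1_9000": "MASTER"}
            for i in range(8)
        },
    })
    ok, count, no_master = check_lead_controller_view(sample)
    print(f"ok={ok} partitions={count} without_master={no_master}")
```

If the count is below 24, or some partitions have no `MASTER`, that would be consistent with the "No leader found" warnings on the servers.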