apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Brokers getting into stuck state when interrupted during OFFLINE -> ONLINE state transition #7976

Open dang-stripe opened 2 years ago

dang-stripe commented 2 years ago

We've noticed that brokers get stuck when they receive a SIGTERM while the broker resource is transitioning from OFFLINE to ONLINE. The broker then stays stuck indefinitely and ignores subsequent SIGTERMs, so we have to kill the process with SIGKILL to recover it. Will Pinot/Helix retry state transitions on errors like this?

Here's a log we found before this happened:

2022/01/06 00:15:47.311 ERROR [BrokerResourceOnlineOfflineStateModelFactory] [HelixTaskExecutor-message_handle_thread] Caught exception while processing transition from OFFLINE to ONLINE for table: test_table_REALTIME
org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
        at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1202) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1328) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:320) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.manager.zk.ZkCacheBaseDataAccessor.get(ZkCacheBaseDataAccessor.java:390) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.helix.store.zk.AutoFallbackPropertyStore.get(AutoFallbackPropertyStore.java:101) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.pinot.common.metadata.ZKMetadataProvider.getTableConfig(ZKMetadataProvider.java:184) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.pinot.broker.routing.RoutingManager.buildRouting(RoutingManager.java:296) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
        at org.apache.pinot.broker.broker.helix.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80) [pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method) ~[?:?]
    at java.lang.Object.wait(Object.java:328) ~[?:?]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2129) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2160) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkConnection.readData(ZkConnection.java:136) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1340) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient$10.call(ZkClient.java:1336) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1190) ~[pinot-all-0.9.0-2021-12-23-b649cf300-SNAPSHOT-jar-with-dependencies.jar:0.9.0-2021-12-23-b649cf300-SNAPSHOT-b649cf300ca6ddbff6ddeadb9d4dd97429fac014]
    ... 20 more

Jackie-Jiang commented 2 years ago

Can you please take a thread dump after sending the SIGTERM and see which thread is preventing the broker from shutting down? I suspect the interruption is being swallowed by some thread.
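
For readers unfamiliar with what a "swallowed" interruption looks like, here is a minimal, generic Java sketch. It is not taken from the Pinot or Helix code; the class `InterruptDemo` and the methods `blockingRead`, `swallow`, `restore`, and `flagAfterInterrupt` are hypothetical names used only for illustration. It assumes a blocking call that throws `InterruptedException`, and compares a handler that only catches the exception with one that restores the thread's interrupt flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical demo class; not part of Pinot or Helix.
class InterruptDemo {
    // Stand-in for any blocking call that throws when the thread is interrupted.
    static void blockingRead() throws InterruptedException {
        Thread.sleep(10_000);
    }

    // Anti-pattern: the exception is caught and dropped. Throwing
    // InterruptedException clears the interrupt flag, so callers further up
    // the stack can no longer tell that the thread was asked to stop.
    static boolean swallow() {
        try {
            blockingRead();
        } catch (InterruptedException e) {
            // Interruption swallowed here.
        }
        return Thread.currentThread().isInterrupted();
    }

    // Common remedy: restore the flag so callers can still observe it.
    static boolean restore() {
        try {
            blockingRead();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Thread.currentThread().isInterrupted();
    }

    // Runs the task on a new thread, interrupts that thread, and returns the
    // value of the interrupt flag as seen by the task after its handler ran.
    static boolean flagAfterInterrupt(BooleanSupplier task) throws InterruptedException {
        AtomicBoolean flag = new AtomicBoolean();
        Thread worker = new Thread(() -> flag.set(task.getAsBoolean()));
        worker.start();
        worker.interrupt();
        worker.join();
        return flag.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("swallow: flag still set = " + flagAfterInterrupt(InterruptDemo::swallow));
        System.out.println("restore: flag still set = " + flagAfterInterrupt(InterruptDemo::restore));
    }
}
```

If something similar happens on a thread that the shutdown path waits on, that thread may keep running as though no stop had been requested. Whether that is the cause here is still only a suspicion, which a thread dump should help confirm or rule out.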