apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.12k stars 3.57k forks source link

auto recovery error log : Corrupted frame received from bookie, under replicated can not covery #15464

Open kwenzh opened 2 years ago

kwenzh commented 2 years ago

Describe the bug there are a pulsar cluster in k8s

We decided to migrate bookie to a new bookie cluster,
1, we set auto recovery is enabled

  1. deploy a new bookie statefulset normally
  2. Reduced number of instances one by one .
  3. i can see recovery is working

    old bookie cluster has 5 pod (0-4), first i close bookie-4 but in latest , there are some ledgerId are not migrate successful,

bin/bookkeeper shell listunderreplicated


02:27:18.605 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
02:27:18.720 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 90129
02:27:18.721 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748446769
02:27:18.722 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 93326
02:27:18.723 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748452235
02:27:18.724 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 90190
02:27:18.724 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748446993
02:27:18.726 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 93508
02:27:18.726 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748452461
02:27:18.728 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 90955
02:27:18.728 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748448866
02:27:18.730 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 95981
02:27:18.730 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748457162
02:27:18.732 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 92099
02:27:18.732 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748450587
02:27:18.734 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 92461
02:27:18.734 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748451050
02:27:18.735 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 98373
02:27:18.736 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748426002
02:27:18.737 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 98613
02:27:18.737 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -    Ctime : 1651748426602
02:27:18.739 [main-SendThread(pulsar-cluster-zookeeper:2181)] WARN  org.apache.zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x2000018322a00a4.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x2000018322a00a4, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
02:27:18.854 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x2000018322a00a4
02:27:18.854 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x2000018322a00a4 closed

auto recovery log:

02:37:31.367 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:37:31.369 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:38:47.113 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:38:47.115 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:10.731 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:10.734 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:10.736 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0xb94a56fe, L:/10.244.81.13:45666 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.736 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xb94a56fe, L:/10.244.81.13:45666 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.737 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x0030dc6a, L:/10.244.81.13:44790 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.737 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x0030dc6a, L:/10.244.81.13:44790 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.747 [bookkeeper-io-3-4] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181
02:39:10.747 [bookkeeper-io-3-1] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181
02:39:10.747 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0xb94a56fe, L:/10.244.81.13:45666 ! R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.747 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x0030dc6a, L:/10.244.81.13:44790 ! R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.748 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0xd8237aad, L:/10.244.81.13:45670 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.748 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x71fcf2a7, L:/10.244.81.13:44794 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.748 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xd8237aad, L:/10.244.81.13:45670 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.748 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x71fcf2a7, L:/10.244.81.13:44794 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.763 [bookkeeper-io-3-2] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181
02:39:10.763 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0xd8237aad, L:/10.244.81.13:45670 ! R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.764 [bookkeeper-io-3-3] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181
02:39:10.764 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x71fcf2a7, L:/10.244.81.13:44794 ! R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.765 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x7021b4ff, L:/10.244.81.13:45674 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.765 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x7021b4ff, L:/10.244.81.13:45674 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.773 [bookkeeper-io-3-4] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181
02:39:10.773 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x7021b4ff, L:/10.244.81.13:45674 ! R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.773 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L90190 E0 from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181
02:39:10.774 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x0c5a059e, L:/10.244.81.13:44798 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.774 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x0c5a059e, L:/10.244.81.13:44798 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.783 [bookkeeper-io-3-1] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181
02:39:10.783 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x0c5a059e, L:/10.244.81.13:44798 ! R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.784 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L90190 E0 from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181
02:39:10.784 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L90190 E0-E0, Sent to [pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Bookie handle is not available'. First unread entry is (-1, rc = null)
02:39:10.784 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -8 while trying to read entry: 0 of ledger: 90190 in ReplicationWorker
02:39:10.784 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 90190, FirstEntryID: 0[0], LastKnownEntryID: 0[0], Host: [pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181], Closed: true)
02:39:10.784 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0xacb8bca9, L:/10.244.81.13:45680 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.784 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xacb8bca9, L:/10.244.81.13:45680 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.784 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x42c42991, L:/10.244.81.13:44804 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.784 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x42c42991, L:/10.244.81.13:44804 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.793 [bookkeeper-io-3-2] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181
02:39:10.793 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0xacb8bca9, L:/10.244.81.13:45680 ! R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.799 [bookkeeper-io-3-3] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181
02:39:10.799 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x42c42991, L:/10.244.81.13:44804 ! R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.800 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0xda6ef22c, L:/10.244.81.13:45684 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.800 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xda6ef22c, L:/10.244.81.13:45684 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.800 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x7eb38ed7, L:/10.244.81.13:44808 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.800 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x7eb38ed7, L:/10.244.81.13:44808 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:10.808 [bookkeeper-io-3-1] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181
02:39:10.808 [bookkeeper-io-3-1] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0x7eb38ed7, L:/10.244.81.13:44808 ! R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:10.850 [bookkeeper-io-3-4] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181
02:39:10.850 [bookkeeper-io-3-4] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0xda6ef22c, L:/10.244.81.13:45684 ! R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:10.850 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 90190 for 32 number of times, so deferring the ledger lock release by 300000 msecs
02:39:14.886 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:14.888 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:30.519 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:30.522 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
02:39:30.524 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0xa9665a84, L:/10.244.81.13:45976 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181]
02:39:30.524 [bookkeeper-io-3-2] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xa9665a84, L:/10.244.81.13:45976 - R:pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.74.151:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:30.524 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 [id: 0x0628fbc9, L:/10.244.81.13:45100 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181]
02:39:30.524 [bookkeeper-io-3-3] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0x0628fbc9, L:/10.244.81.13:45100 - R:pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local/10.244.178.149:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
02:39:30.527 [ReplicationWorker] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
02:39:30.527 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, bookie does not exist or it is not running
02:39:30.527 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Could not connect to bookie: null/pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, current state CONNECTING :
org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, bookie does not exist or it is not running
        at org.apache.bookkeeper.client.DefaultBookieAddressResolver.resolve(DefaultBookieAddressResolver.java:63) ~[org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.PerChannelBookieClient.connect(PerChannelBookieClient.java:531) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.PerChannelBookieClient.connectIfNeededAndDoOp(PerChannelBookieClient.java:657) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:121) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:116) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.BookieClientImpl.readEntry(BookieClientImpl.java:509) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.BookieClientImpl.readEntry(BookieClientImpl.java:495) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.proto.BookieClientImpl.readEntry(BookieClientImpl.java:489) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.PendingReadOp.sendReadTo(PendingReadOp.java:576) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.sendNextRead(PendingReadOp.java:405) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.read(PendingReadOp.java:386) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.PendingReadOp.initiate(PendingReadOp.java:530) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.PendingReadOp.safeRun(PendingReadOp.java:540) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.LedgerHandle.readEntriesInternalAsync(LedgerHandle.java:896) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.LedgerHandle.asyncReadEntriesInternal(LedgerHandle.java:800) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.LedgerHandle.asyncReadEntries(LedgerHandle.java:694) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.replication.ReplicationWorker.tryReadingFaultyEntries(ReplicationWorker.java:326) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:378) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:277) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) [org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
        at org.apache.bookkeeper.discover.ZKRegistrationClient.getBookieServiceInfo(ZKRegistrationClient.java:248) ~[org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        at org.apache.bookkeeper.client.DefaultBookieAddressResolver.resolve(DefaultBookieAddressResolver.java:43) ~[org.apache.bookkeeper-bookkeeper-server-4.14.2.jar:4.14.2]
        ... 22 more
02:39:30.527 [BookKeeperClientWorker-OrderedExecutor-1-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L92461 E2060 from bookie: pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181
02:39:30.528 [ReplicationWorker] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available

i notice the main error log:
ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Corrupted frame received from bookie: ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L95981 E210-E210, Sent to [pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Bookie handle is not available'. First unread entry is (-1, rc = null)

the ledger id metadata

2:49:10.484 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 95981
02:49:10.490 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=2, ackQuorumSize=2, state=CLOSED, length=67946172, lastEntryId=2037, digestType=CRC32C, password=base64:, ensembles={0=[pulsar-cluster-bookie-3.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, pulsar-cluster-bookie-2.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181, pulsar-cluster-bookie-4.pulsar-cluster-bookie.pulsar.svc.cluster.local:3181]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:Y29sbGVjdC9kZWZhdWx0L3BlcnNpc3RlbnQvaW9hX2xvZy1wYXJ0aXRpb24tMg==, application=base64:cHVsc2Fy}}

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior recovery log is normally

Screenshots

image

image

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

ByrsH commented 2 years ago

I had a similar problem,broker log :

[BookKeeperClientWorker-OrderedExecutor-1-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L173395 E0-E0, Sent to [pulsar-iot-bookie-1.pulsar-iot-bookie.pulsar-prod.svc.cluster.local:3181, pulsar-iot-bookie-0.pulsar-iot-bookie.pulsar-prod.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc = null)

Describe : there are a pulsar cluster in k8s,pulsar version 2.8.2。The purpose is replace new disk.

operating steps: 1、run bin/bookkeeper shell bookieformat -deleteCookie -f 2、restart pod 3、wait bookie auto sync data to new disk

business service run normal, but when restart business service, Throws the following exception:

Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [so.dian.clover.pulsar.PulsarProducer]: Factory method 'createPulsarProducer' threw exception; nested exception is org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: No such ledger exists on Bookies -  ledger=173395 - operation=Failed to read entry - entry=0 caused by org.apache.pulsar.broker.service.schema.exceptions.SchemaException: No such ledger exists on Bookies -  ledger=173395 - operation=Failed to read entry - entry=0
        at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:185)
        at org.springframework.beans.factory.support.ConstructorResolver.instantiate(ConstructorResolver.java:651)
        ... 41 common frames omitted
github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.