apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

bookie failed to decommission successfully #15776

Open yebai1105 opened 2 years ago

yebai1105 commented 2 years ago

Describe the bug I have four bookies. One bookie's process hung because its disk was full, so I tried to clear its data directory and take the service offline. The number of data copies is three, and the cluster works normally while that bookie is down.

To Reproduce Steps to reproduce the behavior:

  1. Delete the data under the journalDirectories and ledgerDirectories directories
  2. Execute the command: bin/bookkeeper shell listunderreplicated
    2022-05-25 16:40:16.0906 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.101.129.65:2181,10.101.129.68:2181,10.101.129.70:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@6c372fe6
    2022-05-25 16:40:16.0911 [main] INFO  org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
    2022-05-25 16:40:16.0917 [main] INFO  org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes
    2022-05-25 16:40:16.0923 [main] INFO  org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
    2022-05-25 16:40:16.0932 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181.
    2022-05-25 16:40:16.0933 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
    2022-05-25 16:40:16.0937 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.101.129.65:35556, server: kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181
    2022-05-25 16:40:16.0941 [main-SendThread(10.101.129.70:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server kafka-pool-prd-10-101-129-70.v-sz-1.vivo.lan/10.101.129.70:2181, session id = 0x3063889d93b4263, negotiated timeout = 30000
    2022-05-25 16:40:16.0944 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
    2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
    2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1651199961381
    2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
    2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1650363984734
    2022-05-25 16:40:17.0037 [main-SendThread(10.101.129.70:2181)] WARN  org.apache.zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x3063889d93b4263.
    org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x3063889d93b4263, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
    2022-05-25 16:40:17.0143 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x3063889d93b4263 closed
    2022-05-25 16:40:17.0143 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3063889d93b4263
  3. Execute the command: bin/bookkeeper shell decommissionbookie
    2022-05-25 16:04:09.0574 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.101.129.65:2181,10.101.129.68:2181,10.101.129.70:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@69f1a286
    2022-05-25 16:04:09.0579 [main] INFO  org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
    2022-05-25 16:04:09.0585 [main] INFO  org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes
    2022-05-25 16:04:09.0592 [main] INFO  org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false
    2022-05-25 16:04:09.0601 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181.
    2022-05-25 16:04:09.0601 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
    2022-05-25 16:04:09.0605 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.101.129.65:55334, server: kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181
    2022-05-25 16:04:09.0609 [main-SendThread(10.101.129.68:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server kafka-pool-prd-10-101-129-68.v-sz-1.vivo.lan/10.101.129.68:2181, session id = 0x10130f8da34426c, negotiated timeout = 30000
    2022-05-25 16:04:09.0612 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now.
    2022-05-25 16:04:09.0789 [main] ERROR org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to initialize DNS Resolver org.apache.bookkeeper.net.ScriptBasedMapping, used default subnet resolver 
    java.lang.RuntimeException: No network topology script is found when using script based DNS resolver.
        at org.apache.bookkeeper.net.ScriptBasedMapping$RawScriptBasedMapping.validateConf(ScriptBasedMapping.java:163) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.net.AbstractDNSToSwitchMapping.setConf(AbstractDNSToSwitchMapping.java:81) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.net.ScriptBasedMapping.setConf(ScriptBasedMapping.java:123) ~[bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.initialize(RackawareEnsemblePlacementPolicyImpl.java:265) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.initialize(RackawareEnsemblePlacementPolicyImpl.java:80) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.initializeEnsemblePlacementPolicy(BookKeeper.java:581) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.<init>(BookKeeper.java:505) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeper.<init>(BookKeeper.java:344) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.client.BookKeeperAdmin.<init>(BookKeeperAdmin.java:164) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.tools.cli.commands.bookies.DecommissionCommand.decommission(DecommissionCommand.java:91) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.tools.cli.commands.bookies.DecommissionCommand.apply(DecommissionCommand.java:82) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell$DecommissionBookieCmd.runCmd(BookieShell.java:1956) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell$MyCommand.runCmd(BookieShell.java:238) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell.run(BookieShell.java:2278) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
        at org.apache.bookkeeper.bookie.BookieShell.main(BookieShell.java:2369) [bookkeeper-server-4.14.4.1-SNAPSHOT.jar:4.14.4.1-SNAPSHOT]
    2022-05-25 16:04:09.0818 [main] INFO  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Initialize rackaware ensemble placement policy @ <Bookie:10.101.129.65:0> @ /default-rack : org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy$DefaultResolver.
    2022-05-25 16:04:09.0818 [main] INFO  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Not weighted
    2022-05-25 16:04:09.0827 [main] INFO  org.apache.bookkeeper.client.BookKeeper - Weighted ledger placement is not enabled
    2022-05-25 16:04:09.0903 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.68:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.68, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0903 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.65:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.65, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0904 [main-EventThread] INFO  org.apache.bookkeeper.discover.ZKRegistrationClient - Update BookieInfoCache (writable bookie) 10.101.129.70:3181 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=httpserver, port=8070, host=0.0.0.0, protocol=http, auth=[], extensions=[]}, EndpointInfo{id=bookie, port=3181, host=10.101.129.70, protocol=bookie-rpc, auth=[], extensions=[]}]}
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.68:3181
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.65:3181
    2022-05-25 16:04:09.0908 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-rack/10.101.129.70:3181
    2022-05-25 16:04:09.0945 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
    2022-05-25 16:05:04.0446 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 10
    2022-05-25 16:06:44.0455 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:06:54.0458 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:04.0460 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:14.0462 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:24.0464 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:34.0465 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:44.0467 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:07:54.0471 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:04.0473 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:14.0474 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:24.0476 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:34.0478 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:44.0480 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:08:54.0482 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:04.0485 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:14.0487 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:24.0489 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:34.0490 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:44.0492 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:09:54.0494 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:04.0496 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:14.0498 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1
    2022-05-25 16:10:24.0500 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 1

    It continuously printed "Count of Ledgers which need to be rereplicated: 1" for a whole day without finishing.
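From the log, the decommission command appears to simply poll the count of under-replicated ledgers until it reaches zero, so a ledger that can never be re-replicated keeps it looping forever. A minimal sketch of that wait loop — the function name and the `timeout_secs` safeguard are hypothetical and not part of the real tool:

```python
import time

def wait_for_rereplication(count_underreplicated, poll_secs=10, timeout_secs=300):
    """Poll an under-replicated-ledger counter until it reaches zero,
    mimicking the wait loop visible in the log above. The timeout is a
    hypothetical safeguard; the real shell command has none, which is
    why it can print the same count indefinitely."""
    deadline = time.time() + timeout_secs
    while (n := count_underreplicated()) > 0:
        print(f"Count of Ledgers which need to be rereplicated: {n}")
        if time.time() >= deadline:
            raise TimeoutError(f"{n} ledger(s) still under-replicated")
        time.sleep(poll_secs)
```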

System configuration

Pulsar version: 2.9.2, BookKeeper version: 4.14.4

lgxbslgx commented 2 years ago

I can reproduce this bug locally by using the following steps:

  1. Deploy a cluster according to the documentation: 3 ZooKeeper nodes, 3 BookKeeper nodes and 3 brokers.
  2. Use the command bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3 to test the bookkeeper. Note: this step is important.
  3. Produce and consume several times.
  4. Delete the journalDirectories and ledgerDirectories directories of one bookie, named BK1. (Same as @yebai1105's step 1.)
  5. Shut down the bookie BK1.
  6. Use the command bin/bookkeeper shell listunderreplicated on bookie node BK1. (Same as @yebai1105's step 2, though @yebai1105 didn't indicate which bookie node the command was run on.)
  7. Use the command bin/bookkeeper shell decommissionbookie on bookie node BK1. (Same as @yebai1105's step 3, though @yebai1105 didn't indicate which bookie node the command was run on.)

Then the same error message occurs. It happens because the command bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3 creates a ledger whose ensemble size equals both the write quorum size and the total number of bookies (also 3). That ledger cannot be re-replicated until a new bookie node is added.
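The explanation above boils down to a feasibility condition: auto-recovery needs as many live bookies as the ledger's ensemble size to place a replacement fragment. A sketch (the function name is illustrative):

```python
def can_rereplicate(ensemble_size, total_bookies, failed_bookies=1):
    """A replacement ensemble needs `ensemble_size` distinct live bookies;
    with fewer survivors the ledger stays under-replicated forever."""
    return total_bookies - failed_bookies >= ensemble_size
```

For the simpletest ledger (ensemble 3 in a 3-bookie cluster) this is false as soon as one bookie is lost, while an ensemble-3 ledger in a 4-bookie cluster remains recoverable.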

Now I need to confirm with @yebai1105: did you use a similar command, such as bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4, to test the cluster when you deployed it?

If you don't remember whether you ran this test when you deployed your cluster, you can use the following commands to get the bookies of a ledger (such as 396606, which your log shows) that is under-replicated.

2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1651199961381
2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1650363984734

// open the zookeeper shell
$ bin/pulsar zookeeper-shell -timeout 5000 -server <zk-ip/zk-domain>:<zk-port>

// get the ledger 396606 which is under replicated
$ get /ledger/00/0039/L6606

// another example 112963
$ get /ledger/00/0011/L2963

You can count the number of bookies in that ledger's ensemble. If the number is 4 in your cluster, my assumption is right.
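For reference, the paths queried above follow BookKeeper's hierarchical ledger-manager layout: the ledger id is zero-padded to 10 decimal digits and split 2-4-4, with an `L` prefix on the leaf. A sketch under that assumption (the root, `/ledger` in this thread, depends on the deployment):

```python
def hierarchical_ledger_znode(ledger_id, root="/ledger"):
    """Map a ledger id to its metadata znode path, assuming the
    hierarchical ledger manager's 2-4-4 split of the 10-digit id."""
    s = f"{ledger_id:010d}"          # e.g. 396606 -> "0000396606"
    return f"{root}/{s[:2]}/{s[2:6]}/L{s[6:]}"
```

This reproduces the two paths above: 396606 maps to /ledger/00/0039/L6606 and 112963 to /ledger/00/0011/L2963.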

yebai1105 commented 2 years ago

Below are answers to some of your questions, and our findings.

1. Your questions:

1.1 We never used this command: bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4
1.2 The commands bin/bookkeeper shell listunderreplicated and bin/bookkeeper shell decommissionbookie were executed on the faulty machine 10.101.129.75.
1.3 All namespace persistence policies in our cluster are as follows:

{
  "bookkeeperEnsemble" : 3,
  "bookkeeperWriteQuorum" : 3,
  "bookkeeperAckQuorum" : 2,
  "managedLedgerMaxMarkDeleteRate" : 0.0
}

2. Our findings:

We have four bookies. The process on bookie 10.101.129.75 went down because its disk was full, and two other bookies were also down, leaving only one surviving bookie. I don't know why those three bookies went down, because I can't find the logs from that period. I suspect that the prolonged downtime of the three bookies caused data loss, which is why the node cannot be decommissioned. Before testing, we brought every machine back up except 10.101.129.75, then prepared to delete its directories and decommission it. Below are my tests:

2.1 I listed the ledgers with missing copies and the corresponding machines, and found that ledger 396606 was missing on all 4 machines. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
2.2 I tried to read the ledger with the command bin/bookkeeper shell readledger -ledgerid 396606 and hit an error when reading entry 867; the read was actually sending a request to the faulty machine 10.101.129.75. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
2.3 I used the command bin/bookkeeper shell ledgermetadata -ledgerid 396606 to view the metadata of ledger 396606 and found that the replicas starting from entry 845 are allocated on machine 10.101.129.75. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg

When we debugged the node decommission, we found that the source code reads the ledger that lacks a copy, and the 'no entry' error is reported here: https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg

Some doubts: I think the data loss was caused by the three bookie machines going down within a short time while our cluster parameter journalWriteData is set to false (we don't want to enable journal write-ahead; we can tolerate some data loss). But I have some doubts: why does losing data cause a problem so big that the machine cannot be decommissioned? In our later tests we even found that a lost ledger can cause producers to fail to send data. Maybe the data-loss case should be considered here, with countermeasures. @lgxbslgx
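For context, the trade-off described above is the journal setting in bookkeeper.conf; a sketch of the relevant line, assuming the BookKeeper 4.14 option name mentioned in this thread:

```properties
# bookkeeper.conf (sketch; verify the option name against your BookKeeper version)
# When false, entry payloads are not written to the journal: writes are faster,
# but recently acknowledged entries can be lost if several bookies crash at
# about the same time, which matches the data-loss scenario described here.
journalWriteData=false
```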

yebai1105 commented 2 years ago

In addition, we would also like to know: in a scenario where ledger data has already been lost, can the parameter journalWriteData safely be set to false? @lgxbslgx

lgxbslgx commented 2 years ago

@yebai1105 I read the log you provided and have no idea for now. I agree with your opinion that such a situation should be recoverable by Pulsar. Maybe we need other, more experienced developers to fix it.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

vitalii-buchyn-exa commented 1 year ago

Hello guys,

We're having a similar issue.

We've lost several bookies and need to clean up everything connected with them.

We've tried bookkeeper shell decommissionbookie, but it runs indefinitely with a message like "Count of Ledgers which need to be rereplicated: 16". We've tried cleaning up /ledgers/cookies in ZooKeeper and restarted brokers, bookies, and ZooKeeper, but we still see connection errors in the bookie logs, like:

2023-09-20 11:42:55,992 - ERROR - [BookKeeperClientScheduler-OrderedScheduler-0-0:PerChannelBookieClient@534] - Cannot connect to pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181, bookie does not exist or it is not running

We can see those 16 under-replicated ledgers, like:

[zk: localhost:2181(CONNECTED) 13] ls /ledgers/underreplication/ledgers/0000/0000
[000f, 0013, 0015, 0016, 0019, 001a, 001b, 001f, 0020, 0025, 0028, 002c, 002f, 0035, 0036, 0039]
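If this tree follows BookKeeper's ZooKeeper under-replication manager layout, each 4-character segment is 16 bits of the 64-bit ledger id in hex, so every child znode covers a contiguous range of ledger ids. A sketch under that assumption (the function is illustrative):

```python
def ledger_id_range(*hex_segments):
    """Return the inclusive (lo, hi) range of 64-bit ledger ids covered by
    the given leading 16-bit hex path segments under
    /ledgers/underreplication/ledgers (an assumed layout)."""
    bits_left = 64 - 16 * len(hex_segments)
    prefix = 0
    for seg in hex_segments:
        prefix = (prefix << 16) | int(seg, 16)
    lo = prefix << bits_left
    return lo, lo + (1 << bits_left) - 1
```

For example, under that assumption the child 000f above would cover ledger ids 983040 through 1048575.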

How can we safely clean up?

UPDATE: we ended up deleting the orphaned ledgers; example command:

for id in $(/opt/bookkeeper/bin/bookkeeper shell listledgers -m \
      -bookieid pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 \
      | grep ledgerID | awk '{ print $4 }'); do
  /opt/bookkeeper/bin/bookkeeper shell deleteledger -l $id -f
done
truong-hua commented 11 months ago

I have the same problem and my bookie is stuck waiting for replication. We have an ack quorum of 2 and a write quorum and ensemble size of 3. We have 5 bookies, and there were no problems with the servers from startup until I ran the first decommission command.