yebai1105 opened this issue 2 years ago
I can reproduce this bug locally by using the following steps:
1. Run `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` at bookie node BK1 to test the bookkeeper. Note: this step is important. (Same as @yebai1105's step 1.)
2. Run `bin/bookkeeper shell listunderreplicated` at bookie node BK1. (Same as @yebai1105's step 2, but @yebai1105 didn't indicate on which bookie node to run this command.)
3. Run `bin/bookkeeper shell decommissionbookie` at bookie node BK1. (Same as @yebai1105's step 3, but @yebai1105 didn't indicate on which bookie node to run this command.)

Then the same error message occurs. This happens because the `simpletest` command above creates a ledger whose ensemble size is equal to the write quorum size and to the total number of bookies (also 3), so this ledger can't be re-replicated until another new bookie node is added.
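As a side note, a hedged sketch: if you just want to exercise the cluster without pinning a test ledger to every bookie, keeping the ensemble below the total bookie count leaves room for re-replication, so `decommissionbookie` can finish:

```bash
# Sketch (assuming a 3-bookie cluster): with --ensemble 2 the test
# ledger can be re-replicated onto the remaining bookie after one
# bookie is decommissioned.
bin/bookkeeper shell simpletest --ensemble 2 --writeQuorum 2 --ackQuorum 2 --numEntries 3
```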
Now I need to confirm with @yebai1105: did you use a similar command, like `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`, to test the cluster when you deployed it?
If you don't remember whether you ran such a test when you deployed your cluster, you can use the following commands to get the nodes of an under-replicated ledger (such as 396606, which your log shows):
2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1651199961381
2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1650363984734
# open the zookeeper shell
$ bin/pulsar zookeeper-shell -timeout 5000 -server <zk-ip/zk-domain>:<zk-port>
# inside the zookeeper shell: get the metadata of ledger 396606, which is under-replicated
get /ledger/00/0039/L6606
# another example, ledger 112963
get /ledger/00/0011/L2963
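For reference, the znode paths above can be derived from the ledger ID: the hierarchical ledger manager zero-pads the ID to 10 digits and splits it 2/4/4, prefixing the last segment with `L`. A hedged sketch (assuming the `/ledger` root path used in the commands above; many deployments use `/ledgers`):

```bash
# Sketch: derive the metadata znode path from a ledger ID, assuming
# the hierarchical ledger manager layout and the /ledger root path
# used above.
ledger_znode() {
  local padded
  padded=$(printf '%010d' "$1")   # zero-pad the ledger ID to 10 digits
  echo "/ledger/${padded:0:2}/${padded:2:4}/L${padded:6:4}"
}

ledger_znode 396606   # -> /ledger/00/0039/L6606
ledger_znode 112963   # -> /ledger/00/0011/L2963
```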
You can count the number of bookie nodes in that ledger's ensemble. If the count is 4 in your cluster (the total number of your bookies), it means my assumption is right.
Below are answers to some of your questions and our findings:

1. Your questions:

1.1 We never used this command: `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`
1.2 The commands `bin/bookkeeper shell listunderreplicated` and `bin/bookkeeper shell decommissionbookie` were executed on the faulty machine 10.101.129.75.
1.3 All namespace persistence policies in our cluster are as follows:
{
"bookkeeperEnsemble" : 3,
"bookkeeperWriteQuorum" : 3,
"bookkeeperAckQuorum" : 2,
"managedLedgerMaxMarkDeleteRate" : 0.0
}
2. Our findings:

We have four bookies. One bookie, 10.101.129.75, went down because its disk was full, and two other bookies also went down, leaving only one surviving bookie. I don't know why those three bookies went down, because I can't find their earlier logs. I suspect that the prolonged downtime of the three bookies caused data loss, so that the node cannot be decommissioned. Before testing, we brought back up all machines except 10.101.129.75, and prepared to delete its directory and decommission 10.101.129.75. Below is my test:

2.1 I listed the ledgers with missing copies and the corresponding machines, and found that ledger 396606 was missing on all 4 machines. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
2.2 I tried to use the command `bin/bookkeeper shell readledger -ledgerid 396606` to read the ledger, and got an error when reading entry 867; it was actually sending a request to the faulty machine 10.101.129.75. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
2.3 I used the command `bin/bookkeeper shell ledgermetadata -ledgerid 396606` to view the metadata of ledger 396606 and found that the replicas starting from entry 845 are allocated on the machine 10.101.129.75. Log: https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg

When we debugged the node decommission, we found that the source code reads the ledger that lacks a copy, and the 'no entry' error is reported there: https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg
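As a quick way to repeat the check in finding 2.3 on other ledgers, one can search the ledger metadata output for the faulty bookie's address; a hedged sketch:

```bash
# Sketch: check whether the faulty bookie still appears in the
# ensembles of an under-replicated ledger (as in finding 2.3).
bin/bookkeeper shell ledgermetadata -ledgerid 396606 | grep 10.101.129.75
```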
Some doubts: I think the data loss was caused by the three bookie machines going down within a short time while our cluster parameter journalWriteData is set to false (we don't want to enable journal write-ahead; we can tolerate some data loss). But I have some doubts: why does the loss of data cause such a big problem that the machine cannot be decommissioned? In our later tests we even found that the loss of a ledger can cause the producer to fail to send data. Maybe the data-loss situation should be considered here, with countermeasures. @lgxbslgx
In addition, we would also like to know: to deal with the scenario where ledger data has already been lost, can the parameter journalWriteData still be set to false? @lgxbslgx
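For anyone weighing the same trade-off, the setting lives in the bookie configuration; a hedged sketch for verifying it on each bookie host (conf path assumed):

```bash
# Sketch: verify the journal setting on a bookie host. With
# journalWriteData=false, entry payloads skip the journal, so
# crashing all write-quorum bookies before a flush can lose
# acknowledged entries -- consistent with the symptoms above.
grep -E '^journalWriteData' conf/bookkeeper.conf
```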
@yebai1105 I read the log you provided and have no idea now. And I agree with your opinion that such a situation should be recoverable by Pulsar. Maybe we need other, more experienced developers to fix it.
Hello guys,
Having a similar issue.
We've lost several bookies and need to clean up everything connected with them.
We've tried `bookkeeper shell decommissionbookie`, but it runs indefinitely with a message like `Count of Ledgers which need to be rereplicated: 16`.
We've tried to clean up /ledgers/cookies in zookeeper and restarted brokers, bookies, and zookeeper, but we still see connection errors in the bookies' logs like:
2023-09-20 11:42:55,992 - ERROR - [BookKeeperClientScheduler-OrderedScheduler-0-0:PerChannelBookieClient@534] - Cannot connect to pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181, bookie does not exist or it is not running
We can see those 16 underreplicated ledgers, like:
[zk: localhost:2181(CONNECTED) 13] ls /ledgers/underreplication/ledgers/0000/0000
[000f, 0013, 0015, 0016, 0019, 001a, 001b, 001f, 0020, 0025, 0028, 002c, 002f, 0035, 0036, 0039]
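If it helps, those znode names appear to be segments of the ledger ID: the under-replication manager encodes the 64-bit ID as four 16-bit hex path segments, so a leaf such as 0013 under 0000/0000/000f would map back to a numeric ID. A hedged sketch of the conversion (the layout and the leaf names are assumptions):

```bash
# Sketch: map an under-replication znode path back to a ledger ID,
# assuming the four-level hex layout where 0000/0000/000f/0013
# encodes the 64-bit ID 0x000f0013 (hypothetical leaf).
printf 'ledger ID: %d\n' $(( 16#000f0013 ))   # -> 983059
```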
How can we safely clean up?
UPDATE: we ended up deleting the orphaned ledgers; example command:
for id in $(/opt/bookkeeper/bin/bookkeeper shell listledgers -m -bookieid pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 | grep ledgerID | awk '{ print $4 }'); do /opt/bookkeeper/bin/bookkeeper shell deleteledger -l $id -f; done
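A cautionary note on the loop above: `deleteledger -f` is irreversible, so it may be worth printing the candidate IDs first. A hedged dry-run sketch:

```bash
# Sketch: dry run of the deletion loop -- print the candidate ledger
# IDs for review instead of deleting them.
/opt/bookkeeper/bin/bookkeeper shell listledgers -m \
  -bookieid pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 \
  | grep ledgerID | awk '{ print $4 }'
```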
I have the same problem and my bookie is stuck waiting for replication. We have an ack quorum of 2, and a write quorum and ensemble size of 3. We have 5 bookies, and there was no problem with the servers from startup until I ran the first decommission command.
Describe the bug
I have four bookies. One of the bookie processes hangs because the disk is full, so I try to clear the data directory and take the service offline. The number of data copies is three, and the cluster can be used normally while that bookie service is down.
To Reproduce
Steps to reproduce the behavior: run the decommission command on the faulty bookie; it continuously prints "Count of Ledgers which need to be rereplicated: 1" for a whole day without ending.
System configuration
Pulsar version: 2.9.2, BookKeeper version: 4.14.4