apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] bookie-2 is not able to recover after losing the filesystem #23047

Open cccdemon opened 3 months ago

cccdemon commented 3 months ago

Search before asking

Read release policy

Version

Kubernetes 1.29
Pulsar 3.2.3

Minimal reproduce step

1. Scale down to 1 bookie.
2. Delete the related PVC/PV for bookie-2.

What did you expect to see?

A full recovery with 1 bookie.

What did you see instead?

2024-07-17T10:10:23,047+0000 [main] ERROR org.apache.bookkeeper.bookie.LegacyCookieValidation - There are directories without a cookie, and this is neither a new environment, nor is storage expansion enabled. Empty directories are [/pulsar/data/bookkeeper/journal/current, /pulsar/data/bookkeeper/ledgers/current]
2024-07-17T10:10:23,048+0000 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException:
    at org.apache.bookkeeper.bookie.LegacyCookieValidation.checkCookies(LegacyCookieValidation.java:113) ~[org.apache.bookkeeper-bookkeeper-server-4.16.5.jar:4.16.5]
    at org.apache.bookkeeper.server.EmbeddedServer$Builder.build(EmbeddedServer.java:408) ~[org.apache.bookkeeper-bookkeeper-server-4.16.5.jar:4.16.5]
    at org.apache.bookkeeper.server.Main.buildBookieServer(Main.java:277) ~[org.apache.bookkeeper-bookkeeper-server-4.16.5.jar:4.16.5]
    at org.apache.bookkeeper.server.Main.doMain(Main.java:216) ~[org.apache.bookkeeper-bookkeeper-server-4.16.5.jar:4.16.5]
    at org.apache.bookkeeper.server.Main.main(Main.java:199) ~[org.apache.bookkeeper-bookkeeper-server-4.16.5.jar:4.16.5]

Anything else?

No response

Are you willing to submit a PR?

vonsch commented 1 month ago

Hello, we also encountered this issue in our environment. We have an HA setup with three bookies across three availability zones, and we lost the bookie storage/disk in one of the availability zones (the Kubernetes PVC and PV were deleted). When a new bookie was started and a new blank Kubernetes PV was auto-provisioned, the bookie failed to start.

We were able to recover without destroying the whole Pulsar deployment by manually removing the broken bookie from the cluster and then restarting the bookie pod:

# Connect to any functional bookie pod
kubectl -n pulsar exec -it pulsar-bookie-0 -- /bin/bash
# Get the proper BookieID from the output
./bin/bookkeeper shell listbookies -a
# Decommission the broken bookie (example BookieID)
./bin/bookkeeper shell decommissionbookie -bookieid pulsar-bookie-1.pulsar-bookie.pulsar.svc.cluster.local:3181
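
For anyone repeating these steps, here is a minimal sketch that assembles the BookieID and the decommission command so they can be reviewed before running them inside a healthy bookie pod. The helper names `bookie_id` and `decommission_cmd` are hypothetical, and the `<pod>.<service>.<namespace>.svc.cluster.local:3181` layout is assumed from the example above; the ID should still be checked against the `listbookies -a` output.

```shell
# Sketch only: these helpers print strings and never touch the cluster.

# Build a BookieID as <pod>.<service>.<namespace>.svc.cluster.local:<port>.
# Assumption: the ID follows the layout of the example above; port defaults to 3181.
bookie_id() {
  pod="$1"; service="$2"; namespace="$3"; port="${4:-3181}"
  printf '%s.%s.%s.svc.cluster.local:%s\n' "$pod" "$service" "$namespace" "$port"
}

# Print the decommission command for the given BookieID.
# The command is meant to be run from inside a functional bookie pod.
decommission_cmd() {
  printf './bin/bookkeeper shell decommissionbookie -bookieid %s\n' "$1"
}
```

Usage, with the pod, service, and namespace names from the example above:

```shell
decommission_cmd "$(bookie_id pulsar-bookie-1 pulsar-bookie pulsar)"
```

Printing the command rather than executing it keeps the destructive step manual, which seems prudent since decommissioning the wrong bookie would remove a healthy node from the cluster.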