apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Failed to build bookie server in kubernetes environment #7653

Open walterEri opened 4 years ago

walterEri commented 4 years ago

Describe the bug: We deployed Pulsar using the Helm chart in an OpenShift Kubernetes environment. When we scale the bookie back up after deleting its volume, we encounter the issue below.

2.284: Total time for which application threads were stopped: 0.0001663 seconds, Stopping threads took: 0.0001040 seconds
10:28:55.441 [main] ERROR org.apache.bookkeeper.bookie.Bookie - There are directories without a cookie, and this is neither a new environment, nor is storage expansion enabled. Empty directories are [data/bookkeeper/journal/current, data/bookkeeper/ledgers/current]
10:28:55.441 [main] INFO  org.apache.bookkeeper.proto.BookieNettyServer - Shutting down BookieNettyServer
2.404: Total time for which application threads were stopped: 0.0002793 seconds, Stopping threads took: 0.0001671 seconds
2.405: Total time for which application threads were stopped: 0.0002462 seconds, Stopping threads took: 0.0001874 seconds
2.406: Total time for which application threads were stopped: 0.0002902 seconds, Stopping threads took: 0.0001068 seconds
2.413: Total time for which application threads were stopped: 0.0005099 seconds, Stopping threads took: 0.0003770 seconds
2.413: Total time for which application threads were stopped: 0.0001357 seconds, Stopping threads took: 0.0000875 seconds
2.415: Total time for which application threads were stopped: 0.0004179 seconds, Stopping threads took: 0.0001708 seconds
2.417: Total time for which application threads were stopped: 0.0001661 seconds, Stopping threads took: 0.0000410 seconds
2.418: Total time for which application threads were stopped: 0.0003534 seconds, Stopping threads took: 0.0002680 seconds
2.419: Total time for which application threads were stopped: 0.0003882 seconds, Stopping threads took: 0.0003552 seconds
2.419: Total time for which application threads were stopped: 0.0000419 seconds, Stopping threads took: 0.0000138 seconds
2.420: Total time for which application threads were stopped: 0.0000399 seconds, Stopping threads took: 0.0000124 seconds
10:28:55.461 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException: 
    at org.apache.bookkeeper.bookie.Bookie.checkEnvironmentWithStorageExpansion(Bookie.java:468) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:250) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:688) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.proto.BookieServer.newBookie(BookieServer.java:136) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:105) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.server.service.BookieService.<init>(BookieService.java:41) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.server.Main.buildBookieServer(Main.java:301) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.server.Main.doMain(Main.java:221) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.server.Main.main(Main.java:203) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
    at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:313) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
2.425: Total time for which application threads were stopped: 0.0000542 seconds, Stopping threads took: 0.0000156 seconds

To Reproduce (steps to reproduce the behavior):

  1. Scale down the bookie pods.
  2. Delete the persistent volume claim and persistent volume of the removed bookie pod.
  3. Scale the bookie pods back up.

Expected behavior: Scaling the bookie pods back up should not cause this error.

sijie commented 4 years ago

@walterEri It seems that the bookie is using an OLD volume that contains OLD data. You need to format the disk if you re-use an OLD volume for a new bookie.

walterEri commented 4 years ago

In my scenario I have 3 bookies, and each pod is attached to a separate PV and PVC. I then scaled down to 2 pods and deleted the PV and PVC for the 3rd pod, so there is no OLD volume for the 3rd pod.

Then I scaled back up to 3 bookies, which creates a new (3rd) pod and attaches a new PV and PVC to it. Even so, I am getting the same error.

sijie commented 4 years ago

I see. @walterEri when you scale down, you need to use the bin/bookkeeper shell decommissionbookie command to decommission the removed bookie before scaling up.
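
For context, the error in the log can be read as the outcome of a cookie check at bookie startup. The following is a simplified Python model of that decision, not BookKeeper's actual code: the names `BookieEnv`, `check_environment`, and `InvalidCookieError` are made up for illustration, and the assumption that the re-created pod comes back under the same bookie ID (so its old cookie is still registered in the metadata store) is inferred from the thread rather than confirmed in it.

```python
from dataclasses import dataclass, field
from typing import List


class InvalidCookieError(Exception):
    """Illustrative stand-in for BookieException.InvalidCookieException."""


@dataclass
class BookieEnv:
    # True if the metadata store still holds a cookie for this bookie ID (assumed).
    cookie_registered: bool
    # Journal/ledger directories that already contain a cookie file.
    dirs_with_cookie: List[str] = field(default_factory=list)
    # Journal/ledger directories that are empty (no cookie file).
    dirs_without_cookie: List[str] = field(default_factory=list)
    # Whether storage expansion is enabled in the bookie configuration.
    allow_storage_expansion: bool = False


def check_environment(env: BookieEnv) -> str:
    """Return the startup action, or raise if the state is inconsistent."""
    # "New environment" here means: nothing registered and no cookie on disk.
    is_new_environment = not env.cookie_registered and not env.dirs_with_cookie
    if is_new_environment:
        return "new environment: stamp a fresh cookie"
    if not env.dirs_without_cookie:
        return "existing environment: cookies present"
    if env.allow_storage_expansion:
        return "storage expansion: stamp cookie on the new directories"
    # Neither a new environment nor storage expansion: refuse to start.
    raise InvalidCookieError(
        "directories without a cookie: " + ", ".join(env.dirs_without_cookie)
    )


# Scenario from this thread: a new, empty volume but a stale registration.
recreated_bookie = BookieEnv(
    cookie_registered=True,
    dirs_without_cookie=[
        "data/bookkeeper/journal/current",
        "data/bookkeeper/ledgers/current",
    ],
)
try:
    check_environment(recreated_bookie)
except InvalidCookieError as err:
    print("startup fails:", err)
```

Under this model, a fresh PV gives empty directories, but the leftover registration means the environment is not treated as new, which matches the logged message. Decommissioning the removed bookie is what clears that registration once its ledgers have been re-replicated.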

walterEri commented 4 years ago

I tried decommissioning the bookie, but it ran for more than a day, so I killed the process.

walterEri commented 4 years ago

@sijie My bookie decommission is still not resolved. It takes a very long time to execute. Is there any other way to decommission the bookie?

sijie commented 4 years ago

You need to scale up the auto-recovery job, because decommissioning a bookie requires re-replicating the entries that were originally stored on that bookie.
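
As a rough illustration of why more auto-recovery workers shorten a decommission, here is a back-of-envelope estimate in Python. It is only a sketch: the function name `estimate_rereplication_hours` is hypothetical, the data size and per-worker throughput are invented numbers, and it assumes re-replication throughput scales linearly with the number of workers, which real clusters will only approximate.

```python
def estimate_rereplication_hours(
    data_gib: float, workers: int, gib_per_hour_per_worker: float
) -> float:
    """Idealized time to re-replicate a bookie's data with parallel workers."""
    if workers <= 0 or gib_per_hour_per_worker <= 0:
        raise ValueError("workers and throughput must be positive")
    # Assumption: workers share the load evenly and do not contend.
    return data_gib / (workers * gib_per_hour_per_worker)


# Hypothetical figures: 500 GiB on the removed bookie, 20 GiB/hour per worker.
for workers in (1, 4):
    hours = estimate_rereplication_hours(500, workers, 20)
    print(f"{workers} worker(s): about {hours} hours")
```

With these made-up figures, a single worker needs more than a day, which is the kind of runtime reported earlier in the thread, while four workers would finish in a fraction of that.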

truong-hua commented 2 years ago

Let's say that, in an unexpected accident, we lost all of the bookie data. Is there any option to force-delete the related ledgers and force-decommission the bookie?

Hongten commented 1 year ago

sijie wrote: "@walterEri It seems that the bookie is using an OLD volume that contains OLD data. You need to format the disk if you re-use an OLD volume for a new bookie."

Hi @sijie, could you advise how to check for an OLD volume or OLD data? Thanks.