apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.23k stars 3.58k forks source link

Failed to write entry (11, 0): Bookie handle is not available #2879

Closed wzdushu closed 6 years ago

wzdushu commented 6 years ago

Hi: I am currently deploying a multi-cluster on bare metal. Reference example:https://pulsar.apache.org/docs/en/deploy-bare-metal-multi-cluster/ my pulsar version: 2.2.0 os release: CentOS Linux release 7.5.1804 (Core) my cluster node: server057.opsadmin.com server058.opsadmin.com server059.opsadmin.com

Running the bookiesanity command for: /ssd/pulsar/bin/bookkeeper shell bookiesanity The following exception occurred: 18:06:38.861 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now. 18:06:39.178 [main] INFO org.apache.bookkeeper.client.BookKeeper - Weighted ledger placement is not enabled 18:06:39.396 [main-EventThread] INFO org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [server057.opsadmin:3181] for ledger: 11 18:06:39.397 [main] INFO org.apache.bookkeeper.bookie.BookieShell - Created ledger 11 18:06:39.796 [bookkeeper-io-3-1] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: [id: 0xc246b03c, L:/172.20.10.57:41056 - R:server057.opsadmin/172.20.10.57:3181] 18:06:39.810 [bookkeeper-io-3-1] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Successfully connected to bookie: server057.opsadmin:3181 18:06:39.837 [bookkeeper-io-3-1] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - connection [id: 0xc246b03c, L:/172.20.10.57:41056 - R:server057.opsadmin/172.20.10.57:3181] authenticated as BookKeeperPrincipal{ANONYMOUS} 18:06:39.870 [bookkeeper-io-3-1] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 0xc246b03c, L:/172.20.10.57:41056 ! R:server057.opsadmin/172.20.10.57:3181] 18:06:39.881 [BookKeeperClientWorker-OrderedExecutor-11-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (11, 0): Bookie handle is not available 18:06:39.903 [BookKeeperClientWorker-OrderedExecutor-11-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Could not get additional bookie to remake ensemble, closing ledger: 11 18:06:39.903 [BookKeeperClientWorker-OrderedExecutor-11-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Closing ledger 11 due to NotEnoughBookiesException: Not enough non-faulty bookies available 18:06:39.904 [BookKeeperClientWorker-OrderedExecutor-11-0] ERROR org.apache.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum failed: L11 E0 18:06:39.905 [main] WARN org.apache.bookkeeper.bookie.BookieShell - Error in bookie sanity test org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1543) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:520) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_191] 18:06:39.926 [main] INFO org.apache.bookkeeper.bookie.BookieShell - Deleted ledger 11 18:06:39.926 [main] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Closing the per channel bookie client for server057.opsadmin.com:3181 18:06:39.961 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x3002bd8b1840012 closed 18:06:39.961 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3002bd8b1840012

sijie commented 6 years ago

@nanguawu :

where did you run bookkeeper sanity? was it in the bookie machine?

can you run bookkeeper shell listbookies -rw to get the list of bookies alive in the cluster?

Closing ledger 11 due to NotEnoughBookiesException: Not enough non-faulty bookies available

This error message indicates that there is no bookies alive. I am suspecting that bookies advertised a non-reachable address.

wzdushu commented 6 years ago

The zookeeper that comes with pulsar is normal, always suggesting that bookie does not exist. cluster node: server057.opsadmin.com

[root@server057 pulsar]# ./bin/bookkeeper shell listbookies -rw 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler= 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64 11:33:40.444 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=3.10.0-862.9.1.el7.x86_64 11:33:40.445 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=root 11:33:40.445 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root 11:33:40.445 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/ssd/pulsar 11:33:40.447 [main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=server057.opsadmin.com:2181,server058.opsadmin.com:2181,server059.opsadmin.com:2181 sessionTimeout=30000 watcher=org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase@363a52f 11:33:40.479 [main-SendThread(server058.opsadmin.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server server058.opsadmin.com/172.20.10.58:2181. Will not attempt to authenticate using SASL (unknown error) 11:33:40.494 [main-SendThread(server058.opsadmin.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to server058.opsadmin.com/172.20.10.58:2181, initiating session 11:33:40.503 [main-SendThread(server058.opsadmin.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server server058.opsadmin.com/172.20.10.58:2181, sessionid = 0x20304ff08cf0013, negotiated timeout = 30000 11:33:40.514 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is connected now. No bookie exists! 11:33:40.545 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x20304ff08cf0013 closed 11:33:40.545 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x20304ff08cf0013

sijie commented 6 years ago

yeah. it seems that bookie doesn't advertised itself to zookeeper. do you mind sharing the logs of bookies? @nanguawu

wzdushu commented 6 years ago

Bookie server startup log has been added to the attachment

pulsar-bookie-server057.opsadmin.cn.log

sijie commented 6 years ago

@nanguawu it seems that the bookie doesn't finish starting up. is your bookie still running? can you take a jstack trace? run jstack <pid>?

wzdushu commented 6 years ago

But the bookie process is already up,the attachment is the result of the implementation of "jstack "

[root@server057 ~]# ps -ef |grep bookie root 20119 1 0 Oct30 ? 00:01:19 /opt/jdk1.8.0_191/bin/java -cp /ssd/pulsar/conf:::/ssd/pulsar/lib/*: -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Xms2g -Xmx2g -XX:MaxDirectMemorySize=4g -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -Dpulsar.log.appender=RollingFile -Dpulsar.log.dir=/ssd/pulsar/logs -Dpulsar.log.level=info -Dpulsar.routing.appender.default=Console -Dpulsar.functions.process.container.log.dir=/ssd/pulsar/logs -Dpulsar.functions.java.instance.jar=/ssd/pulsar/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/ssd/pulsar/instances/python-instance/python_instance_main.py -Dpulsar.log.file=pulsar-bookie-server057.opsadmin.com.log org.apache.bookkeeper.proto.BookieServer --conf /ssd/pulsar/conf/bookkeeper.conf [root@server057 ~]# jstack 20119 > /tmp/20119.jstack.log 20119.jstack.log

sijie commented 6 years ago

@nanguawu :

But the bookie process is already up

the process is running, but from the jstack, I didn't seem any journal thread BookieJournal seems it doesn't complete startup.

did you have find anything more in the logs?

Can you restart one bookie and share me with the new log?

wzdushu commented 6 years ago

https://github.com/apache/pulsar/files/2529415/pulsar-bookie-server057.opsadmin.cn.log

There is a log i restarted . I have provided before. Below is another log file

cat /ssd/pulsar/logs/pulsar-bookie-server057.opsadmin.com.out

Exception in thread "main" java.lang.RuntimeException: java.net.BindException: Address already in use at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:126) at org.apache.bookkeeper.server.service.StatsProviderService.doStart(StatsProviderService.java:51) at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:78) at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$2(LifecycleComponentStack.java:113) at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:408) at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:113) at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:80) at org.apache.bookkeeper.server.Main.doMain(Main.java:227) at org.apache.bookkeeper.server.Main.main(Main.java:201) at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:280) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:433) at sun.nio.ch.Net.bind(Net.java:425) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:317) at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:235) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.eclipse.jetty.server.Server.doStart(Server.java:390) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:123) ... 9 more [root@server057 pulsar]# cat data/bookkeeper/ledgers/current/VERSION 4 bookieHost: "server057.opsadmin.com:3181" journalDir: "data/bookkeeper/journal" ledgerDirs: "1\tdata/bookkeeper/ledgers" instanceId: "f10937c4-71dc-4512-bfe5-067fa7d43bf8" [root@server057 pulsar]# cat data/bookkeeper/journal/current/VERSION 4 bookieHost: "server057.opsadmin.com:3181" journalDir: "data/bookkeeper/journal" ledgerDirs: "1\tdata/bookkeeper/ledgers" instanceId: "f10937c4-71dc-4512-bfe5-067fa7d43bf8" [root@server057 pulsar]# egrep -v '^#|^$' conf/bookkeeper.conf bookiePort=3181 journalDirectory=data/bookkeeper/journal minUsableSizeForIndexFileCreation=1073741824 advertisedAddress=server057.opsadmin.com allowLoopback=false bookieDeathWatchInterval=1000 flushInterval=60000 useHostNameAsBookieID=false gcWaitTime=900000 gcOverreplicatedLedgerWaitTime=86400000 numAddWorkerThreads=0 numReadWorkerThreads=8 numHighPriorityWorkerThreads=8 maxPendingReadRequestsPerThread=2500 maxPendingAddRequestsPerThread=10000 auditorPeriodicBookieCheckInterval=86400 rereplicationEntryBatchSize=5000 openLedgerRereplicationGracePeriod=30 autoRecoveryDaemonEnabled=true lostBookieRecoveryDelay=0 serverTcpNoDelay=true journalMaxSizeMB=2048 journalMaxBackups=5 journalPreAllocSizeMB=16 journalWriteBufferSizeKB=64 journalRemoveFromPageCache=true journalSyncData=true journalAdaptiveGroupWrites=true journalMaxGroupWaitMSec=1 journalBufferedWritesThreshold=524288 numJournalCallbackThreads=8 journalAlignmentSize=4096 journalFlushWhenQueueEmpty=false ledgerStorageClass=org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage ledgerDirectories=data/bookkeeper/ledgers auditorPeriodicCheckInterval=604800 openFileLimit=0 pageLimit=0 ledgerManagerFactoryClass=org.apache.bookkeeper.meta.HierarchicalLedgerManagerFactory zkLedgersRootPath=/ledgers logSizeLimit=1073741824 entryLogFilePreallocationEnabled=true readBufferSizeBytes=4096 writeBufferSizeBytes=65536 compactionRate=1000 minorCompactionThreshold=0.2 minorCompactionInterval=3600 compactionMaxOutstandingRequests=100000 majorCompactionThreshold=0.5 majorCompactionInterval=86400 isThrottleByBytes=false compactionRateByEntries=1000 compactionRateByBytes=1000000 statsProviderClass=org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider prometheusStatsHttpPort=8000 readOnlyModeEnabled=true diskUsageThreshold=0.95 diskCheckInterval=10000 zkServers=server057.opsadmin.com:2181,server058.opsadmin.com:2181,server059.opsadmin.com:2181 zkTimeout=30000 zkEnableSecurity=false dbStorage_writeCacheMaxSizeMb=512 dbStorage_readAheadCacheMaxSizeMb=256 dbStorage_readAheadCacheBatchSize=1000 dbStorage_rocksDB_blockCacheSize=268435456 dbStorage_rocksDB_writeBufferSizeMB=64 dbStorage_rocksDB_sstSizeInMB=64 dbStorage_rocksDB_blockSize=65536 dbStorage_rocksDB_bloomFilterBitsPerKey=10 dbStorage_rocksDB_numLevels=-1 dbStorage_rocksDB_numFilesInLevel0=4 dbStorage_rocksDB_maxSizeInLevel1MB=256

wzdushu commented 6 years ago

This is the data that the zookeeper got bookies.

ode : /bookies data : b'{}' stat : ZnodeStat(czxid=21474836493, mzxid=21474836493, ctime=1540801542043, mtime=1540801542043, version=0, cversion=0, aversion=0, ephemeralOwn er=0, dataLength=2, numChildren=0, pzxid=21474836493)child : []

sijie commented 6 years ago

@nanguawu

cat /ssd/pulsar/logs/pulsar-bookie-server057.opsadmin.com.out

Exception in thread "main" java.lang.RuntimeException: java.net.BindException: Address already in use at

it seems that bookie fails to start due to it can't bind to prometheus port. are you running other components in the same port 8000?

sijie commented 6 years ago

@nanguawu

can you run lsof -i :8000 to see what process is using 8000?

wzdushu commented 6 years ago

8000 this port should be the zookeeper port that comes with pulsar。 I seem to know where the prolem is ,Thank you very much! [root@server057 pulsar]# lsof -i:8000 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 33839 root 215u IPv4 163379008 0t0 TCP :irdmi (LISTEN) [root@server057 pulsar]# ps -ef |grep 33839 root 2799 1152 0 09:55 pts/0 00:00:00 grep --color=auto 33839 root 33839 1 0 Oct29 ? 00:07:12 /opt/jdk1.8.0_191/bin/java -Dzookeeper.4lw.commands.whitelist= -cp /ssd/pulsar/conf:::/ssd/pulsar/lib/*: -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Xms2g -Xmx2g -XX:MaxDirectMemorySize=4g -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -Dpulsar.log.appender=RollingFile -Dpulsar.log.dir=/ssd/pulsar/logs -Dpulsar.log.level=info -Dpulsar.routing.appender.default=Console -Dpulsar.functions.process.container.log.dir=/ssd/pulsar/logs -Dpulsar.functions.java.instance.jar=/ssd/pulsar/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/ssd/pulsar/instances/python-instance/python_instance_main.py -Dpulsar.log.file=pulsar-zookeeper-server057.opsadmin.com.log org.apache.pulsar.zookeeper.ZooKeeperStarter /ssd/pulsar/conf/zookeeper.conf

Solution: modify the monitoring port opened by bookeeper to prometheus。

[root@server057 pulsar]#sed -i 's/prometheusStatsHttpPort\=8000/prometheusStatsHttpPort\=8900/' conf/bookkeeper.conf [root@server057 pulsar]#./bin/pulsar-daemon stop bookie [root@server057 pulsar]#./bin/pulsar-daemon start bookie [root@server057 pulsar]#/ssd/pulsar/bin/bookkeeper shell bookiesanity
10:04:28.128 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x3002bd8b1840022 closed 10:04:28.128 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3002bd8b1840022 10:04:28.128 [main] INFO org.apache.bookkeeper.bookie.BookieShell - Bookie sanity test succeeded

sijie commented 6 years ago

@nanguawu that's awesome. glad that you resolved the problem :) feel free to ping us if you have any other issues.