apache / bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
https://bookkeeper.apache.org/
Apache License 2.0

Bookkeeper High CPU usage when StreamStorageLifecycleComponent extra component is active #2216

Open aschiazza opened 4 years ago

aschiazza commented 4 years ago

Hi, I'm having trouble with BookKeeper v4.9.2 (I'm using the Pulsar Docker image v2.4.1). When I activate the StreamStorageLifecycleComponent by adding this line

extraServerComponents=org.apache.bookkeeper.stream.server.StreamStorageLifecycleComponent

in bookkeeper.conf, the bookkeeper container is using 100% CPU.

My test environment has 1 ZooKeeper server and 1 BookKeeper bookie.

Here is the table service section of my bookkeeper.conf:

# enable table service api for pulsar function state management
extraServerComponents=org.apache.bookkeeper.stream.server.StreamStorageLifecycleComponent
ignoreExtraServerComponentsStartupFailures=false

##################################################################
##################################################################
# Settings below are used by stream/table service
##################################################################
##################################################################

### Grpc Server ###

# the grpc server port to listen on. default is 4181
storageserver.grpc.port=4181

### Dlog Settings for table service ###

#### Replication Settings
dlog.bkcEnsembleSize=1
dlog.bkcWriteQuorumSize=1
dlog.bkcAckQuorumSize=1

Thanks for your help

eolivelli commented 4 years ago

Are you able to take a thread dump of the JVM, with jstack for instance?
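
Once a dump is available, one common way to locate the busy thread is to match the native thread ID reported by the OS (for example from top -H -p <pid>) against the nid field in the dump. The following is only a minimal sketch of that lookup, assuming JDK 8-style jstack output in which each thread header line contains nid=0x<hex>; the function name threads_by_native_id is hypothetical and not part of any BookKeeper or JDK tooling.

```python
import re

# Matches a jstack thread header such as:
#   "main" #1 prio=5 os_prio=0 tid=0x00007f0a1c00a800 nid=0x1b runnable [...]
# Assumption: nid is printed in hexadecimal, as in JDK 8 dumps.
NID_PATTERN = re.compile(r'^"(?P<name>[^"]+)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def threads_by_native_id(jstack_text):
    """Map each decimal native thread ID to the thread name found in the dump."""
    threads = {}
    for line in jstack_text.splitlines():
        match = NID_PATTERN.match(line)
        if match:
            # Convert the hex nid to the decimal ID that OS tools display.
            threads[int(match.group("nid"), 16)] = match.group("name")
    return threads
```

With the mapping in hand, the decimal thread ID of the thread that is consuming CPU can be looked up directly to get its name, and the corresponding stack can then be read from the dump.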

aschiazza commented 4 years ago

Thanks for the reply.

I have attached the JVM stacktrace and ps output.

jstack_bookkeeper.txt ps_bookkeeper.txt

eolivelli commented 4 years ago

It looks like a problem with some leader election. I can't tell more right now (I don't have my laptop). Is the bookie able to talk to ZooKeeper?

aschiazza commented 4 years ago

I think the bookie is able to talk to ZooKeeper. I tested it this way:
1. I attached a shell to the bookkeeper container.
2. I started a ZooKeeper shell (bin/pulsar zookeeper-shell -server zookeeper:2181).
3. In the zk shell I typed ls /stream/servers/available
4. The result is [172.25.0.6:4181] (that is the IP of the bookie container).

Can I test in some other way?

aschiazza commented 4 years ago

Any suggestions?

sijie commented 4 years ago

@aschiazza do you have the log of the bookies?

aschiazza commented 4 years ago

Hi, thanks for the reply. I attached the log of the bookie. bookkeeper_log.txt

zyllt commented 4 years ago

@aschiazza Hi, I started StreamStorageLifecycleComponent with your configuration, but the address registered in zk is 127.0.0.1:4181. I tried many approaches without success. Is there any other configuration needed? My Pulsar version is 2.5.0.

aschiazza commented 4 years ago

Hi @zyllt, I've attached my bookkeeper conf file. bookkeeper.conf.txt I'm still using version 2.4.1 for my project.

In the functions_worker.yml conf file I've added this line: stateStorageServiceUrl: bk://bookkeeper:4181

I'm using docker-compose for my environment, and the command for the bookie container is /pulsar/bin/bookkeeper bookie

I hope it will help you.

zyllt commented 4 years ago

@aschiazza Thanks for your reply. When I start the bookie with bin/bookkeeper bookie using your bookkeeper conf file, without Docker, and type ls /stream/servers/available in zk, it is still 127.0.0.1:4181. But when I use docker-compose for my environment and the command is bin/pulsar standalone, the output of bin/pulsar zookeeper-shell && ls /stream/servers/available is [172.17.0.2:4181], which is the IP of the bookie container. I am very confused, and I can't find the source code that picks the IP when registering in zk. @sijie Can you give me some suggestions?

aschiazza commented 4 years ago

If you start Pulsar in standalone mode, all components (broker, bookie, zk) start in a single container and are registered on the loopback interface. You should start each component with its own command, for example:
/pulsar/bin/pulsar zookeeper for zk
/pulsar/bin/bookkeeper bookie for the bookie
/pulsar/bin/pulsar broker for the Pulsar broker
/pulsar/bin/pulsar proxy for the Pulsar proxy
/pulsar/bin/pulsar functions-worker for the functions worker

zyllt commented 4 years ago

@aschiazza thanks for your reply. Maybe my previous description was inaccurate. What I really meant is that I start the bookie with bin/bookkeeper bookie and your bookkeeper conf file, without Docker, in my production environment. Because my production environment does not support Docker, I only use Docker for the bookie in my local environment.
First, I tested the bin/pulsar standalone command with Docker; the output of bin/pulsar zookeeper-shell && ls /stream/servers/available is [172.17.0.2:4181].
Second, I started each component (broker, bookie, zk) with the bin/pulsar zookeeper and bin/bookkeeper bookie commands in Docker; the output is [172.17.0.4:4181].
Last, I tested each component (broker, bookie, zk) in my local environment without Docker, but the output is [127.0.0.1:4181].
172.17.0.2 and 172.17.0.4 are the IPs of the bookie containers.

aschiazza commented 4 years ago

@zyllt in your local environment, how do you start the components? With which command? If you start different components on the same machine (without Docker), the IP address is the same for every component (more specifically, any of the addresses assigned to the machine's interfaces, or 127.0.0.1).

zyllt commented 4 years ago

@aschiazza hi, thanks in advance for your reply. Below I describe my problem and test steps in detail.
First, I followed these steps to start Pulsar in my local environment (without Docker):
1. bin/pulsar-daemon start zookeeper for zk
2. bin/pulsar-daemon start bookie for BookKeeper (I used your bookie conf, apart from zkServers)
3. bin/pulsar-daemon start broker for the broker (I start the functions worker with the broker via functionsWorkerEnabled=true)
I typed bin/pulsar zookeeper-shell && ls /stream/servers/available, and the result is [127.0.0.1:4181].
I started the demo function WordCountFunction and typed bin/pulsar-admin functions trigger --fqfn test/test-namespace/WordCountFunction --trigger-value "hello pulsar hello wolrd",
then I got a successful result from bin/pulsar-admin functions querystate --fqfn test/test-namespace/WordCountFunction --key hello.


Second, I tested with docker-compose in my local environment.
I started each component with Docker. I got [172.17.0.3:4181] from ls /stream/servers/available. Then I started the demo function WordCountFunction and triggered it; it succeeded.


Third, I tested in my production environment without Docker.
I followed the first procedure exactly to start the production environment. The difference is that the broker and the bookie are on different machines. I typed bin/pulsar zookeeper-shell && ls /stream/servers/available, and the result is still [127.0.0.1:4181].
Then I started the demo function WordCountFunction and found that it did not start successfully. The log shows that the startup process is stuck at the line org.apache.bookkeeper.clients.impl.channel.StorageServerChannelManager - Added range server (hostname: "127.0.0.1" port: 4181 ) into the channel manager.
I suspect that the function and the StreamStorageServer cannot establish a connection using (hostname: "127.0.0.1" port: 4181 ), because they are not on the same machine.
I should mention that I set stateStorageServiceUrl: bk://10.1.0.112:4181 in the config file functions_worker.yml, where 10.1.0.112 is the IP of my bookie machine. With netstat -ant|grep 4181 I can see that the function machine and the bookie machine have established a connection. I think the hostname 127.0.0.1 is obtained from zk, but I can't be sure.
What I can confirm so far is that 127.0.0.1:4181 in zk is definitely incorrect, because when I started a second bookie machine, StreamStorageServer reported that it was already registered.
So how do I register the real server IP, not 127.0.0.1, in ZooKeeper? Any ideas?
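
The check described above (is the endpoint registered under /stream/servers/available a loopback address that remote clients cannot reach?) can be sketched in a few lines. This is only an illustration that parses the bracketed text printed by ls /stream/servers/available, as quoted in this thread; the helper names parse_endpoints and loopback_endpoints are hypothetical, and the sketch does not change what the storage server registers.

```python
import ipaddress


def parse_endpoints(zk_ls_output):
    """Parse text such as '[172.17.0.4:4181, 127.0.0.1:4181]' into (host, port) pairs."""
    inner = zk_ls_output.strip().strip("[]")
    endpoints = []
    for item in inner.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon so the port is separated from the host.
        host, _, port = item.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def loopback_endpoints(endpoints):
    """Return the endpoints that only a client on the same machine could reach."""
    flagged = []
    for host, port in endpoints:
        try:
            is_loopback = ipaddress.ip_address(host).is_loopback
        except ValueError:
            # Not an IP literal; treat the conventional loopback name as loopback.
            is_loopback = host == "localhost"
        if is_loopback:
            flagged.append((host, port))
    return flagged
```

For the production case above, the registered value [127.0.0.1:4181] would be flagged, while the Docker results such as [172.17.0.4:4181] would not.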


Here is the detailed function startup log

19:27:11.455 [test/test-namespace/WordCountFunction-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Starting Java Instance WordCountFunction :
 Details = tenant: "test"
namespace: "test-namespace"
name: "WordCountFunction"
className: "org.apache.pulsar.functions.api.examples.WordCountFunction"
userConfig: "{\"PublishTopic\":\"test_result\"}"
autoAck: true
parallelism: 1
source {
  typeClassName: "java.lang.String"
  inputSpecs {
    key: "test/test-namespace/test_src"
    value {
    }
  }
  cleanupSubscription: true
}
sink {
  topic: "test/test-namespace/test_result"
  typeClassName: "java.lang.Void"
}
resources {
  cpu: 1.0
  ram: 1073741824
  disk: 10737418240
}
componentType: FUNCTION

19:27:11.455 [test/test-namespace/WordCountFunction-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Load JAR: /usr/local/pulsar-2.5.0/download/pulsar_functions/test/test-namespace/WordCountFunction/0/pulsar-functions-api-examples.jar
19:27:11.467 [test/test-namespace/WordCountFunction-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Initialize function class loader for function WordCountFunction at function cache manager
19:27:11.920 [client-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.clients.impl.channel.StorageServerChannelManager - Added range server (hostname: "127.0.0.1"
port: 4181
) into the channel manager.

aschiazza commented 4 years ago

Hi @zyllt can you post or attach your bookkeeper configuration file?

zyllt commented 4 years ago

hi @aschiazza I've attached my bookkeeper conf file.

bookkeeper.conf.txt

aschiazza commented 4 years ago

@zyllt How many bookie nodes do you have? I think I read only one. With these parameters:

dlog.bkcEnsembleSize=3
dlog.bkcWriteQuorumSize=2
dlog.bkcAckQuorumSize=2

you require an ensemble of 3 bookie nodes (across which segments are spread), and you require an ack quorum of 2. If you have only one bookie node, change these values to 1.
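
As a rough illustration of that rule, the sketch below checks the three replication values against the number of bookies. It relies on the usual BookKeeper ordering (ensemble size >= write quorum >= ack quorum) and on the ensemble not exceeding the available bookies; the function name check_quorum_settings and its parameters are hypothetical and are not part of BookKeeper.

```python
def check_quorum_settings(ensemble_size, write_quorum, ack_quorum, available_bookies):
    """Return a list of problems with the replication settings (empty if none)."""
    problems = []
    if ack_quorum > write_quorum:
        problems.append("ack quorum is larger than write quorum")
    if write_quorum > ensemble_size:
        problems.append("write quorum is larger than ensemble size")
    if ensemble_size > available_bookies:
        # An ensemble cannot be formed from fewer bookies than its size.
        problems.append("ensemble size is larger than the number of bookies")
    return problems
```

With the values above and a single bookie, check_quorum_settings(3, 2, 2, 1) reports that the ensemble size exceeds the number of bookies, while check_quorum_settings(1, 1, 1, 1) reports nothing.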

Another suggestion: check the BookKeeper logs. During startup, the logs should show all configuration parameters read from the conf file.

I've attached a log example bookkeeper.logs.txt