cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

Kafka Zookeeper logs show connection refused #11

Closed: jazzl0ver closed this issue 6 years ago

jazzl0ver commented 6 years ago

I used the CloudFormation template firecamp-existingvpc to roll out the environment, then created a ZooKeeper service following the instructions (for Kafka). Here is what appears in the CloudWatch logs (firecamp-qa-zoo-qa):

2017-12-01 08:20:31,031 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumPeer$QuorumServer@167] - Resolved hostname: zoo-qa-1.firecamp-qa-firecamp.com to address: zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62
2017-12-01 08:21:31,029 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager$Listener@746] - Received connection request /172.22.2.62:59352
2017-12-01 08:21:31,030 [myid:3] - WARN [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager@588] - Cannot open channel to 2 at election address zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:479)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:379)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:757)
2017-12-01 08:21:31,033 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumPeer$QuorumServer@167] - Resolved hostname: zoo-qa-1.firecamp-qa-firecamp.com to address: zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62
2017-12-01 08:22:31,031 [myid:3] - INFO [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager$Listener@746] - Received connection request /172.22.2.62:59356
2017-12-01 08:22:31,032 [myid:3] - WARN [zoo-qa-2.firecamp-qa-firecamp.com/172.22.5.201:3888:QuorumCnxManager@588] - Cannot open channel to 2 at election address zoo-qa-1.firecamp-qa-firecamp.com/172.22.2.62:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:562)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:479)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:379)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:757)

Is that OK?

I see that 172.22.2.62 has an established connection with the third ZooKeeper instance (172.22.1.119):

[root@ip-172-22-2-62 ec2-user]# netstat -anp|grep 3888
tcp        0      0 ::ffff:172.22.2.62:59908    ::ffff:172.22.5.201:3888    TIME_WAIT   -
tcp        0      0 ::ffff:172.22.2.62:52050    ::ffff:172.22.1.119:3888    ESTABLISHED 3637/java
JuniusLuo commented 6 years ago

A "connection refused" error can occur when the peer container is not initialized yet. The same would happen with ZooKeeper running directly on EC2 if the peer EC2 instance were not initialized yet.

The "connection refused" errors should disappear once all containers are running. Was the ZooKeeper cluster successfully initialized after that? If so, these error messages can be safely ignored.

jazzl0ver commented 6 years ago

It appears that after some time it works without issues or such log entries. I apologize for the false alarm.

JuniusLuo commented 6 years ago

Reports of any potential issue are welcome :) Please feel free to report them. Thanks!

jazzl0ver commented 6 years ago

I found out what's going on. Steps to reproduce:

  1. Create the firecamp stack with 3 t2.small instances
  2. Create the services:
    ./firecamp-service-cli -op=create-service -service-type=zookeeper -region=us-east-1 -cluster=firecamp-qa -service-name=zoo-qa -replicas=3 -volume-size=20 -zk-heap-size=512
    ./firecamp-service-cli -op=create-service -service-type=kafka -region=us-east-1 -cluster=firecamp-qa -replicas=3 -volume-size=100 -service-name=kafka-qa -kafka-zk-service=zoo-qa -kafka-heap-size=512
    ./firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=firecamp-qa -service-name=cass-qa -replicas=3 -volume-size=100 -journal-volume-size=10
  3. After some time, the memory on the EC2 instances is exhausted, which leads to these errors.

Could you please add a heap size option for the Cassandra service, like the ones firecamp-service-cli has for Kafka and ZooKeeper (if appropriate)?

JuniusLuo commented 6 years ago

Currently we simply rely on Cassandra to calculate the max heap size. Cassandra calculates it automatically with the formula max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)). We could add a configurable max heap size for Cassandra.
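A minimal Go sketch of that default formula, applied to the reproduction above. It assumes a t2.small has 2 GiB (2048 MB) of RAM and that each of the three nodes hosts one replica of each service; `cassandraMaxHeapMB` is a hypothetical helper, not Cassandra's or FireCamp's code.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// cassandraMaxHeapMB applies the default formula quoted above,
// max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)), to a RAM size in MB.
func cassandraMaxHeapMB(ramMB int) int {
	half := minInt(ramMB/2, 1024)    // 1/2 RAM, capped at 1024 MB
	quarter := minInt(ramMB/4, 8192) // 1/4 RAM, capped at 8 GB
	return maxInt(half, quarter)
}

func main() {
	// Assumption: a t2.small instance has 2048 MB of RAM.
	const t2SmallMB = 2048
	cass := cassandraMaxHeapMB(t2SmallMB)
	// Heap sizes from the reproduction steps: -zk-heap-size=512, -kafka-heap-size=512.
	total := 512 + 512 + cass
	fmt.Printf("cassandra default heap on %d MB: %d MB\n", t2SmallMB, cass)
	fmt.Printf("zookeeper + kafka + cassandra heaps: %d MB of %d MB RAM\n", total, t2SmallMB)
}
```

Under these assumptions the three JVM heaps alone add up to 2048 MB, the node's entire RAM, before counting off-heap memory or the OS, which is consistent with the memory exhaustion reported above.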

Note: for testing, it is OK to run multiple services on the same node. For production, it is better to dedicate each node to a single service, to avoid this kind of impact.

cloudstax commented 6 years ago

The fix for a configurable Cassandra heap size is committed to master. Please let us know if you hit any issue!

jazzl0ver commented 6 years ago

Thank you very much! Is it possible to change the heap size on the running cluster?

JuniusLuo commented 6 years ago

Unfortunately, no. Changing the heap size on a running cluster is actually a service update operation, which is not supported yet. It is tracked by the issue "Implement update-service operation in firecamp-service-cli" and will be supported when that issue is fixed.

jazzl0ver commented 6 years ago

@JuniusLuo, I'm trying to build firecamp-service-cli with the latest changes, but it fails. Most likely I set up the Go environment wrong. Could you please share the build steps? At the moment firecamp is cloned into my home folder, /home/user/firecamp. When I cd to syssvc/firecamp-service-cli and run go build, I get:

main.go:17:2: cannot find package "github.com/cloudstax/firecamp/catalog/cassandra" in any of:
        /home/user/go/src/firecamp/vendor/github.com/cloudstax/firecamp/catalog/cassandra (vendor tree)
        /usr/lib/golang/src/github.com/cloudstax/firecamp/catalog/cassandra (from $GOROOT)
        /home/user/go/src/github.com/cloudstax/firecamp/catalog/cassandra (from $GOPATH)
JuniusLuo commented 6 years ago

go build does not work yet. Take a look at the Makefile. To build the CLI, run 'make install'; you will then get firecamp-service-cli under $GOPATH/bin.

jazzl0ver commented 6 years ago

Yeah, I did take a look and tried to run 'go install' in the firecamp-service-cli folder, with the same result as above. 'make install' in the firecamp root fails with:

$ make install
./scripts/install.sh
+ protoc -I db/controldb/protocols/ db/controldb/protocols/controldb.proto --go_out=plugins=grpc:db/controldb/protocols
controldb.proto:3:10: Unrecognized syntax identifier "proto3".  This parser only recognizes "proto2".
make: *** [install] Error 1

Could you please build the latest CLI and update the link at https://github.com/cloudstax/firecamp/tree/master/docs/installation#the-firecamp-service-cli ?

jazzl0ver commented 6 years ago

Alright, I've figured it out. The following procedure made it work:

$ pwd
/home/user
$ go get github.com/cloudstax/firecamp
can't load package: package github.com/cloudstax/firecamp: no Go files in /home/user/go/src/github.com/cloudstax/firecamp
$ cd /home/user/go/src/github.com/cloudstax/firecamp/syssvc/firecamp-service-cli
$ go build
$ ls -1
firecamp-service-cli
main.go

It might be good to add these build instructions to the wiki.

JuniusLuo commented 6 years ago

The build failure is caused by the old, proto2-only protoc on your local machine. We should include proto3 in the vendor directory to fix this dependency issue.

JuniusLuo commented 6 years ago

Yes, you can build the CLI directly. This is a good suggestion; I added a "make cli" target to the Makefile.

JuniusLuo commented 6 years ago

By the way, you can always get the latest CLI from https://s3.amazonaws.com/cloudstax/firecamp/releases/latest/packages/firecamp-service-cli.tgz