cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

create cassandra service error EOF #23

Closed jazzl0ver closed 6 years ago

jazzl0ver commented 6 years ago
# ./firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=test-fc -service-name=cass-test-fc -replicas=1 -volume-size=10 -journal-volume-size=1
create cassandra service error EOF

The Cassandra service itself starts without issues, though. Zookeeper does not show this problem. The CLI and manageserver are the latest versions.

JuniusLuo commented 6 years ago

I have not hit this issue before. Could you please post the manage server log?

Please retry the command. The Cassandra service init task may not have been executed.

jazzl0ver commented 6 years ago

firecamp-test-fc.log.gz

Tried one more time with the latest release and got the same issue.

Update: this happens when replicas is set to 1.

JuniusLuo commented 6 years ago

The server-side log looks good. The Cassandra service was successfully created and initialized. The EOF error looks like a broken connection. If it is easy to reproduce, could you please collect a network trace? I added some more logging. Please retry with the latest CLI. Thanks!

jazzl0ver commented 6 years ago
# ./firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=test-fc -service-name=cass-test-fc -replicas=1 -volume-size=10 -journal-volume-size=1 -volume-encrypted=true -journal-volume-encrypted=true -cas-heap-size=512
the heap size is less than 8192. Please increase it for production system
the heap size is lessn than 1024, Cassandra JVM may stall long time at GC
2018-02-07 09:42:28.1350098 +0000 UTC create cassandra service error EOF

In case you want to take a look at a traffic dump:

09:42:22.506326 IP 172.22.2.56.36604 > 172.22.1.72.27040: Flags [P.], seq 1:592, ack 1, win 141, options [nop,nop,TS val 2970911164 ecr 75981], length 591
E.....@.@......8...H..i..A.U........^".....
......(.PUT /?Catalog-Create-Cassandra HTTP/1.1
Host: firecamp-manageserver.test-fc-firecamp.com:27040
User-Agent: Go-http-client/1.1
Content-Length: 416
Accept-Encoding: gzip

{"Service":{"Region":"us-east-1","Cluster":"test-fc","ServiceName":"cass-test-fc"},"Resource":{"MaxCPUUnits":0,"ReserveCPUUnits":256,"MaxMemMB":0,"ReserveMemMB":256},"Options":{"Replicas":1,"Volume":{"VolumeType":"gp2","VolumeSizeGB":10,"Iops":100,"Encrypted":true},"JournalVolume":{"VolumeType":"gp2","VolumeSizeGB":1,"Iops":0,"Encrypted":true},"HeapSizeMB":512,"JmxRemoteUser":"cassandrajmx","JmxRemotePasswd":""}}
09:42:22.506906 IP 172.22.1.72.27040 > 172.22.2.56.36604: Flags [.], ack 592, win 236, options [nop,nop,TS val 75981 ecr 2970911164], length 0
E..4.*@...K....H...8i........A......._.....
..(.....
09:42:28.134767 IP 172.22.1.72.27040 > 172.22.2.56.36604: Flags [P.], seq 1:177, ack 592, win 236, options [nop,nop,TS val 77388 ecr 2970911164], length 176
E....+@...K;...H...8i........A......`......
...L....HTTP/1.1 200 OK
Content-Type: application/json
Server: firecamp
X-Requestid: req-dfe98f04049f41f56678f1951e70036c
Date: Wed, 07 Feb 2018 09:42:28 GMT
Content-Length: 0

JuniusLuo commented 6 years ago

Thanks! Could you please upload the manage service log as well?

jazzl0ver commented 6 years ago

firecamp-manager.log.gz

JuniusLuo commented 6 years ago

Thanks! I found one possible bug. Let me test the fix.

JuniusLuo commented 6 years ago

The fix was committed. Please check whether it works in your environment. Simply stop the firecamp-manageserver task in the ECS console, and ECS will pull the latest manageserver Docker image.

jazzl0ver commented 6 years ago

Now it looks like this:

# ./firecamp-service-cli -op=create-service -service-type=cassandra -region=us-east-1 -cluster=test-fc -service-name=cass-test-fc -replicas=1 -volume-size=10 -journal-volume-size=1 -volume-encrypted=true -journal-volume-encrypted=true -cas-heap-size=512
the heap size is less than 8192. Please increase it for production system
the heap size is lessn than 1024, Cassandra JVM may stall long time at GC
2018-02-07 17:54:15.393241841 +0000 UTC The catalog service is created, jmx user cassandrajmx password fe4868edb5b640ca555e29853594acb8
2018-02-07 17:54:15.393277702 +0000 UTC wait till the service gets initialized
2018-02-07 17:54:15.411316367 +0000 UTC All service containers are running, RunningCount 0

and Cassandra is not running. Log attached: firecamp-manager.log.gz

JuniusLuo commented 6 years ago

The output looks weird. How could the RunningCount be 0? Could you please retry the request?

jazzl0ver commented 6 years ago

Yeah, the next attempt started it up. Thank you!

JuniusLuo commented 6 years ago

I checked the manage server log you attached. The service was successfully created. It looks like a timing window issue: ECS may return a desired count of 0 for the service right after the service is created, so the CLI sees RunningCount 0 equal to the desired count and reports that all containers are running. We could add a check in the CLI: if the desired count is 0, wait and retry. I will add a patch.

JuniusLuo commented 6 years ago

Committed a patch.