cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

Proper way to restart services #15

Closed jazzl0ver closed 6 years ago

jazzl0ver commented 6 years ago

Hello. This might be a dumb question, I apologize. What is the correct way to restart a service (in terms of the cluster)? For example, how do I restart Kafka containers safely?

JuniusLuo commented 6 years ago

Could you please share more about the use case? Why do you want to restart a service? Do you want a rolling restart? A rolling restart restarts one member at a time, checks the cluster status, and then restarts the next member. Or do you simply want to stop the service when the system is idle and start it again when the load arrives?

jazzl0ver commented 6 years ago

Actually, this came from our developers, who had some strange issues with Kafka and asked me to restart the service (since this is the easiest way to narrow down the issue from their point of view). I can also imagine that, due to a software bug or something, a Kafka (or other app) container might just hang and need to be killed and restarted. It would be great to have a legitimate way to do that. So both restart types would be awesome to have!

JuniusLuo commented 6 years ago

Thanks for sharing the details! The rolling restart is not supported yet.

To stop all members and then start them again, you can simply use the stop-service and start-service commands. Stop-service stops all containers of one service; start-service starts them again. Example:

firecamp-service-cli -cluster=mycluster -region=us-east-1 -op=stop-service/start-service -service-name=mykafka

Note that stop-service simply stops all containers. When a container is stopped, SIGTERM is sent to its main process. kafka-server-stop.sh also sends SIGTERM to the Kafka process, so simply stopping the container is safe for Kafka. We will enhance this to stop the members one by one.
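
As a rough illustration of the stop-then-start sequence described above, the two calls could be wrapped in a small shell function. This is only a sketch: `full_restart` and the `FIRECAMP_CLI` variable are hypothetical names, not part of FireCamp, and by default the function only prints the commands (a dry run). The flags are the ones quoted in this thread.

```shell
#!/bin/sh
# Sketch only: full (non-rolling) restart = stop-service, then start-service.
# FIRECAMP_CLI is a hypothetical variable. It defaults to a dry run that just
# prints each command; set FIRECAMP_CLI=./firecamp-service-cli to run for real.
FIRECAMP_CLI="${FIRECAMP_CLI:-echo ./firecamp-service-cli}"

# full_restart CLUSTER REGION SERVICE (hypothetical helper)
full_restart() {
  fr_cluster="$1"
  fr_region="$2"
  fr_service="$3"

  # Stop every container of the service; give up if the stop fails.
  $FIRECAMP_CLI -cluster="$fr_cluster" -region="$fr_region" \
    -op=stop-service -service-name="$fr_service" || return 1

  # Start the containers again.
  $FIRECAMP_CLI -cluster="$fr_cluster" -region="$fr_region" \
    -op=start-service -service-name="$fr_service"
}
```

For example, `full_restart mycluster us-east-1 mykafka` prints the stop-service command followed by the start-service command. Whether stop-service waits for the containers to exit before returning is not stated in this thread, so a real script may need a pause or a status check between the two calls.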

jazzl0ver commented 6 years ago
# ./firecamp-service-cli -cluster=firecamp-qa -region=us-east-1 -op=stop-service -service-name=kafka-qa
StopService error InvalidArgs: Bad Request

the cli version is the latest (0.9.2)

JuniusLuo commented 6 years ago

I wonder how this could happen. Is the server version also 0.9.2?

jazzl0ver commented 6 years ago

oops.. it's definitely not. how to upgrade the server to 0.9.2?

JuniusLuo commented 6 years ago

For now, you would have to upgrade manually. I will write down the upgrade procedure.

JuniusLuo commented 6 years ago

Just a check. Which version are you currently running?

jazzl0ver commented 6 years ago

wondering how to check that..

JuniusLuo commented 6 years ago

Log in to any node and run: sudo docker plugin ls. It will show the firecamp volume plugin version, which is the version the cluster is running.

jazzl0ver commented 6 years ago
# sudo docker plugin ls
ID                  NAME                               DESCRIPTION                         ENABLED
6afd9c9e340b        cloudstax/firecamp-volume:latest   firecamp volume plugin for docker   true

It was started on Dec 10th.

JuniusLuo commented 6 years ago

You are using the latest version, which is built from the master branch, not the release branch. The master branch is under active development, so please do not use it for production. It is fine for testing, but things may be broken. It would be better to use a release.
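
The tag check discussed above can be scripted. The following is a minimal sketch, assuming the plugin is listed under the name `cloudstax/firecamp-volume` as in the output shown in this thread; `firecamp_plugin_tag` is a hypothetical helper name, not a FireCamp command.

```shell
#!/bin/sh
# Sketch only: read `docker plugin ls` output on stdin and print the tag of the
# cloudstax/firecamp-volume plugin. firecamp_plugin_tag is a hypothetical name.
firecamp_plugin_tag() {
  awk '$2 ~ /^cloudstax\/firecamp-volume:/ {
    tag = $2
    sub(/^.*:/, "", tag)   # drop everything up to the last colon
    print tag
  }'
}
```

Usage: `sudo docker plugin ls | firecamp_plugin_tag`. For the listing in this thread the result is `latest`, which corresponds to the master-branch build rather than a numbered release.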

There is no need to upgrade. You could use the latest cli, https://s3.amazonaws.com/cloudstax/firecamp/releases/latest/packages/firecamp-service-cli.tgz, to stop service.

jazzl0ver commented 6 years ago

Same thing:

[root@dev ~]# wget https://s3.amazonaws.com/cloudstax/firecamp/releases/latest/packages/firecamp-service-cli.tgz
--2018-01-18 17:19:25--  https://s3.amazonaws.com/cloudstax/firecamp/releases/latest/packages/firecamp-service-cli.tgz
Resolving s3.amazonaws.com... 52.216.96.21
Connecting to s3.amazonaws.com|52.216.96.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2487489 (2.4M) [application/gzip]
Saving to: “firecamp-service-cli.tgz”

100%[==========================================================================================>] 2,487,489   --.-K/s   in 0.03s

2018-01-18 17:19:25 (74.2 MB/s) - “firecamp-service-cli.tgz” saved [2487489/2487489]

[root@dev ~]# tar zxf firecamp-service-cli.tgz

[root@dev ~]# ./firecamp-service-cli -cluster=firecamp-qa -region=us-east-1 -service-name=kafka-qa -op=stop-service
StopService error InvalidArgs: Bad Request

JuniusLuo commented 6 years ago

Weird. Let me try it.

JuniusLuo commented 6 years ago

Just set up a testbed, created a service, stopped and started correctly. Probably the manage service docker image is a little too old on your testbed. Could you please try to get the latest manage service docker image? And try stop-service again?

To get the latest manage service docker image.

jazzl0ver commented 6 years ago

That did the trick, thank you!

JuniusLuo commented 6 years ago

Cool!

Just one additional question: are you using ECS or Swarm?

jazzl0ver commented 6 years ago

ECS

JuniusLuo commented 6 years ago

Added initial rolling restart support. The service containers are deleted and recreated one by one: only one container is deleted at a time, and the next container is deleted only after the previous one has been recreated. The highest replica is restarted first. You could try the new restart-service cli. Note: restart-service is different from stop-service and start-service. restart-service does a rolling restart, while stop-service/start-service simply stops/starts all service containers in parallel.
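
The ordering described above can be pictured with a small loop. This is an illustrative sketch, not FireCamp's implementation: `rolling_restart`, `restart_member` and `wait_until_running` are hypothetical names, the two helpers are stubs that only print what a real implementation would do, and numbering members from 0 is an assumption this thread does not confirm.

```shell
#!/bin/sh
# Sketch only: the ordering of a rolling restart as described in this thread.
# restart_member and wait_until_running are hypothetical stubs; here they only
# print what a real implementation would do.
restart_member() { echo "recreate member $1"; }
wait_until_running() { echo "member $1 running"; }

# rolling_restart REPLICAS: restart members REPLICAS-1 .. 0, one at a time.
# Assumes members are numbered from 0, which this thread does not state.
rolling_restart() {
  rr_i=$(( $1 - 1 ))
  while [ "$rr_i" -ge 0 ]; do
    restart_member "$rr_i" || return 1       # delete and recreate one container
    wait_until_running "$rr_i" || return 1   # wait before touching the next one
    rr_i=$(( rr_i - 1 ))
  done
}
```

Calling `rolling_restart 3` walks members 2, 1, 0 in that order and never starts on the next member before the previous one is reported running. By contrast, stop-service and start-service act on all containers in parallel.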

jazzl0ver commented 6 years ago

Thanks a bunch! I'll let you know if I face any issues

JuniusLuo commented 6 years ago

Thanks!

cloudstax commented 6 years ago

Closing this issue. If you find any bugs, please reopen it or create a new issue. Thanks!