docker / machine

Machine management for a container-centric world
https://docs.docker.com/machine/
Apache License 2.0

Stopping and starting a swarm cluster created with docker-machine in AWS #3291

Open rmelick opened 8 years ago

rmelick commented 8 years ago

Is there any guidance or documentation about the best way to stop a swarm cluster in AWS and start it back up several days later? I would like to shut down my cluster over the weekend to avoid paying for unused AWS instances.

The issue I'm having is that after starting the cluster back up and running the regenerate-certs command on all of the hosts to account for the changed IP addresses, I still see the "No healthy node available in the cluster" error when trying to deploy my containers again.
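
For reference, the stop/start cycle I'm running looks roughly like this (mycluster-master and mycluster-node1 stand in for the names I pass as $1 in the create commands below):

# Stop everything for the weekend
docker-machine stop mycluster-node1 mycluster-master mycluster-manager

# Start it back up later; AWS assigns new public IPs on start
docker-machine start mycluster-manager mycluster-master mycluster-node1

# Regenerate the TLS certs so they match the new IPs
for machine in mycluster-manager mycluster-master mycluster-node1; do
    docker-machine regenerate-certs -f "$machine"
done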

It seems like this might be caused by the --advertise IP address of the swarm agents not being updated to the new IP after they restart. If I SSH to one of the nodes in my cluster, I can see that the Consul IP has been updated successfully, but the agent is still using the old IP address for --advertise.
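
The stale flag is visible by inspecting the swarm containers that docker-machine starts on each host (they are named swarm-agent, and swarm-agent-master on the master; the node name here is just an example):

# Print the arguments the agent container was started with;
# --advertise still shows the public IP from before the restart
docker-machine ssh mycluster-node1 \
    "docker inspect --format '{{.Args}}' swarm-agent"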

Below are the commands I use to create the cluster.

Step 1: Create the node that will host Consul, and start it up

docker-machine create --driver amazonec2 \
        --amazonec2-region eu-central-1 \
        --amazonec2-vpc-id $AWS_VPC_ID \
        --amazonec2-access-key $AWS_ACCESS_KEY_ID \
        --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
        --amazonec2-instance-type t2.micro \
        --amazonec2-subnet-id $AWS_SUBNET_ID \
        --amazonec2-ssh-keypath $AWS_SSH_KEY_PEM_PATH \
        --amazonec2-root-size 40 \
        mycluster-manager
eval $(docker-machine env "mycluster-manager")
docker run -d -p 8500:8500 --restart=always --name=consul progrium/consul -server -bootstrap
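
(Not part of my script, but a quick way to sanity-check the discovery store before adding nodes: Consul's status endpoint should report a leader once the server has bootstrapped.)

curl "http://$(docker-machine ip mycluster-manager):8500/v1/status/leader"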

Step 2: Create swarm master node

CONSUL_IP=$(docker-machine ip "mycluster-manager")
docker-machine create --driver amazonec2 \
          --amazonec2-region eu-central-1 \
          --amazonec2-vpc-id $AWS_VPC_ID \
          --amazonec2-access-key $AWS_ACCESS_KEY_ID \
          --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
          --amazonec2-instance-type m4.xlarge \
          --amazonec2-subnet-id $AWS_SUBNET_ID \
          --amazonec2-ssh-keypath $AWS_SSH_KEY_PEM_PATH \
          --amazonec2-root-size 20 \
          --swarm --swarm-master \
          --swarm-discovery="consul://$CONSUL_IP:8500" --engine-opt="cluster-store=consul://$CONSUL_IP:8500" --engine-opt="cluster-advertise=eth0:2376" \
          $1 # machine name, passed in as the script's first argument
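
After this step, pointing a client at the swarm master endpoint with the --swarm flag should work (here mycluster-master stands in for the name passed as $1):

# --swarm targets the swarm master endpoint (port 3376) instead of the engine
eval $(docker-machine env --swarm mycluster-master)
docker info   # lists the nodes the master currently sees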

Step 3: Create nodes that join the cluster (same as step 2, except without --swarm-master)

CONSUL_IP=$(docker-machine ip "mycluster-manager")
docker-machine create --driver amazonec2 \
          --amazonec2-region eu-central-1 \
          --amazonec2-vpc-id $AWS_VPC_ID \
          --amazonec2-access-key $AWS_ACCESS_KEY_ID \
          --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
          --amazonec2-instance-type m4.xlarge \
          --amazonec2-subnet-id $AWS_SUBNET_ID \
          --amazonec2-ssh-keypath $AWS_SSH_KEY_PEM_PATH \
          --amazonec2-root-size 40 \
          --swarm \
          --swarm-discovery="consul://$CONSUL_IP:8500" --engine-opt="cluster-store=consul://$CONSUL_IP:8500" --engine-opt="cluster-advertise=eth0:2376" \
          $1
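
To confirm each node has registered in discovery, the swarm image's list command can be pointed at the same Consul URL (a quick check; it prints the ip:2376 entries currently stored):

docker run --rm swarm list consul://$CONSUL_IP:8500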

nathanleclaire commented 8 years ago

Try docker-machine provision manager node0 node1 etc. Does that work?
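
For example, looping over every machine docker-machine knows about (a sketch; ls -q prints just the names):

for machine in $(docker-machine ls -q); do
    docker-machine provision "$machine"
done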

nathanleclaire commented 8 years ago

Actually, come to think of it, we don't properly re-run the Swarm containers if needed. Hm...
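
Until that's fixed, a manual workaround might be to recreate the agent container by hand with the node's current IP. A sketch for a worker node, assuming the swarm-agent container name Machine uses and the example machine names from above:

NODE=mycluster-node1
NEW_IP=$(docker-machine ip "$NODE")
CONSUL_IP=$(docker-machine ip mycluster-manager)

# Replace the agent so it advertises the node's current IP
docker-machine ssh "$NODE" "docker rm -f swarm-agent"
docker-machine ssh "$NODE" "docker run -d --restart=always --name swarm-agent \
    swarm join --advertise $NEW_IP:2376 consul://$CONSUL_IP:8500"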

nathanleclaire commented 8 years ago

I'd like to make provision support this use case. https://github.com/docker/machine/issues/3323 might help.