docker-archive / for-aws


New node on a D4AWS swarm fails to see a cloudstor persistent volume #131

Closed hierozbosch closed 6 years ago

hierozbosch commented 6 years ago

Expected behavior

I have been running a d4aws swarm with 3 managers, 2 workers, plus 1 additional db worker node. This swarm has an ebs-backed cloudstor volume. The shared volume contains the data used by the db worker.

I had to burn the db worker because somebody hacked redis on it. So I created a change set for the security groups and updated the AWS CloudFormation stack; a new db instance was created and joined to the swarm. When I deployed my Docker stack on that new setup I expected the new db node to mount the cloudstor persistent volume, but it did not.

Actual behavior

All nodes are listed as Ready, including the newly created db worker (ip-172-31-8-16):

~ $ docker node ls | grep Ready
568swgemyf4jozvv2avigvimo *   ip-172-31-6-249.us-west-2.compute.internal    Ready   Active   Leader
h7xxkmbrzmjznyavp9g9s34a8     ip-172-31-1-93.us-west-2.compute.internal     Ready   Active
jzgj9rmly5xeilgtjrk44l7cc     ip-172-31-46-144.us-west-2.compute.internal   Ready   Active
s13jy7tdhutntzsatiu8w0u3p     ip-172-31-42-24.us-west-2.compute.internal    Ready   Active   Reachable
u5xj82qtlezjx7zyliz6zgfop     ip-172-31-8-16.us-west-2.compute.internal     Ready   Active
ze8sx7hr9vhzkloh4zczpfd90     ip-172-31-16-173.us-west-2.compute.internal   Ready   Active   Reachable

Apart from the missing volume, the db worker functions normally.

All manager and worker nodes recognize the cloudstor volume:

~ $ docker volume ls
DRIVER              VOLUME NAME
cloudstor:aws       gc01_data

But the newly created db worker does not mount the shared cloudstor volume; instead, it creates a new local volume using the name specified in the compose file and populates it with a blank database:

~ $ docker volume ls
DRIVER              VOLUME NAME
local               4a865743eb1fce0eb2953025fd412f01b79edff20853f24895fa384f2137672a
local               gc01_data

On the original build, the cloudstor volume was created from the db worker node using a one-time image specially built to set up and populate that volume. The db node recognized the shared volume and mounted it through many deployments using the following compose syntax:

version: "3.2"
services:
  neo4j:
    image: neo4j:3.2.1
    environment:
      - HOME=/root
      - NEO4J_AUTH=none
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - gc01_data:/data
    networks:
      - backend
    deploy:
      replicas: 1
      placement:
        constraints: [engine.labels.node_task == neo4j]
      restart_policy:
        condition: on-failure

volumes:
  gc01_data:
    external: true
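
For what it's worth, an alternative to declaring the volume as external is to let the stack create it with the cloudstor driver itself, so whichever node the task lands on asks cloudstor for the volume instead of falling back to a freshly created local one. A sketch only; the driver_opts names follow the cloudstor plugin's documented EBS options, and the values here are illustrative:

```yaml
volumes:
  gc01_data:
    driver: "cloudstor:aws"
    driver_opts:
      # EBS-backed volume; type and size are example values
      ebstype: gp2
      size: "25"
```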

I was hoping the new node would mount that shared volume with the most current data set. Otherwise I can copy the current dataset from a manager node and regenerate the volume, but that's a pain.

Thanks

hierozbosch commented 6 years ago

I got this working. On the manager I shut the stack down via docker stack rm gc01_stack. Then I went onto the database worker node and removed the local volume that had been created via docker volume rm gc01_data. (I should say I manually backed up my cloudstor volume of the same name before pulling that trigger.) Anyway, as soon as I got rid of the local volume, the cloudstor volume popped up:

$ docker volume ls
DRIVER              VOLUME NAME
cloudstor:aws       gc01_data

I re-deployed the stack from the manager node and all is good. Why it made a local volume, I don't know.
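
For anyone hitting the same thing, the recovery sequence above as a sketch (stack and volume names are the ones from this thread; the compose file name is an assumption):

```shell
# On a manager: remove the stack so nothing holds the volume open
docker stack rm gc01_stack

# On the affected worker: drop the stale local volume that shadows
# the cloudstor one (back up its contents first if you need them)
docker volume rm gc01_data

# Still on that worker: the cloudstor volume should now be listed
docker volume ls

# Back on a manager: redeploy (compose file name assumed)
docker stack deploy -c docker-compose.yml gc01_stack
```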

ddebroy commented 6 years ago

You can check the Docker engine logs on the node where you saw the problem for any errors from cloudstor. It may be that when enumerating volumes, the AWS API call failed for some reason and cloudstor was not able to report the volume as already present.
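
One way to pull those logs, assuming the node exposes the engine logs through systemd (the exact mechanism depends on the Docker for AWS AMI in use):

```shell
# Look for cloudstor errors around the time of the failed deploy
journalctl -u docker.service --since "1 hour ago" | grep -i cloudstor
```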

hierozbosch commented 6 years ago

Haven't seen this again... problem was resolved as described, so I'm going to close.