docker-archive / for-azure

Cannot SSH into node after VM restart - no agent container #65

Open vesylapp opened 6 years ago

vesylapp commented 6 years ago

Expected behavior

Node should be accessible via SSH after VM restart

Actual behavior

Node is not accessible via SSH after VM restart

swarm-manager000000:~$ ssh swarm-manager000002

ssh: connect to host swarm-manager000002 port 22: Connection refused

Information

swarm-manager000000:~$ docker-diagnose
OK hostname=swarm-manager000000 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
OK hostname=swarm-manager000001 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
OK hostname=swarm-manager000002 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
OK hostname=swarm-worker000000 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
OK hostname=swarm-worker000001 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
OK hostname=swarm-worker000002 session=1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
Done requesting diagnostics.
Your diagnostics session ID is 1525372578-WTs5wJ17TPj4xeSY6hyt8strirowLuoR
Please provide this session ID to the maintainer debugging your issue.

Steps to reproduce the behavior

  1. Go to https://docs.docker.com/docker-for-azure/
  2. Create a swarm (stable channel)
  3. Attempt to SSH into one of the nodes - works OK
  4. Restart that node VM from the Azure portal
  5. Attempt to SSH into the restarted node - fails

FrenchBen commented 6 years ago

@fslDev Can you look at the boot logs from the VM? Is there any information there that helps? Did the machine join the cluster? If so, you can always target that machine and deploy another SSH container to use.
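One way to read "target that machine" is a swarm service pinned to the node with a placement constraint. The following is only a sketch under that assumption: the helper name `deploy_ssh_on_node`, the image name, and host port 2222 are hypothetical placeholders, and nothing in this thread confirms that this is how the Docker for Azure agent container is meant to be started.

```shell
# Hypothetical helper: run a one-replica service on a single swarm node.
# Arguments: node hostname, an image that runs sshd (placeholder), host port.
deploy_ssh_on_node() {
  node="$1"
  image="$2"
  port="${3:-2222}"
  # The constraint restricts scheduling to the named node; mode=host
  # publishes the port directly on that node instead of the routing mesh.
  docker service create \
    --name "ssh-$node" \
    --replicas 1 \
    --constraint "node.hostname==$node" \
    --publish "mode=host,target=22,published=$port" \
    "$image"
}

# Example, run from a manager (the image name is a placeholder):
# deploy_ssh_on_node swarm-manager000002 example/sshd:latest 2222
```

Publishing on a port other than 22 avoids colliding with anything that may still claim port 22 on the host.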

vesylapp commented 6 years ago

@FrenchBen

did the machine join the cluster?

Yes.

If so, you can always target that machine and deploy another ssh container, that you can use.

I tried several times but was unable to get another agent container to run correctly. Do you have a docker run incantation that works? I can't find any documentation on how to properly start the agent container.

Without setting a bunch of binds and/or volumes, the container just exits abnormally. I tried to duplicate the env and binds/volumes based on a working agent container and that resulted in a container that appears to run somewhat correctly (sshd starts) but still will not accept incoming SSH connections for some reason.
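For reference, this is roughly how the settings of a working agent container can be dumped for comparison. It is a sketch: the helper name `dump_container_config` and the container name in the example are hypothetical. Port bindings and network mode are included because they also affect how a container is reached, though nothing here confirms they are the cause.

```shell
# Hypothetical helper: print the settings of a container that are most
# likely to matter when recreating it by hand, one item per line.
dump_container_config() {
  container="$1"
  docker inspect --format '{{json .Config.Env}}' "$container"            # environment variables
  docker inspect --format '{{json .HostConfig.Binds}}' "$container"      # bind mounts
  docker inspect --format '{{json .HostConfig.PortBindings}}' "$container" # published ports
  docker inspect --format '{{.HostConfig.NetworkMode}}' "$container"     # network mode
}

# Example: compare a working container with a hand-started copy
# (both container names are placeholders):
# dump_container_config working-agent > working.txt
# dump_container_config my-agent > mine.txt
# diff working.txt mine.txt
```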

Any information there that helps?

Before the restart, the boot log is 2704 lines long. After the restart, it stops at line 463. It also contains an error that did not appear before the restart: /lib/rc/sh/openrc-run.sh: line 250: can't create /sys/fs/cgroup/openrc/diagnostics-server/tasks: nonexistent directory

Here is the last part of the boot log after the restart.

 * Starting DHCP Client Daemon ... [ ok ]
/lib/rc/sh/openrc-run.sh: line 250: can't create /sys/fs/cgroup/openrc/diagnostics-server/tasks: nonexistent directory
 * Starting diagnostics server ... [ ok ]
 * Starting networking ... *   lo ... [ ok ]
 * Initializing random number generator ... [ ok ]
 * Starting busybox acpid ... [ ok ]
 * Running system containerd ... [ ok ]
 * Running system containers ... * [ ok ]
 * Configuring host settings from database ... [ ok ]
 * Starting Docker ...   
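The comparison between the two boot logs can be sketched as follows, assuming both logs have been saved to local files (the file names and helper names are hypothetical). It matches whole lines exactly, so it would be less useful if the log lines carried timestamps.

```shell
# Print lines of the post-restart log that are absent from the pre-restart log.
# -F: treat patterns as fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from a file.
new_boot_log_lines() {
  before="$1"
  after="$2"
  grep -F -x -v -f "$before" "$after"
}

# Print the last line of a log, i.e. the point where boot output stopped.
last_boot_log_line() {
  tail -n 1 "$1"
}

# Example (file names are placeholders):
# new_boot_log_lines bootlog-before.txt bootlog-after.txt
# last_boot_log_line bootlog-after.txt
```

The `-F` flag matters here because boot log lines contain `[ ok ]` and `*`, which would otherwise be interpreted as regular expression syntax.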

vesylapp commented 6 years ago

Any update on this?