UCSC-Treehouse / pipelines

Makefiles to run dockerized pipelines used in Treehouse on a single sample
Apache License 2.0

Unable to communicate over ssh with more than 6 machines #5

Closed rcurrie closed 6 years ago

rcurrie commented 6 years ago

fab up:count=6 creates the machines, and communication with them, whether serial or parallel, works as expected. Creating a 7th machine succeeds, but ssh communication fails with:

[10.50.102.136] Login password for 'ubuntu':

Deleting any one of the machines restores communication; adding another breaks it again. Careful inspection of the env configuration shows it to be correct. Communicating with the 7th (or any) machine via docker-machine ssh also works.

rcurrie commented 6 years ago

Examining the paramiko log (the library implementing ssh communication for fabric) via:

import logging
logging.basicConfig(level=logging.DEBUG)

showed:

DEBUG:paramiko.transport:Trying discovered key 22d06f930ce795b9496eee4f1260fb55 in /pod/home/rcurrie/.docker/machine/machines/rcurrie-treeshop-20171221-200115/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) failed.
DEBUG:paramiko.transport:Trying discovered key 20ba2308005c079e3b0b896c485da2cb in /pod/home/rcurrie/.docker/machine/machines/rcurrie-treeshop-20171221-200429/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Disconnect (code 2): Too many authentication failures
DEBUG:paramiko.transport:Trying discovered key ca145ca51dbfd922c0a8d7012fe40c9c in /pod/home/rcurrie/.docker/machine

It appears that paramiko (or openssh) doesn't use the designated ssh key for each docker-machine host, but instead runs through the entire list of keys. We allocate a different key per machine because of this bug: https://github.com/docker/machine/issues/3261, in which docker-machine deletes an existing key if one is used.

Why 7 machines? sshd rejects a connection after 6 failed authentication attempts because the default value of its MaxAuthTries setting is 6. Earlier machines in the list are fine because paramiko reaches their key within the first 6 attempts.
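To make the arithmetic concrete, here is a toy model of that behaviour. It is only a sketch: it assumes the client offers keys strictly in list order, that each host accepts exactly one key (its own), and that every offered key counts as one attempt. The function name can_authenticate is made up for illustration and is not part of paramiko or fabric.

```python
# Toy model of the failure: the client offers keys in list order and the
# server disconnects once MAX_AUTH_TRIES attempts have been used up.
MAX_AUTH_TRIES = 6  # default value of MaxAuthTries in OpenSSH's sshd


def can_authenticate(host_index, key_count, max_auth_tries=MAX_AUTH_TRIES):
    """Return True if the key for host number host_index (0-based) is
    offered before the server gives up.

    Assumes key i is the only key accepted by host i (hypothetical model).
    """
    for attempt, key_index in enumerate(range(key_count), start=1):
        if attempt > max_auth_tries:
            return False  # server has already disconnected
        if key_index == host_index:
            return True
    return False


# Seven machines, seven keys: only the last host is out of reach.
results = [can_authenticate(i, key_count=7) for i in range(7)]
print(results)
```

Under these assumptions the first six hosts authenticate and the seventh does not, which matches what was observed.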

We either need to figure out how to tell paramiko exactly which key to use for each host, or inject a single shared key into every machine and place it early in the list so that authentication succeeds before the 6-attempt limit.
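For the first option, a rough sketch of what pinning one key per host could look like when calling paramiko directly. This is untested against this setup and is not how fabric dispatches connections internally. key_filename, look_for_keys and allow_agent are real keyword arguments of paramiko.SSHClient.connect; the helper names machine_key_path and connect_kwargs are invented here, and the key location simply follows the layout seen in the log above.

```python
import os


def machine_key_path(machine_name, home="~"):
    """Path where docker-machine keeps the private key for one machine
    (layout taken from the paramiko log above)."""
    return os.path.join(
        os.path.expanduser(home),
        ".docker", "machine", "machines", machine_name, "id_rsa",
    )


def connect_kwargs(host, machine_name, user="ubuntu", home="~"):
    """Keyword arguments for paramiko.SSHClient.connect, pinned to one key."""
    return {
        "hostname": host,
        "username": user,
        "key_filename": machine_key_path(machine_name, home=home),
        "look_for_keys": False,  # do not search ~/.ssh for extra keys
        "allow_agent": False,    # do not offer keys held by an ssh-agent
    }


kwargs = connect_kwargs("10.50.102.136", "rcurrie-treeshop-20171221-200115")
# With paramiko installed, these would be passed to an SSHClient:
#   client.connect(**kwargs)
```

With only one key offered per connection, the number of machines should no longer matter for MaxAuthTries.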

rcurrie commented 6 years ago

Resolved by adding a single public key (~/.ssh/id_rsa.pub) to each machine after it's created and setting that key as the only entry in env.key_filename. This way, even though openstack still has a key per machine, paramiko only tries a single one. One caveat: the openstack docker-machine driver also fails to delete the floating IPs, so you need to delete them manually after you spin down a large cluster. But everything now runs much faster, as we no longer have to fail through many ssh authentication attempts on every single call.
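For anyone hitting the same thing, a minimal sketch of the idea. It is not the actual fabfile change: the helper authorize_key_command is hypothetical, and it only builds the docker-machine ssh command rather than running it. The public key text in the usage line is a placeholder, not a real key.

```python
import os
import shlex


def authorize_key_command(machine_name, public_key_text):
    """Build a docker-machine ssh command that appends one public key to
    the remote user's authorized_keys file (hypothetical helper)."""
    remote = "echo {} >> ~/.ssh/authorized_keys".format(
        shlex.quote(public_key_text.strip())
    )
    return ["docker-machine", "ssh", machine_name, remote]


# Shared private key; in a Fabric 1.x fabfile this would be assigned as
#   env.key_filename = shared_key
shared_key = os.path.expanduser("~/.ssh/id_rsa")

# Placeholder key text for illustration only.
cmd = authorize_key_command(
    "rcurrie-treeshop-20171221-200115", "ssh-rsa AAAA placeholder\n"
)
print(cmd)
```

The command list could be handed to subprocess.check_call once per machine right after creation, so that every later fabric call authenticates on its first key.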