Mirantis / launchpad

SSH packets get dropped when deploying onto Ubuntu in Azure public cloud #51

Closed 53d117460ec63d70 closed 3 years ago

53d117460ec63d70 commented 3 years ago

When using launchpad to deploy Docker EE onto Ubuntu VMs in Azure public cloud, the installation hangs at the following point:

INFO[0020] ==> Running phase: Install Docker EE Engine on the hosts

After this it is no longer possible to SSH onto the VM. A packet capture on the VM (via serial console) shows that the SSH TCP SYN packets are not being ACKed. Is the Docker EE install adding some firewall or iptables rules that are causing this?

$ launchpad version
version: 1.1.0-beta3
commit: 11f7d21

kke commented 3 years ago

Anything in the debug logs? (Run with launchpad --debug apply or take a look in ~/.mirantis-launchpad/cluster/<CLUSTER_NAME>/install.log.)

kke commented 3 years ago

The previous phase does apt-get install -y -q curl apt-utils socat iputils-ping but I don't see why that would kill the connection.

kke commented 3 years ago

Just a thought that popped into my head: could there be some kind of keepalive requirement in the sshd config? I think it is possible that launchpad does not send SSH keepalives. A bit far-fetched, but in theory it could be the cause.

53d117460ec63d70 commented 3 years ago

With --debug it hangs here:

INFO[0228] x.x.x.4: installing engine (19.03.12)
INFO[0228] x.x.x.4: installing engine (19.03.12)
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'apt-get update -qq'
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'apt-get install -y -qq apt-transport-https ca-certificates curl software-properties-common >/dev/null'
DEBU[0228] x.x.x.5:  curl: (22) The requested URL returned error: 404 Not Found
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'curl -fsSL https://repos.mirantis.com/ubuntu/gpg | apt-key add -qq - >/dev/null'
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'add-apt-repository '\''deb [arch=amd64] https://repos.mirantis.com/ubuntu xenial stable'\'' >/dev/null'
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'apt-get update -qq >/dev/null'
DEBU[0228] x.x.x.5:  + sudo -E sh -c 'apt-get install -y --allow-downgrades -qq docker-ee=5:19.03.12~3-0~ubuntu-xenial docker-ee-cli=5:19.03.12~3-0~ubuntu-xenial'
INFO[0228] x.x.x.5: installing engine (19.03.12)
INFO[0229] x.x.x.4: installing engine (19.03.12)

The apply log has this message repeated:

time="29 Sep 20 14:33 BST" level=error msg="x.x.x.4: failed to install engine -> All attempts fail:\n#1: wait: remote command exited without exit status or exit signal\n#2: read tcp x.x.x.x:52832->x.x.x.4:22: read: connection timed out\

After this error I can no longer connect to the SSH port (22) on the VM. A tcpdump on the VM shows the SYN packets arriving at the VM but not getting ACKed. I think that some part of the docker-ee installation is configuring iptables or firewall rules to drop these packets.
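To pin down when port 22 stops answering, a small probe can be run from the launchpad host while the apply is in progress. This is only a minimal sketch, assuming Python 3 is available on that host; `port_is_open` is a hypothetical helper name and is not part of launchpad.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection performs the full TCP handshake. Dropped SYN
        # packets show up as a timeout, a closed port as a refused connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refusals and unreachable hosts alike.
        return False
```

Calling `port_is_open("x.x.x.4")` every few seconds and logging the result with a timestamp would show roughly which install step coincides with the port going dark. A timeout rather than an immediate refusal would be consistent with the SYN packets being silently dropped.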

kke commented 3 years ago

Do you have any special iptables rules configured in the images?

53d117460ec63d70 commented 3 years ago

No, and this communication only breaks during the launchpad apply. It would be great if there was some example code for Azure that we could test, as it's most likely that we've missed something required for that cloud provider.

kke commented 3 years ago

In PR #53 there's a Terraform config that seems to work on Azure, except that Windows machines need to be rebooted after engine install, and that reboot is not yet included in any released version of launchpad.

kke commented 3 years ago

Was this resolved?

53d117460ec63d70 commented 3 years ago

I will test with the azure example and open a new ticket if the issue reoccurs. Thanks.