Mirantis / launchpad

launchpad apply fails when adding DTR node #50

Closed 53d117460ec63d70 closed 3 years ago

53d117460ec63d70 commented 3 years ago

When an additional DTR node is added to a Docker EE cluster already provisioned with DTR, launchpad apply fails. From the log:

time="16 Sep 20 14:04 BST" level=debug msg="UCP health check response code: 503, expected 200"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m20s"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m25s"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m30s"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m35s"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m40s"
time="16 Sep 20 14:04 BST" level=info msg="Performing health check against UCP: x.x.x.x:443; elapsed: 8m45s"
time="16 Sep 20 14:04 BST" level=debug msg="tracking analytics event 'Validating UCP Health'"
time="16 Sep 20 14:04 BST" level=info msg="See /home/blah/.mirantis-launchpad/cluster/launchpad-de/apply.log for more logs "
time="16 Sep 20 14:04 BST" level=debug msg="tracking analytics event 'Cluster Apply Failed'"
time="16 Sep 20 14:04 BST" level=fatal msg="failed to determine health of UCP: polling failed with 5 attempts 30s apart: unexpected response code"

I can see the node is added in the UCP console, but its type is "kubernetes" whereas the original DTR node's type is "mixed".

kke commented 3 years ago

It appears the description of this issue is slightly incorrect: it's not about adding more DTR nodes, but about adding worker nodes when a DTR node is present.

53d117460ec63d70 commented 3 years ago

Once a DTR node is added, the addition of a new node of any type (master/worker/DTR) fails. If no DTR node is configured, masters and workers can be scaled up and down without issue.

kke commented 3 years ago

I just tested this with 1 manager + 1 worker + 1 DTR and then added a second worker node; no problem occurred there. This was with the latest beta. 🤔

53d117460ec63d70 commented 3 years ago

The issue occurs when an HTTP proxy is configured. This can be seen in the --debug output:

INFO[0107] ==> Running phase: Validating UCP Health
DEBU[0107] x.x.x.136: is the swarm leader
INFO[0107] x.x.x.136: waiting for UCP to become healthy
DEBU[0107] x.x.x.136: requesting https://localhost/_ping
DEBU[0108] x.x.x.136: response code: 200, expected 200
DEBU[0108] analytics disabled, not tracking event 'Validating UCP Health'
DEBU[0108] preparing phase 'Install DTR components'
INFO[0108] ==> Running phase: Install DTR components
DEBU[0108] x.x.x.34: found DTR installed, using as leader
INFO[0108] x.x.x.34: waiting for UCP at x.x.x.136 to become healthy
DEBU[0108] x.x.x.34: requesting https://x.x.x.136/_ping
DEBU[0184] x.x.x.34: response code: 503, expected 200

The UCP health check targets localhost from the manager machine (x.x.x.136), so localhost can be added to the no_proxy environment variable, allowing this request to succeed. The same approach cannot be used for the UCP health check that is initiated from the DTR machine (x.x.x.34). If the IPs aren't known ahead of time (when using DHCP), we can't add anything meaningful to no_proxy to prevent the UCP health check from the DTR machine from being proxied.
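For illustration, when the addresses are static the exemption can be expressed per host. A minimal sketch, assuming launchpad's host-level environment: key and reusing the anonymized addresses from the logs (the proxy address is made up):

hosts:
  - address: x.x.x.34          # DTR host, as in the logs above
    role: dtr
    environment:
      http_proxy: http://proxy.example.com:3128    # assumed proxy address
      no_proxy: localhost,127.0.0.1,x.x.x.136      # exempt the UCP manager from proxying

With DHCP, x.x.x.136 isn't known when this file is written, which is exactly the gap described above.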

EDIT: The UCP health check does succeed when the DTR is being installed:

INFO[0472] ==> Running phase: Validating UCP Health
DEBU[0472] x.x.x.136: is the swarm leader
INFO[0472] x.x.x.136: waiting for UCP to become healthy
DEBU[0472] x.x.x.136: requesting https://localhost/_ping
DEBU[0473] x.x.x.136: response code: 200, expected 200
DEBU[0473] analytics disabled, not tracking event 'Validating UCP Health'
DEBU[0473] preparing phase 'Install DTR components'
INFO[0473] ==> Running phase: Install DTR components
DEBU[0473] did not find a DTR installation, falling back to the first DTR host
INFO[0473] x.x.x.34: waiting for UCP at x.x.x.136 to become healthy
DEBU[0473] x.x.x.34: requesting https://x.x.x.136/_ping
DEBU[0474] x.x.x.34: response code: 200, expected 200
DEBU[0474] Configuring DTR replica ids to be sequential
INFO[0476] x.x.x.34:  INFO[0000] Beginning Docker Trusted Registry installation

Note the line "DEBU[0473] did not find a DTR installation, falling back to the first DTR host". It seems that there is a difference in UCP healthcheck behaviour after a DTR node is installed?

kke commented 3 years ago

That's an excellent analysis of the problem 👍

This is how the remote UCP health check in 1.1.0-beta4 figures out which address to use:

  1. If you have set --ucp-url in dtr: installFlags:, use that
  2. If you have set (or launchpad has generated) a --san in ucp: installFlags:, use that
  3. If those fail, use the first manager's public address

In the YAMLs you've sent to @jas-atwal there does not seem to be a --ucp-url set, so the same logic is used to generate one. Setting it manually to an address that is accessible from the DTR nodes could perhaps solve this problem (a rough sketch below). Should --ucp-url always be the public address, or will it work with the internal one?
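To make that concrete, a minimal launchpad.yaml fragment with --ucp-url pinned manually; the address is hypothetical and stands in for a UCP manager reachable from the DTR nodes:

spec:
  dtr:
    installFlags:
      - --ucp-url=https://x.x.x.136   # scenario 1: explicit URL, must be reachable un-proxied from the DTR hosts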

53d117460ec63d70 commented 3 years ago

Looks like we are hitting scenario 3. Currently I'm using DHCP for the VM addressing and not setting up any DNS records. If I set up DNS records, I could add the domain to the no_proxy environment variable, and with scenarios 1 and 2 this would no longer be an issue (sketched below). The UCP URL will always be internal; setting an HTTP proxy is only necessary for the Docker EE installation, so communication between nodes should never get proxied.
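A rough sketch of that DNS-based variant, assuming a hypothetical internal domain and the same host-level environment: key as in the earlier sketch:

spec:
  ucp:
    installFlags:
      - --san=ucp.internal.example        # scenario 2: DNS name for the UCP managers
  hosts:
    - address: dtr1.internal.example      # hypothetical DNS record for the DTR host
      role: dtr
      environment:
        no_proxy: localhost,127.0.0.1,.internal.example   # the whole internal domain bypasses the proxy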

The odd thing is that the UCP health check from the DTR node doesn't get proxied when DTR is installed; it is only when I scale the cluster that it does. Do these different scenarios use the same code for the UCP health check? It's as if the HTTP proxy setting is not used during install but is used during scaling.

kke commented 3 years ago

When DTR is involved, no matter if it's the first apply or not, UCP health will be checked from the DTR leader node against the UCP leader node to validate that DTR can connect to UCP. This is done even when the DTR node isn't going to be touched in any other way.

When UCP is already installed and the Docker Engine on a manager node is to be upgraded, a healthcheck from the upgraded host to its own localhost is performed after the upgrade to validate that the UCP API still works. This is done regardless of whether there are DTR nodes.

Both of the healthchecks run curl -kso /dev/null -w "%{http_code}" $url on the host. In the remote check the URL is built as described in the previous response; in the local check it's https://localhost[:$controller_port]/_ping.
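For reference, the two invocations under those rules look like this (addresses taken from the anonymized logs above; the optional controller port is omitted):

# local check, run on the UCP manager itself
curl -kso /dev/null -w "%{http_code}" https://localhost/_ping

# remote check, run on the DTR leader against the resolved UCP address
curl -kso /dev/null -w "%{http_code}" https://x.x.x.136/_ping

An HTTP proxy would intercept the second form unless x.x.x.136 (or its DNS name) is listed in no_proxy, which would explain the 503 seen in the debug output earlier in the thread.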