Closed 53d117460ec63d70 closed 3 years ago
It appears the description of this issue is slightly incorrect: it's not about adding more DTR nodes, but about adding worker nodes when a DTR node is present.
Once a DTR node has been added, adding a new node of any type (master/worker/DTR) fails. If no DTR node is configured, masters and workers can be scaled up and down without issue.
I just tested this with 1 manager + 1 worker + 1 DTR and then added a second worker node; no problem occurred there. This was with the latest beta. 🤔
The issue occurs when an HTTP proxy is configured. This can be seen in the --debug output:
INFO[0107] ==> Running phase: Validating UCP Health
DEBU[0107] x.x.x.136: is the swarm leader
INFO[0107] x.x.x.136: waiting for UCP to become healthy
DEBU[0107] x.x.x.136: requesting https://localhost/_ping
DEBU[0108] x.x.x.136: response code: 200, expected 200
DEBU[0108] analytics disabled, not tracking event 'Validating UCP Health'
DEBU[0108] preparing phase 'Install DTR components'
INFO[0108] ==> Running phase: Install DTR components
DEBU[0108] x.x.x.34: found DTR installed, using as leader
INFO[0108] x.x.x.34: waiting for UCP at x.x.x.136 to become healthy
DEBU[0108] x.x.x.34: requesting https://x.x.x.136/_ping
DEBU[0184] x.x.x.34: response code: 503, expected 200
The UCP health check from the manager machine (x.x.x.136) targets localhost, so localhost can be added to the no_proxy environment variable, allowing this request to succeed. The same approach cannot be used for the UCP health check that is initiated from a DTR machine (x.x.x.34). If the IPs aren't known ahead of time (when using DHCP), we can't add anything meaningful to no_proxy to prevent the UCP health check from the DTR machine being proxied.
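To make the no_proxy limitation concrete, here is a minimal sketch of how a client decides whether to bypass the proxy. It roughly follows curl's documented rule (an entry matches the host itself or any host inside that domain, and "*" matches everything), but it is a simplification under stated assumptions: CIDR ranges and port-qualified entries are not modelled, and the function names and example hosts (10.0.0.136, cluster.internal) are hypothetical.

```python
import ipaddress


def _is_ip(host: str) -> bool:
    """True if `host` is a literal IPv4/IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified no_proxy check (assumption: names and domain suffixes only).

    CIDR ranges and port-qualified entries are deliberately not modelled.
    """
    host = host.lower().rstrip(".")
    for raw in no_proxy.split(","):
        if raw.strip() == "*":
            return True  # "*" disables the proxy for every host
        entry = raw.strip().lower().strip(".")
        if not entry:
            continue
        if host == entry:
            return True  # exact host name or exact IP match
        if not _is_ip(host) and host.endswith("." + entry):
            return True  # domain-suffix match: "example.com" covers "ucp.example.com"
    return False
```

With no_proxy="localhost,127.0.0.1" the manager's own localhost check bypasses the proxy, but a check against a DHCP-assigned address such as 10.0.0.136 does not, because that address can't be listed in advance. A DNS domain suffix (for example ".cluster.internal") would cover every host under it regardless of its IP.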
EDIT: The UCP health check does succeed when the DTR is being installed:
INFO[0472] ==> Running phase: Validating UCP Health
DEBU[0472] x.x.x.136: is the swarm leader
INFO[0472] x.x.x.136: waiting for UCP to become healthy
DEBU[0472] x.x.x.136: requesting https://localhost/_ping
DEBU[0473] x.x.x.136: response code: 200, expected 200
DEBU[0473] analytics disabled, not tracking event 'Validating UCP Health'
DEBU[0473] preparing phase 'Install DTR components'
INFO[0473] ==> Running phase: Install DTR components
DEBU[0473] did not find a DTR installation, falling back to the first DTR host
INFO[0473] x.x.x.34: waiting for UCP at x.x.x.136 to become healthy
DEBU[0473] x.x.x.34: requesting https://x.x.x.136/_ping
DEBU[0474] x.x.x.34: response code: 200, expected 200
DEBU[0474] Configuring DTR replica ids to be sequential
INFO[0476] x.x.x.34: INFO[0000] Beginning Docker Trusted Registry installation
Note the line "DEBU[0473] did not find a DTR installation, falling back to the first DTR host". It seems there is a difference in UCP health check behaviour after a DTR node is installed?
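The two debug lines ("found DTR installed, using as leader" versus "did not find a DTR installation, falling back to the first DTR host") suggest a leader-selection step roughly like the sketch below. This only mirrors the log messages; it is not launchpad's source, and DtrHost, dtr_installed and pick_dtr_leader are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DtrHost:
    address: str
    dtr_installed: bool  # hypothetical: does this host already run a DTR replica?


def pick_dtr_leader(hosts: List[DtrHost]) -> Optional[DtrHost]:
    """Choose the host that runs the remote UCP health check (assumed logic)."""
    for host in hosts:
        if host.dtr_installed:
            return host  # "found DTR installed, using as leader"
    # "did not find a DTR installation, falling back to the first DTR host"
    return hosts[0] if hosts else None
```

In both logs above the same host, x.x.x.34, ends up as the leader, so the choice of host alone does not explain why one run's check is proxied and the other's is not.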
That's an excellent analysis of the problem 👍
This is how the remote UCP health check in 1.1.0-beta4 figures out which address to use:
1. If --ucp-url is present in dtr: installFlags:, use that.
2. If --san is present in ucp: installFlags:, use that.
In the YAML files you've sent to @jas-atwal there does not seem to be a --ucp-url set, so the same logic is used to generate one. Setting it manually to an address that is accessible from the DTR nodes could perhaps solve this problem.
Should --ucp-url always be the public address, or will it work with the internal one?
Looks like we are using scenario 3. Currently I'm using DHCP for VM addressing and not setting up any DNS records. If I set up DNS records, I could add the domain to the no_proxy environment variable, and with scenario 1 or 2 this would no longer be an issue. The UCP URL will always be internal; setting an HTTP proxy is only necessary for the Docker EE installation, so communication between nodes should never get proxied.
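The address-selection order from the earlier reply can be sketched as follows. The flag parsing and function names are hypothetical, and the final fallback to the UCP leader's own address is an assumption based on the log line "x.x.x.34: requesting https://x.x.x.136/_ping", where x.x.x.136 is the swarm leader; the thread does not spell out scenario 3.

```python
from typing import List, Optional


def flag_value(flags: List[str], name: str) -> Optional[str]:
    """Value of `name` from flags given as "--flag=value" or "--flag value"."""
    for i, flag in enumerate(flags):
        if flag.startswith(name + "="):
            return flag.split("=", 1)[1]
        if flag == name and i + 1 < len(flags):
            return flags[i + 1]
    return None


def remote_ucp_ping_url(dtr_flags: List[str], ucp_flags: List[str], ucp_leader: str) -> str:
    """Hypothetical sketch of how the remote health-check URL is chosen."""
    base = flag_value(dtr_flags, "--ucp-url")  # scenario 1: dtr installFlags
    if not base:
        base = flag_value(ucp_flags, "--san")  # scenario 2: ucp installFlags
    if not base:
        base = ucp_leader  # assumed fallback (scenario 3): leader's address
    if "://" not in base:
        base = "https://" + base
    return base.rstrip("/") + "/_ping"
```

In scenario 3 with DHCP and no DNS, the result is an IP that can't be put into no_proxy ahead of time; with scenario 1 or 2 pointing at a name under a domain listed in no_proxy, the check would bypass the proxy.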
The odd thing is that the UCP health check from the DTR node doesn't get proxied when the DTR is installed. It is only when I scale the cluster that the UCP health check from the DTR node does get proxied. Do these different scenarios use the same code for the UCP health check? It's as if the HTTP proxy setting is not used during install but is used during scaling.
When DTR is involved, whether or not it's the first apply, UCP health is checked from the DTR leader node against the UCP leader node to validate that DTR can connect to UCP. This is done even when the DTR node isn't going to be touched in any other way.
When UCP is already installed and the Docker Engine on a manager node is upgraded, a health check from the upgraded host to its own localhost is performed once the upgrade has finished, to validate that the UCP API still works after the engine upgrade. This is done regardless of whether there are DTR nodes.
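The two rules above can be summarised as a small planning function. It is a sketch of the described behaviour, not launchpad code: planned_health_checks and its parameters are hypothetical, and the order of the returned checks is illustrative only.

```python
from typing import List, Optional, Sequence, Tuple


def planned_health_checks(
    ucp_leader: str,
    dtr_leader: Optional[str] = None,
    upgraded_managers: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """Return (host running the check, target) pairs; order is illustrative only."""
    checks: List[Tuple[str, str]] = []
    for manager in upgraded_managers:
        # Local check: each upgraded manager pings its own UCP API.
        checks.append((manager, "localhost"))
    if dtr_leader is not None:
        # Remote check: runs on every apply whenever DTR hosts exist,
        # even if the DTR nodes themselves are not being changed.
        checks.append((dtr_leader, ucp_leader))
    return checks
```

This is why scaling workers on a cluster with DTR still triggers a request from the DTR leader to the UCP leader, and therefore why that request's proxy handling matters even when DTR is untouched.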
Both health checks run curl -kso /dev/null -w "%%{http_code}" $url on the host. In the remote check the URL is built as documented in the previous response; in the local check it's https://localhost[:$controller_port]/_ping.
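A minimal sketch of the local URL and the curl invocation described above, with the command runner injectable so it can be exercised without a cluster. The doubled "%%" in the quoted command looks like escaping for a format string in the tool's own source, so this sketch assumes curl itself receives -w "%{http_code}". All function names are hypothetical.

```python
import subprocess
from typing import Callable, List, Optional


def local_ping_url(controller_port: Optional[int] = None) -> str:
    """https://localhost[:$controller_port]/_ping"""
    port = f":{controller_port}" if controller_port else ""
    return f"https://localhost{port}/_ping"


def curl_status_argv(url: str) -> List[str]:
    # -k: skip TLS verification, -s: silent, -o /dev/null: discard the body,
    # -w "%{http_code}": print only the HTTP status code
    return ["curl", "-kso", "/dev/null", "-w", "%{http_code}", url]


def _run_curl(argv: List[str]) -> str:
    # curl inherits this process's environment, including any
    # http_proxy / https_proxy / no_proxy variables.
    return subprocess.run(argv, capture_output=True, text=True).stdout


def is_healthy(url: str, run: Optional[Callable[[List[str]], str]] = None) -> bool:
    """True if curl reports HTTP 200 ("expected 200" in the logs)."""
    output = (run or _run_curl)(curl_status_argv(url))
    return output.strip() == "200"
```

Because curl picks up proxy settings from its environment, the same command can be proxied on one host and not on another, depending on how http_proxy, https_proxy and no_proxy are set there.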
When an additional DTR node is added to a Docker EE cluster already provisioned with DTR, launchpad apply fails. From the log:
I can see the node is added in the UCP console, but its type is "kubernetes", whereas the original DTR node's type is "mixed".