Status: Open. knweiss opened this issue 6 years ago.
Another hint regarding 2): I see the HTTP_PROXY
variable in the environment of the containerd-shim
process but not in the environment of its tectonic-torcx-hook-pre
child process:
k8s-worker1 ~ # ps axuwf|grep -B 1 /tectonic-torcx-hook-pre
root 41422 0.0 0.0 412924 4924 ? Sl 15:27 0:00 \_ /run/torcx/bin/containerd-shim c12a58976b13455f412787eb9f03142f0eb143ea11fea06081091ec866805053 /var/run/docker/libcontainerd/c12a58976b13455f412787eb9f03142f0eb143ea11fea06081091ec866805053 docker-runc
root 41438 0.0 0.0 38620 22352 ? Ssl 15:27 0:00 \_ /tectonic-torcx-hook-pre --verbose=debug --node-annotation=container-linux-update.v1.coreos.com/tectonic-torcx-pre-hook-ok --sleep=604800
k8s-worker1 ~ # cat /proc/41422/environ|grep -q HTTP_PROXY && echo found
found
k8s-worker1 ~ # cat /proc/41438/environ|grep -q HTTP_PROXY && echo found
k8s-worker1 ~ #
I found a workaround for this issue: the torcx-pre-hook pods are started by a DaemonSet in the tectonic-system namespace:
$ kubectl get daemonset --namespace=tectonic-system|grep torcx-pre-hook
tectonic-torcx-pre-hook 0 0 0 0 0 container-linux-update.v1.coreos.com/before-reboot=true 25d
This DaemonSet uses the OnDelete update strategy, so I did the following:
$ kubectl edit daemonset --namespace=tectonic-system tectonic-torcx-pre-hook
In the editor I added the following six proxy env vars to the environment section (NODE was already present):
env:
- name: NODE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: spec.nodeName
- name: http_proxy
  value: http://192.168.144.2:8118
- name: HTTP_PROXY
  value: http://192.168.144.2:8118
- name: https_proxy
  value: http://192.168.144.2:8118
- name: HTTPS_PROXY
  value: http://192.168.144.2:8118
- name: no_proxy
  value: 127.0.0.1,.localdomain
- name: NO_PROXY
  value: 127.0.0.1,.localdomain
image: quay.io/coreos/tectonic-torcx:v0.1.2
The next step was to delete the hanging pod.
$ kubectl delete pods --namespace=tectonic-system tectonic-torcx-pre-hook-55m9b
After this, the hanging tectonic-torcx-pre-hook pod is recreated with the required web proxy environment!
At this point the Pending Reboot Status started to get resolved and the Kubernetes nodes rebooted successfully one after the other.
The same fix applies to the tectonic-channel-operator
container which is used to query the Tectonic Omaha update server for the latest Tectonic release:
$ kubectl logs --namespace=tectonic-system tectonic-channel-operator-1030110693-dtc2g |tail -n 2
E1219 11:50:04.976644 1 main.go:135] Failed to get TectonicVersion from CoreUpdate: omaha: request failed: Post https://tectonic.update.core-os.net/v1/update/: dial tcp 54.208.219.41:443: i/o timeout
W1219 11:57:43.809684 1 omaha.go:232] Failed to send event back to coreupdate server: omaha: request failed: Post https://tectonic.update.core-os.net/v1/update/: dial tcp 52.201.143.107:443: i/o timeout
This container, however, is started from a Deployment instead of a DaemonSet.
$ kubectl get deployment --namespace=tectonic-system|grep tectonic-channel-operator
tectonic-channel-operator 1 1 1 1 25d
The Deployment uses a .spec.strategy.type of RollingUpdate. Set the web proxy env vars in the Deployment, too:
$ kubectl edit deployment --namespace=tectonic-system tectonic-channel-operator
Addendum: Also make sure you add the pod and service IP address ranges to the no_proxy/NO_PROXY variables, as you don't want to access them via the proxy.
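For example, with placeholder pod/service CIDRs (substitute your cluster's actual ranges), the additional entries might look like this. Note that CIDR notation in no_proxy is honored by some clients (e.g. Go's proxy handling in golang.org/x/net/http/httpproxy) but not all, so exact hosts or domain suffixes may be needed for other tools:

```yaml
# Placeholder CIDRs -- substitute your cluster's real pod and service ranges.
# Caveat: not every HTTP client accepts CIDR blocks in no_proxy; some only
# match exact hosts or domain suffixes.
- name: no_proxy
  value: 127.0.0.1,.localdomain,10.2.0.0/16,10.3.0.0/16
- name: NO_PROXY
  value: 127.0.0.1,.localdomain,10.2.0.0/16,10.3.0.0/16
```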
@knweiss Can you send across what you had in /etc/systemd/system.conf.d/10-default-env.conf?
This is a snippet i use in one of my ignition sequences to do the same thing:
- path: /etc/systemd/system.conf.d/coreos-proxy.conf
  filesystem: root
  mode: 0644
  contents:
    inline: |
      [Manager]
      DefaultEnvironment="http_proxy={{.proxy_endpoint}}" "HTTP_PROXY={{.proxy_endpoint}}" "https_proxy={{.proxy_endpoint}}" "HTTPS_PROXY={{.proxy_endpoint}}" "no_proxy=matchbox.sfo.rvu.io"
The biggest trick is handling the nuances of how systemd expands the variables [1]: ensuring that I'm producing five distinct variables, and not one variable whose value contains '192.168.1.10 HTTP_PROXY="192.168.1.10" https_proxy="192.168.1.10" HTTPS_PROXY="192.168.1.10" no_proxy=matchbox.sfo.rvu.io"'.
Note: This uses the Go templating functionality available within Matchbox to populate .proxy_endpoint.
@brianredbeard I'm on vacation and can't look up the details right now. As far as I remember, I hardcoded the proxy settings in this file and did not use variables. I'll come back to this issue in the last week of January.
@brianredbeard One more comment: with the workarounds mentioned above, two CoreOS updates finally went through fine. However, issue #223 (Tectonic update) remained open (and a blocker for our test deployment); I could not work around it before I went on vacation. I would appreciate it if you could take a look at my detailed comment in that issue, too.
@knweiss Also stuck with issue #223 when trying to update behind a proxy - did you manage to find a workaround? Thanks.
@esselfour No. However, I've had a long break from this project and just resumed my testing.
@brianredbeard Sorry for the long delay. My /etc/systemd/system.conf.d/10-default-env.conf looks different, as I used one DefaultEnvironment= line per variable:
[Manager]
DefaultEnvironment=http_proxy=http://192.168.144.2:8118
DefaultEnvironment=HTTP_PROXY=http://192.168.144.2:8118
DefaultEnvironment=https_proxy=http://192.168.144.2:8118
DefaultEnvironment=HTTPS_PROXY=http://192.168.144.2:8118
DefaultEnvironment=no_proxy=127.0.0.1,.localdomain,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.120.1,192.168.50.253
DefaultEnvironment=NO_PROXY=127.0.0.1,.localdomain,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.120.1,192.168.50.253
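Long hand-typed lists like this are error-prone (note the 192.168.167.211-218 block appears twice in the lines above); a small sketch that generates the repeated range instead:

```shell
# Generate the 192.168.167.211-218 part of no_proxy from a range
# instead of typing (and accidentally duplicating) each address.
hosts=$(seq -f '192.168.167.%g' 211 218 | paste -sd, -)
echo "DefaultEnvironment=no_proxy=127.0.0.1,.localdomain,${hosts},192.168.120.1,192.168.50.253"
```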
@brianredbeard Setting the environment variables on a single line doesn't make a difference for #223. This is still a showstopper for me: Container Linux updates now work fine through the web proxy (I'm on 1632.2.1), but Tectonic updates fail as explained in #223.
Issue Report Template

Tectonic Version

tectonic_1.7.9-tectonic.2.zip

Environment

Bare metal behind a web proxy.

Expected Behavior

While installing our first Tectonic Kubernetes test cluster we had several issues with web proxy access (e.g. #234). We tried all the hints mentioned in this forum (e.g. #89) to configure an HTTP proxy for Tectonic, Docker, etc. and got it working (after several hours). I.e. we patched the proxy environment variables into:

docker.service
early-docker.service
/etc/systemd/system.conf.d/10-default-env.conf
/etc/profile.env
/etc/environment
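For docker.service, the usual way to patch in these variables is a systemd drop-in rather than editing the unit itself; a sketch using this thread's example proxy address:

```
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://192.168.144.2:8118" "HTTPS_PROXY=http://192.168.144.2:8118" "NO_PROXY=127.0.0.1,.localdomain"
```

followed by systemctl daemon-reload and systemctl restart docker.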
One remaining issue, though, was the download of torcx_manifest.json, as this did not use the web proxy. I.e. it was still failing although proxy access was working fine for other files/URLs. There are two places where we ran into issues with this manifest:

1. k8s-node-bootstrap.service will try to download the manifest but does not use the proxy.
2. tectonic-torcx-pre-hook will try to download the manifest.

Both failed in our tests.
Actual Behavior

In 1) we got:

A tcpdump trace confirmed that the container is not using the web proxy but tries to access the IP 104.16.21.26 directly:

(See below for our workaround.)

In 2) we still get this from the tectonic-torcx-pre-hook pod: a new CoreOS stable version was found and downloaded, but the reboot of all nodes has been pending for a couple of hours now.

Reproduction Steps
torcx_manifest.json download problems, as the download is performed without using the proxy.

Other Information

Feature Request

torcx_manifest.json downloads should use the configured web proxy.

Other Information
Workaround for 1):

We edited the file /etc/systemd/system/k8s-node-bootstrap.service on all three master nodes and modified the manifest URL from … to … I.e. we redirected the manifest to a local webserver and put the two files torcx_manifest.json and torcx_manifest.asc there. With this modification the bootstrap finally continued and the Tectonic installation finished successfully.

(After fixing this we had to re-execute sudo systemctl start tectonic-installer on the first master node, as this service had initially failed, if I remember correctly.)

We're currently still trying to find a workaround for 2). Any hints?