coreos / tectonic-forum

torcx_manifest.json download doesn't use web proxy #235

Open knweiss opened 6 years ago

knweiss commented 6 years ago

Issue Report Template

Tectonic Version

tectonic_1.7.9-tectonic.2.zip

Environment

Bare metal behind web proxy.

Expected Behavior

While installing our first Tectonic Kubernetes test cluster we had several issues with web proxy access (e.g. #234). We tried all the hints mentioned in this forum (e.g. #89) to configure an HTTP proxy for Tectonic, Docker, etc. and got it working (after several hours). I.e. we patched the proxy environment variables into

One remaining issue, though, was the download of the torcx_manifest.json as this did not use the web proxy. I.e. it is still failing although the proxy access was working fine for other files/URLs.

There are two places where we ran into issues with this manifest:

  1. During Tectonic installation, k8s-node-bootstrap.service tries to download the manifest but does not use the proxy.
  2. During CoreOS upgrade the tectonic-torcx-pre-hook will try to download the manifest.

Both failed in our tests.

Actual Behavior

In 1) we got:

Nov 22 13:36:15 k8s-worker5.localdomain docker[2394]: Error: Could not get
package manifest for 1520.8.0: could not fetch package manifest at
https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.8.0/torcx_manifest.json:
Get https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.8.0/torcx_manifest.json:
dial tcp 104.16.21.26:443: getsockopt: network is unreachable

A tcpdump trace confirmed that the container is not using the web proxy but tries to access the IP 104.16.21.26 directly:

07:56:03.815082 IP (tos 0x0, ttl 64, id 22599, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.2.46942 > 104.16.21.26.443: Flags [S], cksum 0x296c (incorrect -> 0xbbb1), seq 2835492134, win 29200,options [mss 1460,sackOK,TS val 2137580117 ecr 0,nop,wscale 7], length 0

(See below for our workaround)

In 2) a new CoreOS stable version was found and downloaded, but the reboot of all nodes has now been pending for a couple of hours, and we still get this from the tectonic-torcx-pre-hook pod:

$ kubectl get events --namespace=tectonic-system
LAST SEEN   FIRST SEEN   COUNT     NAME                                             KIND      SUBOBJECT                       TYPE      REASON       SOURCE                             MESSAGE
5m          17h          126       tectonic-torcx-pre-hook-3qjn5.14fd7059c58573cd   Pod       spec.containers{update-agent}   Normal    Created      kubelet, k8s-worker1.localdomain   Created container
5m          17h          126       tectonic-torcx-pre-hook-3qjn5.14fd7059cca6a500   Pod       spec.containers{update-agent}   Normal    Started      kubelet, k8s-worker1.localdomain   Started container
5m          17h          125       tectonic-torcx-pre-hook-3qjn5.14fd70c2b17a7853   Pod       spec.containers{update-agent}   Normal    Pulled       kubelet, k8s-worker1.localdomain   Container image "quay.io/coreos/tectonic-torcx:v0.1.2" already present on machine
5m          17h          605       tectonic-torcx-pre-hook-3qjn5.14fd712ba65796c7   Pod       spec.containers{update-agent}   Warning   BackOff      kubelet, k8s-worker1.localdomain   Back-off restarting failed container
5m          17h          605       tectonic-torcx-pre-hook-3qjn5.14fd712ba658ab4e   Pod                                       Warning   FailedSync   kubelet, k8s-worker1.localdomain   Error syncing pod
$ kubectl logs tectonic-torcx-pre-hook-3qjn5 --namespace=tectonic-system -f
time="2017-12-05T16:33:19Z" level=debug msg="executing: /usr/sbin/torcx [help]"
time="2017-12-05T16:33:19Z" level=debug msg="reading current OS version + board from /usr/lib/os-release"
time="2017-12-05T16:33:19Z" level=info msg="Current OS version is 1520.8.0, board is amd64-usr"
time="2017-12-05T16:33:19Z" level=info msg="using local file /etc/kubernetes/kubelet.env to determine kubernetes version"
time="2017-12-05T16:33:19Z" level=info msg="Detected Kubernetes version v1.7.9+coreos.0"
time="2017-12-05T16:33:19Z" level=info msg="Kubernetes needs Docker version(s) [1.12]"
time="2017-12-05T16:33:19Z" level=debug msg="Requesting next OS version"
time="2017-12-05T16:33:19Z" level=info msg="Next OS version is 1520.9.0"
time="2017-12-05T16:33:19Z" level=info msg="Determining correct docker version"
time="2017-12-05T16:33:19Z" level=debug msg="GET https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.9.0/torcx_manifest.json"
Error: Could not get package manifest for 1520.9.0: could not fetch package manifest at https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.9.0/torcx_manifest.json: Get https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.9.0/torcx_manifest.json: dial tcp 104.16.20.26:443: i/o timeout
time="2017-12-05T16:40:50Z" level=error msg="Could not get package manifest for 1520.9.0: could not fetch package manifest at https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.9.0/torcx_manifest.json: Get https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.9.0/torcx_manifest.json: dial tcp 104.16.20.26:443: i/o timeout"

Reproduction Steps

  1. Run the Tectonic installer on bare metal behind a web proxy.
  2. Installation will not complete because of the torcx_manifest.json download problems as the download is performed without using the proxy.

Feature Request

torcx_manifest.json downloads should use the configured web proxy.

Other Information

Workaround for 1):

We edited the file /etc/systemd/system/k8s-node-bootstrap.service on all three master nodes and modified the manifest-URL from

--torcx-manifest-url="https://tectonic-torcx.release.core-os.net/manifests/amd64-usr/1520.8.0/torcx_manifest.json" 

to

--torcx-manifest-url="http://webserver.localdomain/torcx_manifest.json" 

I.e. we redirected the manifest to a local webserver and put the two files torcx_manifest.json and torcx_manifest.asc there.

With this modification the bootstrap finally continued and the tectonic installation finished successfully.

(After fixing this we had to re-execute sudo systemctl start tectonic-installer on the first master node, as this service had initially failed, if I remember correctly.)

We're currently still trying to find a workaround for 2). Any hints?

knweiss commented 6 years ago

Another hint regarding 2): I see the HTTP_PROXY variable in the environment of the containerd-shim process but not in the environment of its tectonic-torcx-hook-pre child process:

k8s-worker1 ~ # ps axuwf|grep -B 1 /tectonic-torcx-hook-pre           
root     41422  0.0  0.0 412924  4924 ?        Sl   15:27   0:00  \_ /run/torcx/bin/containerd-shim c12a58976b13455f412787eb9f03142f0eb143ea11fea06081091ec866805053 /var/run/docker/libcontainerd/c12a58976b13455f412787eb9f03142f0eb143ea11fea06081091ec866805053 docker-runc
root     41438  0.0  0.0  38620 22352 ?        Ssl  15:27   0:00      \_ /tectonic-torcx-hook-pre --verbose=debug --node-annotation=container-linux-update.v1.coreos.com/tectonic-torcx-pre-hook-ok --sleep=604800
k8s-worker1 ~ # cat /proc/41422/environ|grep -q HTTP_PROXY && echo found
found
k8s-worker1 ~ # cat /proc/41438/environ|grep -q HTTP_PROXY && echo found
k8s-worker1 ~ #
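
/proc/<pid>/environ is a NUL-separated list, so it is easier to read once the separators are turned into newlines. A small sketch that lists whichever proxy variables a process actually has, instead of only testing for one of them (proxy_vars is a hypothetical helper name, not part of any tool mentioned here):

```shell
# List the proxy-related variables in a NUL-separated environment read from stdin.
# Matches http_proxy, https_proxy and no_proxy in either case.
proxy_vars() {
    tr '\0' '\n' | grep -i -E '^(https?|no)_proxy=' || echo "no proxy variables set"
}

# Usage on a node (PIDs taken from the ps output above):
#   proxy_vars < /proc/41422/environ
#   proxy_vars < /proc/41438/environ
```

Run against the two PIDs above, this should show the proxy settings for the containerd-shim and "no proxy variables set" for its child.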

knweiss commented 6 years ago

I found a workaround for this issue:

The torcx-pre-hook pods get started from a DaemonSet in the tectonic-system namespace:

$ kubectl get daemonset --namespace=tectonic-system|grep torcx-pre-hook
tectonic-torcx-pre-hook           0         0         0         0            0           container-linux-update.v1.coreos.com/before-reboot=true   25d

This DaemonSet uses the OnDelete update strategy, i.e. existing pods only pick up a changed pod template once they are deleted. So I did the following:

$ kubectl edit daemonset --namespace=tectonic-system tectonic-torcx-pre-hook

In the editor I added the following six proxy env-vars to the env section:

        env:
        - name: NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: http_proxy
          value: http://192.168.144.2:8118
        - name: HTTP_PROXY
          value: http://192.168.144.2:8118
        - name: https_proxy
          value: http://192.168.144.2:8118
        - name: HTTPS_PROXY
          value: http://192.168.144.2:8118
        - name: no_proxy
          value: 127.0.0.1,.localdomain
        - name: NO_PROXY
          value: 127.0.0.1,.localdomain
        image: quay.io/coreos/tectonic-torcx:v0.1.2

The next step was to delete the hanging pod.

$ kubectl delete pods --namespace=tectonic-system tectonic-torcx-pre-hook-55m9b

After this the hanging tectonic-torcx-pre-hook pod will be recreated with the required web proxy environment!

At this point the Pending Reboot Status started to get resolved and the Kubernetes nodes rebooted successfully one after the other.
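
The same change can probably be scripted instead of made in an editor. A minimal sketch, assuming a kubectl that provides the `set env` subcommand (whether the kubectl matching Tectonic 1.7.9 has it is not verified here). proxy_env_args is a hypothetical helper; the proxy address and no_proxy list are the ones used above:

```shell
# Print NAME=value pairs for the lower- and upper-case proxy variables.
# $1: proxy URL, $2: comma-separated no_proxy list (values must not contain spaces).
proxy_env_args() {
    proxy=$1
    no_proxy_list=$2
    for name in http_proxy HTTP_PROXY https_proxy HTTPS_PROXY; do
        printf '%s=%s\n' "$name" "$proxy"
    done
    printf '%s=%s\n' no_proxy "$no_proxy_list" NO_PROXY "$no_proxy_list"
}

# Show what would be set.
proxy_env_args http://192.168.144.2:8118 127.0.0.1,.localdomain

# Applying it needs cluster access, so it is left as a comment:
#   kubectl set env daemonset/tectonic-torcx-pre-hook --namespace=tectonic-system \
#       $(proxy_env_args http://192.168.144.2:8118 127.0.0.1,.localdomain)
```

Because of the OnDelete strategy, any pod that is already hanging still has to be deleted by hand afterwards.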

knweiss commented 6 years ago

The same fix applies to the tectonic-channel-operator container which is used to query the Tectonic Omaha update server for the latest Tectonic release:

$ kubectl logs --namespace=tectonic-system tectonic-channel-operator-1030110693-dtc2g |tail -n 2
E1219 11:50:04.976644       1 main.go:135] Failed to get TectonicVersion from CoreUpdate: omaha: request failed: Post https://tectonic.update.core-os.net/v1/update/: dial tcp 54.208.219.41:443: i/o timeout
W1219 11:57:43.809684       1 omaha.go:232] Failed to send event back to coreupdate server: omaha: request failed: Post https://tectonic.update.core-os.net/v1/update/: dial tcp 52.201.143.107:443: i/o timeout

This container, however, is started from a Deployment instead of a DaemonSet.

$ kubectl get deployment --namespace=tectonic-system|grep tectonic-channel-operator
tectonic-channel-operator               1         1         1            1           25d

The Deployment uses a .spec.strategy.type of RollingUpdate, so its pod is replaced automatically once the edit is saved. Set the web proxy env-vars in the Deployment, too:

$ kubectl edit deployment --namespace=tectonic-system tectonic-channel-operator

Addendum: Also make sure you add the pod and service IP address ranges to the no_proxy variables, as you don't want to access them via the proxy.

brianredbeard commented 6 years ago

@knweiss Can you send across what you had in /etc/systemd/system.conf.d/10-default-env.conf?

This is a snippet I use in one of my Ignition sequences to do the same thing:

  - path: /etc/systemd/system.conf.d/coreos-proxy.conf
    filesystem: root
    mode: 0644
    contents:
      inline: |
        [Manager]
        DefaultEnvironment="http_proxy={{.proxy_endpoint}}" "HTTP_PROXY={{.proxy_endpoint}}" "https_proxy={{.proxy_endpoint}}" "HTTPS_PROXY={{.proxy_endpoint}}" "no_proxy=matchbox.sfo.rvu.io"

The biggest trick is the nuance of how systemd expands the variables [1]: I have to make sure that I'm outputting five distinct variables and not one variable which contains '192.168.1.10 HTTP_PROXY="192.168.1.10" https_proxy="192.168.1.10" HTTPS_PROXY="192.168.1.10" no_proxy=matchbox.sfo.rvu.io'. Note: This uses the Go templating functionality available within Matchbox to populate .proxy_endpoint.

knweiss commented 6 years ago

@brianredbeard I’m on vacation and can’t look up the details right now. As far as I remember, I hardcoded the proxy settings in this file and did not use variables. I will come back to this issue in the last week of January.

knweiss commented 6 years ago

@brianredbeard One more comment: With the workarounds mentioned above, two CoreOS updates finally went through fine. However, issue #223 (Tectonic update) remained open and is a blocker for our test deployment; I could not work around it before I went on vacation. I would appreciate it if you could take a look at my detailed comment in that issue, too.

esselfour commented 6 years ago

@knweiss

Also stuck with the issue #223 when trying to update behind a proxy - did you manage to find a workaround? Thanks.

knweiss commented 6 years ago

@esselfour No. However, I've had a long break from this project and just resumed my testing. @brianredbeard Sorry for the long delay. My /etc/systemd/system.conf.d/10-default-env.conf looks different as I used one DefaultEnvironment= line per variable:

[Manager]
DefaultEnvironment=http_proxy=http://192.168.144.2:8118
DefaultEnvironment=HTTP_PROXY=http://192.168.144.2:8118
DefaultEnvironment=https_proxy=http://192.168.144.2:8118
DefaultEnvironment=HTTPS_PROXY=http://192.168.144.2:8118
DefaultEnvironment=no_proxy=127.0.0.1,.localdomain,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.120.1,192.168.50.253
DefaultEnvironment=NO_PROXY=127.0.0.1,.localdomain,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.167.211,192.168.167.212,192.168.167.213,192.168.167.214,192.168.167.215,192.168.167.216,192.168.167.217,192.168.167.218,192.168.120.1,192.168.50.253
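
The no_proxy list above repeats the entries 192.168.167.211 through 192.168.167.218. That is probably harmless, but a long hand-maintained list is easier to audit without repeats. A sketch that drops repeated entries while keeping the original order (dedupe_csv is a hypothetical helper name):

```shell
# Remove repeated entries from a comma-separated list, keeping first occurrences in order.
dedupe_csv() {
    printf '%s' "$1" | tr ',' '\n' | awk '!seen[$0]++' | paste -sd, -
}

# Example with a shortened version of the list above.
dedupe_csv "127.0.0.1,.localdomain,192.168.167.211,192.168.167.212,192.168.167.211,192.168.167.212,192.168.120.1"
```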

knweiss commented 6 years ago

@brianredbeard Setting the environment variables in a single line doesn't make a difference for #223. This is still a show stopper for me. I.e. Container Linux updates work fine using the web proxy (I'm now on 1632.2.1) but Tectonic updates fail as explained in #223.