coreos / coreos-kubernetes

CoreOS Container Linux+Kubernetes documentation & Vagrant installers
https://coreos.com/kubernetes/docs/latest/
Apache License 2.0

decrypt-tls-assets.service fails to start up due to KMS connection error #744

Open anuraaga opened 8 years ago

anuraaga commented 8 years ago

I am trying to start a cluster using kube-aws 0.8.3, but the controller fails to start up because decrypt-tls-assets.service fails. Trying to restart it manually results in the same failure so it's not sporadic.

The error is awscli[5]: Could not connect to the endpoint URL: "https://kms.ap-northeast-1.amazonaws.com/"

$ systemctl status decrypt-tls-assets
● decrypt-tls-assets.service - decrypt kubelet tls assets using amazon kms
   Loaded: loaded (/etc/systemd/system/decrypt-tls-assets.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2016-10-27 08:36:42 UTC; 1min 51s ago
  Process: 1952 ExecStart=/opt/bin/decrypt-tls-assets (code=exited, status=255)
 Main PID: 1952 (code=exited, status=255)

Oct 27 08:36:05 ip-172-216-1-10.ap-northeast-1.compute.internal systemd[1]: Starting decrypt kubelet tls assets using amazon kms...
Oct 27 08:36:05 ip-172-216-1-10.ap-northeast-1.compute.internal sudo[1954]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/rkt run --volume=ssl,kind=host,source=/etc/kubernetes/ssl,readOnly=false --mount=volume=s
Oct 27 08:36:05 ip-172-216-1-10.ap-northeast-1.compute.internal sudo[1954]: pam_unix(sudo:session): session opened for user root by (uid=0)
Oct 27 08:36:05 ip-172-216-1-10.ap-northeast-1.compute.internal decrypt-tls-assets[1952]: image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.8.0
Oct 27 08:36:05 ip-172-216-1-10.ap-northeast-1.compute.internal decrypt-tls-assets[1952]: image: using image from local store for image name quay.io/coreos/awscli
Oct 27 08:36:41 ip-172-216-1-10.ap-northeast-1.compute.internal decrypt-tls-assets[1952]: [  817.933836] awscli[5]: Could not connect to the endpoint URL: "https://kms.ap-northeast-1.amazonaws.com/"
Oct 27 08:36:42 ip-172-216-1-10.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Main process exited, code=exited, status=255/n/a
Oct 27 08:36:42 ip-172-216-1-10.ap-northeast-1.compute.internal systemd[1]: Failed to start decrypt kubelet tls assets using amazon kms.
Oct 27 08:36:42 ip-172-216-1-10.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Unit entered failed state.
Oct 27 08:36:42 ip-172-216-1-10.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Failed with result 'exit-code'.

I can use curl to access the URL from the node:

$ curl https://kms.ap-northeast-1.amazonaws.com/
<MissingAuthenticationTokenException>
  <Message>Missing Authentication Token</Message>
</MissingAuthenticationTokenException>

When running curl in rkt, however, it cannot resolve the hostname:

$ sudo rkt run --net=host --dns=8.8.8.8 quay.io/coreos/awscli --exec=/bin/bash -- -c "/usr/bin/curl https://kms.ap-northeast-1.amazonaws.com/"
image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.8.0
image: using image from local store for image name quay.io/coreos/awscli
[ 2001.883010] awscli[5]: curl: (6) Couldn't resolve host 'kms.ap-northeast-1.amazonaws.com'

However, pinging 8.8.8.8 from within rkt works fine:

$ sudo rkt run --net=host --dns=8.8.8.8 quay.io/coreos/awscli --exec=/bin/bash -- -c "ping 8.8.8.8"
image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.8.0
image: using image from local store for image name quay.io/coreos/awscli
[ 2036.786210] awscli[5]: PING 8.8.8.8 (8.8.8.8): 56 data bytes
[ 2036.801570] awscli[5]: 64 bytes from 8.8.8.8: seq=0 ttl=59 time=1.751 ms
[ 2037.513293] awscli[5]: 64 bytes from 8.8.8.8: seq=240 ttl=59 time=1.
...
anuraaga commented 8 years ago

I found that replacing --dns=8.8.8.8 with mounting the host's /etc/resolv.conf works. Is there a reason not to mount resolv.conf for this command?

Works:

  - path: /opt/bin/decrypt-tls-assets
    owner: root:root
    permissions: 0700
    content: |
      #!/bin/bash -e

      for encKey in $(find /etc/kubernetes/ssl/*.pem.enc); do
        sudo rkt run \
        --volume=ssl,kind=host,source=/etc/kubernetes/ssl,readOnly=false \
        --mount=volume=ssl,target=/etc/kubernetes/ssl \
        --uuid-file-save=/var/run/coreos/decrypt-tls-assets.uuid \
        --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=dns,target=/etc/resolv.conf \
        --net=host \
        --trust-keys-from-https \
        quay.io/coreos/awscli --exec=/bin/bash -- \
          -c \
          "/usr/bin/aws \
            --region {{.Region}} kms decrypt \
            --ciphertext-blob fileb://$encKey \
            --output text \
            --query Plaintext \
            > $encKey.b64"

        base64 --decode < $encKey.b64 > ${encKey%.enc}
        sudo rkt rm --uuid-file=/var/run/coreos/decrypt-tls-assets.uuid
      done
pieterlange commented 8 years ago

:+1: I think we should be careful in general about peppering code with 8.8.8.8.

colhom commented 8 years ago

Hello Kubernetes Community,

Future work on kube-aws will be moved to a new dedicated repository. @mumoshu will be running point on maintaining that repository; please move all issues and PRs over there as soon as you can. We will be halting active development on the AWS portion of this repository in the near future. We will continue to maintain the Vagrant single- and multi-node distributions in this repository, along with our hyperkube container image.

A community announcement to end users will be made once the transition is complete. We at CoreOS ask that those reading this message avoid publicizing/blogging about the transition until the official announcement has been made to the community in the next week.

The new dedicated kube-aws repository already has the following features merged in:

If anyone in the Kubernetes community would like to be involved with maintaining this new repository, find @chom and/or @mumoshu on the Kubernetes slack in the #sig-aws channel or via direct message.

~CoreOS Infra Team

mumoshu commented 8 years ago

@anuraaga Just curious, but could it be possible that your infrastructure/network is blocking access to Google Public DNS?

// Once I understand the problem correctly, I'd like to merge https://github.com/coreos/kube-aws/issues/6 asap!

anuraaga commented 8 years ago

Thanks for checking on this. I should have verified the lookup against 8.8.8.8 on the node with dig. Indeed, this wasn't working:

$ dig @8.8.8.8 kms.ap-northeast-1.amazonaws.com

It's weird, since I had opened up port 53 for both TCP and UDP on the network ACL. I found that the only way I could reliably get that dig command to work was to open up UDP 1-65535. Cutting the range down to, e.g., 40000-65535 would allow maybe 30% of the requests to work, as if it's just picking a random port each time. Not sure why this happens. I definitely don't want to have to open up all these ports.

Anyways, would still prefer to have DNS requests in general not leaving the VPC even during Kubernetes bootstrap if it makes sense.

mumoshu commented 8 years ago

@anuraaga Thanks for your response 😄 I guess you should open up ports from 32768 to 61000 in one of your network ACL's outbound rules.

AFAIK, Linux kernels generally use ephemeral ports ranging from 32768 to 61000, and DNS clients pick their source ports from that range. The AWS VPC documentation on ephemeral ports would help.

// I'm not entirely sure how exactly DNS clients work, so please correct me if this seems wrong.

If you have some time, I suggest running tcpdump -nn port 53 and then dig in another console to see that the source port varies within the range mentioned above, which would confirm that you need to open up the ephemeral port range in your ACL rules.
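The suggestion above can be sketched as follows. This is a hedged illustration, not taken from the thread: the tcpdump/dig invocations, the rule number, and the `aws ec2 create-network-acl-entry` call are assumptions about how one would verify and fix this.

```shell
# Show the kernel's ephemeral (local) port range; DNS clients bind their
# UDP source port from this range.
cat /proc/sys/net/ipv4/ip_local_port_range

# Illustrative, and needing root / AWS credentials, so left commented out:
# watch the source ports dig actually uses,
#   sudo tcpdump -nn udp port 53
# then in another console,
#   dig @8.8.8.8 kms.ap-northeast-1.amazonaws.com
#
# Because network ACLs are stateless, the DNS response (source port 53,
# destination = the ephemeral source port of the query) must be allowed
# by a rule covering that range, e.g. (protocol 17 = UDP):
#   aws ec2 create-network-acl-entry --network-acl-id <acl-id> \
#     --ingress --rule-number 110 --protocol 17 \
#     --port-range From=32768,To=61000 \
#     --cidr-block 0.0.0.0/0 --rule-action allow
```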

mumoshu commented 8 years ago

Anyway,

would still prefer to have DNS requests in general not leaving the VPC even during Kubernetes bootstrap if it makes sense.

I agree with this 👍

anuraaga commented 8 years ago

Thanks for the explanation; my understanding of ephemeral ports was lacking, but it makes sense :)

mumoshu commented 8 years ago

Thanks for your confirmation! FYI, https://github.com/coreos/kube-aws/pull/6 is merged.