fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

Can't connect to memcached #1907

Closed stillinbeta closed 5 years ago

stillinbeta commented 5 years ago

Memcached appears to be running:

$ kubectl get po
NAME                         READY   STATUS    RESTARTS   AGE
c-575ffd4c95-kz4nl           1/1     Running   0          16h
flux-fd7f478d7-4bks6         1/1     Running   0          16h
memcached-549dd79996-ps4df   1/1     Running   0          10m

But Flux seemingly can't connect:

ts=2019-04-08T15:09:23.046183615Z caller=memcached.go:109 component=memcached err="Fetching tag from memcache: memcache: no servers configured or available"
ts=2019-04-08T15:09:23.046197713Z caller=warming.go:170 component=warmer canonical_name=quay.io/munnerz/apiextensions-ca-helper auth={map[]} err="fetching previous result from cache: memcache: no servers configured or available"
ts=2019-04-08T15:09:23.046244011Z caller=memcached.go:109 component=memcached err="Fetching tag from memcache: memcache: no servers configured or available"
ts=2019-04-08T15:09:23.046257242Z caller=warming.go:170 component=warmer canonical_name=index.docker.io/envoyproxy/envoy auth={map[]} err="fetching previous result from cache: memcache: no servers configured or available"
ts=2019-04-08T15:09:23.046278637Z caller=memcached.go:109 component=memcached err="Fetching tag from memcache: memcache: no servers configured or available"
ts=2019-04-08T15:09:23.046308452Z caller=warming.go:170 component=warmer canonical_name=index.docker.io/digitalocean/do-csi-plugin auth={map[]} err="fetching previous result from cache: memcache: no servers configured or available"

I've tried destroying the memcached pod and upgrading flux, but no dice so far.

Running on DigitalOcean, Kubernetes version 1.13.5

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
squaremo commented 5 years ago

How did you install flux? Can you post the manifests here pls (redacted if necessary)?

stillinbeta commented 5 years ago
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      name: flux
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io.port: "3031" # tell prometheus to scrape /metrics endpoint's port.
      labels:
        name: flux
    spec:
      serviceAccountName: flux
      volumes:
      - name: git-key
        secret:
          secretName: flux-git-deploy
          defaultMode: 0400 # when mounted read-only, we won't be able to chmod

      # This is a tmpfs used for generating SSH keys. In K8s >= 1.10,
      # mounted secrets are read-only, so we need a separate volume we
      # can write to.
      - name: git-keygen
        emptyDir:
          medium: Memory

      # The following volume is for using a customised known_hosts
      # file, which you will need to do if you host your own git
      # repo rather than using github or the like. You'll also need to
      # mount it into the container, below. See
      # https://github.com/weaveworks/flux/blob/master/site/standalone-setup.md#using-a-private-git-host
      # - name: ssh-config
      #   configMap:
      #     name: flux-ssh-config

      # The following volume is for using a customised .kube/config,
      # which you will need to do if you wish to have a different
      # default namespace. You will also need to provide the configmap
      # with an entry for `config`, and uncomment the volumeMount and
      # env entries below.
      # - name: kubeconfig
      #   configMap:
      #     name: flux-kubeconfig

      containers:
      - name: flux
        # There are no ":latest" images for flux. Find the most recent
        # release or image version at https://quay.io/weaveworks/flux
        # and replace the tag here.
        image: quay.io/weaveworks/flux:1.11.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        ports:
        - containerPort: 3030 # informational
        volumeMounts:
        - name: git-key
          mountPath: /etc/fluxd/ssh # to match location given in image's /etc/ssh/config
          readOnly: true # this will be the case perforce in K8s >=1.10
        - name: git-keygen
          mountPath: /var/fluxd/keygen # to match location given in image's /etc/ssh/config

        # Include this if you need to mount a customised known_hosts
        # file; you'll also need the volume declared above.
        # - name: ssh-config
        #   mountPath: /root/.ssh

        # Include this and the volume "kubeconfig" above, and the
        # environment entry "KUBECONFIG" below, to override the config
        # used by kubectl.
        # - name: kubeconfig
        #   mountPath: /etc/fluxd/kube

        # Include this to point kubectl at a different config; you
        # will need to do this if you have mounted an alternate config
        # from a configmap, as in commented blocks above.
        # env:
        # - name: KUBECONFIG
        #   value: /etc/fluxd/kube/config

        args:

        # if you deployed memcached in a different namespace to flux,
        # or with a different service name, you can supply these
        # following two arguments to tell fluxd how to connect to it.
        # - --memcached-hostname=memcached.default.svc.cluster.local

        # use the memcached ClusterIP service name by setting the
        # memcached-service to string empty
        - --memcached-service=

        # this must be supplied, and be in the tmpfs (emptyDir)
        # mounted above, for K8s >= 1.10
        - --ssh-keygen-dir=/var/fluxd/keygen

        # replace or remove the following URL
        - --git-url=git@github.com:stillinbeta/leckie.git
        - --git-branch=master

        # include these next two to connect to an "upstream" service
        # (e.g., Weave Cloud). The token is particular to the service.
        # - --connect=wss://cloud.weave.works/api/flux
        # - --token=abc123abc123abc123abc123

        # serve /metrics endpoint at different port.
        # make sure to set prometheus' annotation to scrape the port value.
        - --listen-metrics=:3031

Pretty much a straight copy of the example deployment from the weaveworks/flux repo; I just added my repository.

squaremo commented 5 years ago

Thanks for that -- that config for fluxd looks OK to me. Just to check: did you create the service for memcached in the same namespace, and does it show up in kubectl get svc?

stillinbeta commented 5 years ago
$ kubectl get svc
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
kubernetes   ClusterIP   10.245.0.1       <none>        443/TCP     7d19h
memcached    ClusterIP   10.245.117.136   <none>        11211/TCP   7d19h

they're both in the default namespace

squaremo commented 5 years ago

I see the fluxd pod has been running for 16h, and the service, at least, was created about a week ago -- was this working for a while and then it stopped, or did it never work?

If it worked at one point then stopped, my suspicion would rest on Kubernetes' name resolution breaking down. You can test that by exec'ing into the fluxd container and seeing if it can nslookup memcached.

$ kubectl exec -ti flux-fd7f478d7-4bks6 -- /bin/sh
# nslookup memcached

Sometimes restarting the fluxd pod fixes this. In general it's safe to restart the fluxd pod, since the only state that matters is what's in your git repo and/or the cluster.

If it never worked, I'd lean towards a configuration problem. Though I'd struggle to say where -- from what you've posted, it looks fine to me :-/

stillinbeta commented 5 years ago
$ kubectl exec -ti flux-fd7f478d7-4bks6 -- /bin/sh
/home/flux # nslookup memcached
nslookup: can't resolve '(null)': Name does not resolve

Name:      memcached
Address 1: 10.245.117.136 memcached.default.svc.cluster.local
/home/flux # nc 10.245.117.136 11211
stats
STAT pid 1
STAT uptime 3072
STAT time 1554738423
STAT version 1.4.25
STAT libevent 2.0.21-stable
<snip>

so the connection seems to be fine. Restarting flux does appear to have fixed it, but that's not a particularly satisfying solution. I've had this problem occur once before already. Any ideas what could be causing it?

squaremo commented 5 years ago

Restarting flux does appear to have fixed it, but that's not a particularly satisfying solution. I've had this problem occur once before already. Any ideas what could be causing it?

There have been hints (and this is a good one) that fluxd doesn't recover well from losing its memcached connection, or from being initially unable to resolve the service's hostname.

Mind if I treat this as the definitive bug report for this particular problem?

stillinbeta commented 5 years ago

Not at all! Let me know if I can be of any help!

vanderstack commented 5 years ago

This looks very similar to the experience I had confirmed in #1766

squaremo commented 5 years ago

This looks very similar to the experience I had confirmed in #1766

Yes; it looks like the cause and the symptom may well be the same.

squaremo commented 5 years ago

I can provoke the particular log message reported above by making sure CoreDNS is not available when fluxd starts up (and then at some point shortly after, is available). For example, in a local minikube instance which is operating normally,

  1. kubectl scale -n kube-system deploy/coredns --replicas=0
  2. kubectl delete pod -l name=flux
  3. kubectl scale -n kube-system deploy/coredns --replicas=2
  4. kubectl logs deploy/flux -f

This will break things fairly reliably, though I'm sure there's a particular window in which the CoreDNS outage will cause the problem in question and not others.

One probable reason for the problem is that the memcache client resolves hostnames when it starts up. My suspicion is that if it fails to do so, it'll nonetheless continue with an empty pool of addresses, which will never be populated. Thus memcache: no servers configured or available.
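For illustration: that error string is ErrNoServers from the bradfitz/gomemcache client, which resolves each server address once in SetServers and leaves the pool empty if resolution fails (memcache.New discards that error). A minimal sketch of the failure mode, assuming fluxd's cache wraps that client -- the cache key below is just a placeholder:

package main

import (
    "fmt"

    "github.com/bradfitz/gomemcache/memcache"
)

func main() {
    // memcache.New runs ServerList.SetServers once and discards its error,
    // so if DNS is down at start-up the client is left with an empty
    // address pool.
    mc := memcache.New("memcached.default.svc.cluster.local:11211")

    // With an empty pool every operation returns ErrNoServers
    // ("memcache: no servers configured or available"), and keeps doing so
    // even after DNS recovers, because the hostname is never resolved again.
    if _, err := mc.Get("some-cached-tag"); err == memcache.ErrNoServers {
        fmt.Println("stuck:", err)
    }
}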

A fix would be to periodically reset the hostnames provided, so they will be resolved again. This will fight the memcache client's connection pooling, but I am not that concerned about the overhead of re-establishing connections, for our purposes.
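A sketch of that kind of periodic re-resolution, assuming the gomemcache ServerList selector as above (this is not the actual #1913 change, and newResolvingClient is a made-up name):

package main

import (
    "log"
    "time"

    "github.com/bradfitz/gomemcache/memcache"
)

// newResolvingClient returns a memcached client whose address pool is
// refreshed on a timer, so a failed lookup at start-up (or a later DNS
// change) doesn't leave the pool empty forever. ServerList.SetServers is
// documented as safe for concurrent use, so refreshing it while the client
// is busy is fine.
func newResolvingClient(addr string, every time.Duration) *memcache.Client {
    var servers memcache.ServerList
    if err := servers.SetServers(addr); err != nil {
        log.Printf("initial memcached resolution failed, will retry: %v", err)
    }
    go func() {
        for range time.Tick(every) {
            if err := servers.SetServers(addr); err != nil {
                log.Printf("re-resolving %s failed: %v", addr, err)
            }
        }
    }()
    return memcache.NewFromSelector(&servers)
}

func main() {
    mc := newResolvingClient("memcached.default.svc.cluster.local:11211", 5*time.Minute)
    _ = mc // use mc wherever the cache client is needed
}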

squaremo commented 5 years ago

The fix (#1913) should appear in a patch release fairly soon. If you are willing to try it out, the image built from master branch, quay.io/weaveworks/flux:master-bcf0f543, includes it.