Flux queries and lists excluded registry images

Describe the bug

I have instructed Flux to exclude these images during Helm install --set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*". However, the flux log still shows flux trying to access excluded images:

ts=2020-01-23T01:16:24.856391686Z caller=images.go:159 component=daemon err="fetching image metadata for index.docker.io/kope/dns-controller: Get https://index.docker.io/v2/kope/dns-controller/tags/list: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: \"Too Many Requests (HAP429).\\n\""

and fluxctl list-images --k8s-fwd-ns=flux lists a bunch included the ones that I'd expect to be excluded and may with "(untagged) image data not available":

cert-manager:deployment/cert-manager                                  cert-manager                                                                          image data not available
                                                                                                     '-> (untagged)                                         ?
cert-manager:deployment/cert-manager-cainjector                       cert-manager                                                                          image data not available
                                                                                                     '-> (untagged)                                         ?
cert-manager:deployment/cert-manager-webhook                          cert-manager                                                                          image data not available
                                                                                                     '-> (untagged)                                         ?
cert-manager:helmrelease/cert-manager                                                                                                                       
default:deployment/external-dns                                       external-dns                                                                          image data not available
                                                                                                     '-> (untagged)                                         ?
default:deployment/nginx-ingress-controller                           nginx-ingress-controller                                                              image data not available
                                                                                                     '-> (untagged)                                         ?
default:deployment/nginx-ingress-default-backend                      nginx-ingress-default-backend  k8s.gcr.io/defaultbackend-amd64                        
                                                                                                     '-> 1.5                                                28 Sep 18 17:05 UTC
                                                                                                         1.4                                                24 Oct 17 17:30 UTC
                                                                                                         1.3                                                27 Feb 17 21:34 UTC
                                                                                                         1.2                                                03 Aug 16 16:18 UTC
                                                                                                         1.1                                                10 Jun 16 21:37 UTC
default:helmrelease/external-dns                                                                                                                            
default:helmrelease/nginx-ingress                                                                                                                           
flux:deployment/flux                                                  flux                                                                                  image data not available
                                                                                                     '-> (untagged)                                         ?
flux:deployment/flux-memcached                                        memcached                                                                             image data not available
                                                                                                     '-> (untagged)                                         ?
flux:deployment/helm-operator                                         flux-helm-operator                                                                    image data not available
                                                                                                     '-> (untagged)                                         ?
flux:helmrelease/flux                                                                                                                                       
kube-system:daemonset/calico-node                                     calico-node                                                                           image data not available
                                                                                                     '-> (untagged)                                         ?
                                                                      upgrade-ipam                                                                          image data not available
                                                                                                     '-> (untagged)                                         ?
                                                                      install-cni                                                                           image data not available
                                                                                                     '-> (untagged)                                         ?
                                                                      flexvol-driver                                                                        image data not available
                                                                                                     '-> (untagged)                                         ?

... 
(omitted for brevity)

Expected behavior

By setting registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*" I'd expect that Flux will not query or list images matching the exclusion list.

Additional context

Flux version: 1.17.1
Kubernetes version: 1.15.6
Git provider: GitLab (self-hosted)
Container registry provider: GitLab (self-hosted)

ts=2020-01-23T01:16:24.856391686Z caller=images.go:159 component=daemon err="fetching image metadata for index.docker.io/kope/dns-controller: Get https://index.docker.io/v2/kope/dns-controller/tags/list: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: \"Too Many Requests (HAP429).\\n\""

This error is a cached by memcached (i.e. it's not coming live from the registry). We save the latest error from the registry into memcached, so that we can give some context as to why it isn't in the cache:

https://github.com/fluxcd/flux/blob/6a0e353061f68ba902161af3b39b4e6557c23646/pkg/registry/cache/registry.go#L92-L94

That is confusing (even for me, I will see if I can change that to indicate the error isn't live) but it's not coming from the live registry.

In the meantime, this particular error may be fixed by restarting memcached (note that this will require flux to refill the cache, which can take some time depending on the number of images it needs to scan).

From our Slack conversations it seems you did have some other errors in the logs. Would you mind sharing them?

@2opremio Thank you for your continuous help and support. You guys are excellent! The additional error I see is usually flooding my Flux log at the beginning right after a fresh helm install and then it goes away. I believe you said it was coming from the warmer. A day later, the Flux log seems nice and calm now only reporting relevant sync activity (git, cluster, our image registry).

ts=2020-01-22T23:44:26.163288733Z caller=repocachemanager.go:226 
component=warmer canonical_name=index.docker.io/calico/kube-controllers auth={map[]} 
err="Get https://docker-images-prod.s3.amazonaws.com/registry-v2/docker/registry/v2/blobs
/sha256/78/78faab2397fd5fe10a863dfa6ae8e9d5c539e3f0a3cb1339b941
7c35733dd294/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-
Credential=ASIA2KUBRXV6NQXE3JGN%2F20200122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-
Date=20200122T234426Z&X-Amz-Expires=1200&X-Amz-Security-Token=... rate limited: rate:
 Wait(n=1) would exceed context deadline" ref=calico/kube-controllers:v3.9.2-4-g637703c

I don't think that's a cached error.

What is happening is that you get rate limited until the cache gets filled up. That is perfectly normal, except you seem to be excluding index.docker.io explicitly as per:

--set registry.excludeImage="docker.io/\,index.docker.io/\,quay.io/\,k8s.gcr.io/"

Would you mind sharing the Flux Deployment running in the cluster? (as created by Helm). Also, the beginning of the Flux logs would be useful. Please redact anything which could be sensitive.

@2opremio Not sure if this is what you are asking for, but this is how I initially deploy Flux to the cluster. Please note I've set registry.excludeImage to the images that I saw rate limit exceeded for and thought should be filtered out, but I am not clear after reading all the docs if I even need to do so or whether I should remove this value and let it just run with default. Please advise:

helm install fluxcd/flux \
  --name flux \
  --namespace flux \
  --set git.user="Flux $env_upper" \
  --set git.ciSkip=true \
  --set git.url=$git_url \
  --set git.branch=$flux_git_branch \
  --set git.path=$flux_git_path \
  --set git.pollInterval=3m \
  --set git.label=flux-sync-$env \
  --set-file ssh.known_hosts=/tmp/flux_known_hosts \
  --set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*" \
  --version 1.1.0 \
  --atomic

Here is the beginning of the log of a freshly installed Flux. Seems calm now 🤷‍♂ : flux-log.txt.zip

@demisx can you show me the output of kubectl --namespace=flux -o yaml get deployment ? (gain, please redact whatever you think is needed)

Sure. Here it is:

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "2"
      helm.fluxcd.io/antecedent: flux:helmrelease/flux
    creationTimestamp: "2020-01-23T00:47:52Z"
    generation: 3
    labels:
      app: flux
      chart: flux-1.1.0
      heritage: Tiller
      release: flux
    name: flux
    namespace: flux
    resourceVersion: "9885"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/flux
    uid: 2e4b3d3a-3899-4d09-b7df-d96e8e405099
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: flux
        release: flux
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: flux
          release: flux
      spec:
        containers:
        - args:
          - --log-format=fmt
          - --ssh-keygen-dir=/var/fluxd/keygen
          - --k8s-secret-name=flux-git-deploy
          - --memcached-hostname=flux-memcached
          - --sync-state=git
          - --memcached-service=
          - --git-url=[redacted]
          - --git-branch=master
          - --git-path=k8s/prod,k8s/releases/prod
          - --git-readonly=false
          - --git-user=Flux PROD
          - --git-email=support@weave.works
          - --git-verify-signatures=false
          - --git-set-author=false
          - --git-poll-interval=3m
          - --git-timeout=20s
          - --sync-interval=3m
          - --git-ci-skip=true
          - --git-label=flux-sync-prod
          - --automation-interval=5m
          - --registry-rps=200
          - --registry-burst=125
          - --registry-trace=false
          - --registry-exclude-image=docker.io/*,index.docker.io/*,quay.io/*,k8s.gcr.io/*
          env:
          - name: KUBECONFIG
            value: /root/.kubectl/config
          image: docker.io/fluxcd/flux:1.17.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/flux/v6/identity.pub
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: flux
          ports:
          - containerPort: 3030
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/flux/v6/identity.pub
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /root/.kubectl
            name: kubedir
          - mountPath: /root/.ssh
            name: sshdir
            readOnly: true
          - mountPath: /etc/fluxd/ssh
            name: git-key
            readOnly: true
          - mountPath: /var/fluxd/keygen
            name: git-keygen
        dnsPolicy: ClusterFirst
        nodeSelector:
          beta.kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: flux
        serviceAccountName: flux
        terminationGracePeriodSeconds: 30
        volumes:
        - configMap:
            defaultMode: 420
            name: flux-kube-config
          name: kubedir
        - configMap:
            defaultMode: 384
            name: flux-ssh-config
          name: sshdir
        - name: git-key
          secret:
            defaultMode: 256
            secretName: flux-git-deploy
        - emptyDir:
            medium: Memory
          name: git-keygen
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:05Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:47:52Z"
      lastUpdateTime: "2020-01-23T01:06:50Z"
      message: ReplicaSet "flux-7d4dd69f86" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 3
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      helm.fluxcd.io/antecedent: flux:helmrelease/flux
    creationTimestamp: "2020-01-23T00:47:52Z"
    generation: 2
    labels:
      app: flux-memcached
      chart: flux-1.1.0
      heritage: Tiller
      release: flux
    name: flux-memcached
    namespace: flux
    resourceVersion: "5958"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/flux-memcached
    uid: 4174dac5-26df-40b3-be45-f12801b42ef2
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: flux-memcached
        release: flux
    strategy:
      type: Recreate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: flux-memcached
          release: flux
      spec:
        containers:
        - args:
          - -m 512
          - -p 11211
          - -I 5m
          image: memcached:1.5.20
          imagePullPolicy: IfNotPresent
          name: memcached
          ports:
          - containerPort: 11211
            name: memcached
            protocol: TCP
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            runAsGroup: 11211
            runAsUser: 11211
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        nodeSelector:
          beta.kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:05Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:47:52Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: ReplicaSet "flux-memcached-b59f87d95" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 2
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2020-01-23T00:48:11Z"
    generation: 1
    labels:
      app: helm-operator
      chart: helm-operator-0.5.0
      heritage: Tiller
      release: helm-operator
    name: helm-operator
    namespace: flux
    resourceVersion: "5853"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/helm-operator
    uid: 65e40937-b4c5-4a00-84b4-0fac04440b1e
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: helm-operator
        release: helm-operator
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          checksum/repositories: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
        creationTimestamp: null
        labels:
          app: helm-operator
          release: helm-operator
      spec:
        containers:
        - args:
          - --enabled-helm-versions=v2,v3
          - --log-format=fmt
          - --git-timeout=20s
          - --git-poll-interval=5m
          - --charts-sync-interval=2m
          - --update-chart-deps=true
          - --log-release-diffs=false
          - --workers=2
          - --tiller-namespace=kube-system
          image: docker.io/fluxcd/helm-operator:1.0.0-rc7
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: flux-helm-operator
          ports:
          - containerPort: 3030
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/fluxd/ssh
            name: git-key
            readOnly: true
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: helm-operator
        serviceAccountName: helm-operator
        terminationGracePeriodSeconds: 30
        volumes:
        - name: git-key
          secret:
            defaultMode: 256
            secretName: flux-git-deploy
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:27Z"
      lastUpdateTime: "2020-01-23T00:48:27Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:48:11Z"
      lastUpdateTime: "2020-01-23T00:48:27Z"
      message: ReplicaSet "helm-operator-6b647dd74" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Great, thanks.

I was after --registry-exclude-image=docker.io/*,index.docker.io/*,quay.io/*,k8s.gcr.io/* which is correct.

I need to look into this more deeply.

Did you try cleaning/restarting memcached? Did the errors from those registries go away after that?

(the logs you sent don't show any warmer errors)

NP. Honestly, I've purged and reinstalled Flux a dozen of times already. I believe memcached gets deleted during that process. I've also reinstalled once today and I did not see any errors. Appears to be fine. I'll let you know if they are back for some reason.

Alright, closing for now. Please add a comment if you get errors again.

I'm having what I think is this same issue. I set

helm upgrade -i qa-flux fluxcd/flux --wait \
--namespace qa-flux \
--set git.url=git@github.com:myorg/cloud-services-helm.git,git.branch=flux-test,registry.excludeImage="docker.io/*"

and deleted the memcahced pod.

My log is then flooded with

ts=2020-02-13T13:57:06.666694218Z caller=repocachemanager.go:223 component=warmer canonical_name=index.docker.io/grafana/promtail auth={map[]} warn="manifest for tag master-4f488e7 missing in repository grafana/promtail" impact="flux will fail to auto-release workloads with matching images, ask the repository administrator to fix the inconsistency"

a bunch of times, and eventually

ts=2020-02-13T13:57:09.253442437Z caller=rate_limiter.go:71 component=ratelimiter info="reducing rate limit" host=index.docker.io limit=100.00

ts=2020-02-13T13:57:09.253699491Z caller=repocachemanager.go:215 component=warmer canonical_name=index.docker.io/grafana/promtail auth={map[]} warn="aborting image tag fetching due to rate limiting, will try again later"

ts=2020-02-13T13:57:20.993325998Z caller=warming.go:206 component=warmer updated=grafana/promtail successful=162 attempted=793

ts=2020-02-13T13:57:20.993542333Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"

ts=2020-02-13T13:57:23.006262082Z caller=rate_limiter.go:71 component=ratelimiter info="reducing rate limit" host=index.docker.io limit=50.00

ts=2020-02-13T13:57:23.006642843Z caller=warming.go:180 component=warmer canonical_name=index.docker.io/grafana/grafana auth={map[]} err="requesting tags: Get https://index.docker.io/v2/grafana/grafana/tags/list: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: \"Too Many Requests (HAP429).\\n\""

it shouldn't be querying these images.

Overall, I really just want Flux to update my helm charts and images, not 3rd party ones. We can't just have new versions of infrastructure things like grafana, loki, rabbitmq, etc updating on their own. Basically, if we haven't specified a semver or glob range for an image, there is no need for Flux to ever look at it. Maybe I'm just misunderstanding how Flux works 🤷‍♂

We can't just have new versions of infrastructure things like grafana, loki, rabbitmq, etc updating on their own.

On this: things will only get updated if you've marked the particular workload (Deployment, DaemonSet, ..., HelmRelease) as automated with an annotation. It will in general scan all images it sees mentioned, though.

It will in general scan all images it sees mentioned, though.

Why is this?

I don't really see how this is needed. I'm absolutely certain it's just because I'm ignorant, but I honestly don't get it. Right now it just seems to be a source of confusion and copious time-delay when flux starts up. (asked similar question in #2845 )

Currently waiting 20 minutes + before flux gets through all of this and actually deploys my repo after an initial install.

UPDATE: after waiting more than 30 minutes, and it then failing to clone the repo (context deadline exceeded), I finally deleted the flux container and re-installed with --set registry.includeImage="docker.pkg.github.com/myorg/myrepo/*" Everything seems to be fine now, no waiting for it to scan the entire universe of things already in my cluster (istio, grafana, etc...). I guess I still go back to my main question around why this is not just the way it works by default. Scanning images in my cluster seems completely orthogonal to the main purpose of flux, which is managing deployments for a specific repo, which likely has no bearing on other things already sitting in my cluster (meshes, metrics servers, etc).

It will in general scan all images it sees mentioned, though.

Why is this?

Two reasons, both in a sense accidents of history:

Flux was originally designed to sync our entire cluster, and there were very few workloads that were not automated, or at least under control of flux and therefore potentially automated; so there was not a perceptible gap between "workloads in the cluster" and "workloads in git";
The API used by fluxctl (list-images etc.) was, and still is really, independent of the sync machinery, so the choice of whether to represent everything in the cluster, or just things that are in git, was less clear cut than it might seem now. There's an argument that you may want to have a view of all workloads and images, even if you don't automate them all yet, which makes some sense if you are also building a user interface for this stuff (e.g., Weave Cloud).

In any case, it seems obvious now that this behaviour is at least as surprising and unwelcome as it is helpful, and ought to be behind a flag or something. Hindsight!

Thank you for the explanation @squaremo.

I imagine you'd be open to taking PRs that aim to make the default behavior line up with the typical "new user" expectations?

Namely: flux is about managing my releases, and in fact right now each install of fluxd maps to one single repo. My expectation is that the only images flux would ever need to track are the images in that repo, and only those that I have told flux to automate, and the scanning would only need to capture the tags that match the applied filters (if any).

Does that seem like a fair assumption for desired default behavior? If you agree, I can make a new ticket that's about an implementation change to land us there. I can't necessarily promise I'd be able to successfully implement those changes, but would like to be part of the process.

I imagine you'd be open to taking PRs that aim to make the default behavior line up with the typical "new user" expectations?

Not the default behaviour (for that would be backward-incompatible, though probably not disastrously so), but perhaps the default installation. But yes, open to it certainly.

Your description of the different behaviour is :+1:. It would need some rewiring of internals -- the registry scanning is not party to the contents of the git repo, so there'd need to be a protocol between those two bits. Let's put it on the record -- yes please to a new issue.

...but perhaps the default installation

sorry, yes, that's what I meant. I'll get a new ticket going some time today. Currently have to get my head down on day job stuff :)

thanks for the discussion

fluxcd / flux

Flux queries and lists excluded registry images #2780