Closed demisx closed 4 years ago
ts=2020-01-23T01:16:24.856391686Z caller=images.go:159 component=daemon err="fetching image metadata for index.docker.io/kope/dns-controller: Get https://index.docker.io/v2/kope/dns-controller/tags/list: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: \"Too Many Requests (HAP429).\\n\""
This error is cached by memcached (i.e. it's not coming live from the registry). We save the latest error from the registry in memcached so that we can give some context as to why the image isn't in the cache.
That is confusing (even for me, I will see if I can change that to indicate the error isn't live) but it's not coming from the live registry.
In the meantime, this particular error may be fixed by restarting memcached (note that this will require flux to refill the cache, which can take some time depending on the number of images it needs to scan).
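In case it helps, a minimal sketch of that restart (assuming the chart's default app=flux-memcached label, which matches the flux-memcached Deployment shown later in this thread):

```shell
# Sketch: delete the memcached pod so its Deployment recreates it with an
# empty cache. Flux then has to re-warm the cache, which can take a while.
restart_flux_memcached() {
  ns="${1:-flux}"  # namespace; "flux" is just this thread's install
  kubectl --namespace "$ns" delete pod -l app=flux-memcached
}
# usage: restart_flux_memcached flux
```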
From our Slack conversations it seems you did have some other errors in the logs. Would you mind sharing them?
@2opremio Thank you for your continuous help and support. You guys are excellent! The additional error I see is usually flooding my Flux log at the beginning right after a fresh helm install and then it goes away. I believe you said it was coming from the warmer. A day later, the Flux log seems nice and calm now only reporting relevant sync activity (git, cluster, our image registry).
ts=2020-01-22T23:44:26.163288733Z caller=repocachemanager.go:226 component=warmer canonical_name=index.docker.io/calico/kube-controllers auth={map[]} err="Get https://docker-images-prod.s3.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/78/78faab2397fd5fe10a863dfa6ae8e9d5c539e3f0a3cb1339b9417c35733dd294/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2KUBRXV6NQXE3JGN%2F20200122%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200122T234426Z&X-Amz-Expires=1200&X-Amz-Security-Token=... rate limited: rate: Wait(n=1) would exceed context deadline" ref=calico/kube-controllers:v3.9.2-4-g637703c
I don't think that's a cached error.
What is happening is that you get rate limited until the cache fills up. That is perfectly normal, except you seem to be excluding index.docker.io explicitly as per:
--set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*"
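For what it's worth, the exclusion patterns behave like shell globs matched against the canonical image name, and docker.io/... images canonicalize to index.docker.io/... (as the warmer's canonical_name fields show), which is why both patterns appear in the flag. A rough illustration using shell glob matching, not Flux's actual code:

```shell
# Rough illustration, not Flux internals: exclusion patterns act like
# shell globs against the canonical image name.
matches() {  # matches PATTERN NAME -> exit 0 if NAME matches PATTERN
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

matches 'index.docker.io/*' 'index.docker.io/calico/kube-controllers' \
  && echo "excluded"
matches 'quay.io/*' 'index.docker.io/calico/kube-controllers' \
  || echo "not excluded by this pattern"
```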
Would you mind sharing the Flux Deployment running in the cluster? (as created by Helm). Also, the beginning of the Flux logs would be useful. Please redact anything which could be sensitive.
@2opremio Not sure if this is what you are asking for, but this is how I initially deploy Flux to the cluster. Please note I've set registry.excludeImage to the images that I saw rate-limit errors for and thought should be filtered out, but after reading the docs I'm not clear whether I even need to do so, or whether I should remove this value and run with the defaults. Please advise:
helm install fluxcd/flux \
--name flux \
--namespace flux \
--set git.user="Flux $env_upper" \
--set git.ciSkip=true \
--set git.url=$git_url \
--set git.branch=$flux_git_branch \
--set git.path=$flux_git_path \
--set git.pollInterval=3m \
--set git.label=flux-sync-$env \
--set-file ssh.known_hosts=/tmp/flux_known_hosts \
--set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*" \
--version 1.1.0 \
--atomic
Here is the beginning of the log of a freshly installed Flux. Seems calm now 🤷♂ : flux-log.txt.zip
@demisx can you show me the output of kubectl --namespace=flux -o yaml get deployment? (again, please redact whatever you think is needed)
Sure. Here it is:
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "2"
      helm.fluxcd.io/antecedent: flux:helmrelease/flux
    creationTimestamp: "2020-01-23T00:47:52Z"
    generation: 3
    labels:
      app: flux
      chart: flux-1.1.0
      heritage: Tiller
      release: flux
    name: flux
    namespace: flux
    resourceVersion: "9885"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/flux
    uid: 2e4b3d3a-3899-4d09-b7df-d96e8e405099
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: flux
        release: flux
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: flux
          release: flux
      spec:
        containers:
        - args:
          - --log-format=fmt
          - --ssh-keygen-dir=/var/fluxd/keygen
          - --k8s-secret-name=flux-git-deploy
          - --memcached-hostname=flux-memcached
          - --sync-state=git
          - --memcached-service=
          - --git-url=[redacted]
          - --git-branch=master
          - --git-path=k8s/prod,k8s/releases/prod
          - --git-readonly=false
          - --git-user=Flux PROD
          - --git-email=support@weave.works
          - --git-verify-signatures=false
          - --git-set-author=false
          - --git-poll-interval=3m
          - --git-timeout=20s
          - --sync-interval=3m
          - --git-ci-skip=true
          - --git-label=flux-sync-prod
          - --automation-interval=5m
          - --registry-rps=200
          - --registry-burst=125
          - --registry-trace=false
          - --registry-exclude-image=docker.io/*,index.docker.io/*,quay.io/*,k8s.gcr.io/*
          env:
          - name: KUBECONFIG
            value: /root/.kubectl/config
          image: docker.io/fluxcd/flux:1.17.1
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/flux/v6/identity.pub
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: flux
          ports:
          - containerPort: 3030
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/flux/v6/identity.pub
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /root/.kubectl
            name: kubedir
          - mountPath: /root/.ssh
            name: sshdir
            readOnly: true
          - mountPath: /etc/fluxd/ssh
            name: git-key
            readOnly: true
          - mountPath: /var/fluxd/keygen
            name: git-keygen
        dnsPolicy: ClusterFirst
        nodeSelector:
          beta.kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: flux
        serviceAccountName: flux
        terminationGracePeriodSeconds: 30
        volumes:
        - configMap:
            defaultMode: 420
            name: flux-kube-config
          name: kubedir
        - configMap:
            defaultMode: 384
            name: flux-ssh-config
          name: sshdir
        - name: git-key
          secret:
            defaultMode: 256
            secretName: flux-git-deploy
        - emptyDir:
            medium: Memory
          name: git-keygen
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:05Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:47:52Z"
      lastUpdateTime: "2020-01-23T01:06:50Z"
      message: ReplicaSet "flux-7d4dd69f86" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 3
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      helm.fluxcd.io/antecedent: flux:helmrelease/flux
    creationTimestamp: "2020-01-23T00:47:52Z"
    generation: 2
    labels:
      app: flux-memcached
      chart: flux-1.1.0
      heritage: Tiller
      release: flux
    name: flux-memcached
    namespace: flux
    resourceVersion: "5958"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/flux-memcached
    uid: 4174dac5-26df-40b3-be45-f12801b42ef2
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: flux-memcached
        release: flux
    strategy:
      type: Recreate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: flux-memcached
          release: flux
      spec:
        containers:
        - args:
          - -m 512
          - -p 11211
          - -I 5m
          image: memcached:1.5.20
          imagePullPolicy: IfNotPresent
          name: memcached
          ports:
          - containerPort: 11211
            name: memcached
            protocol: TCP
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            runAsGroup: 11211
            runAsUser: 11211
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        nodeSelector:
          beta.kubernetes.io/os: linux
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:05Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:47:52Z"
      lastUpdateTime: "2020-01-23T00:48:05Z"
      message: ReplicaSet "flux-memcached-b59f87d95" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 2
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2020-01-23T00:48:11Z"
    generation: 1
    labels:
      app: helm-operator
      chart: helm-operator-0.5.0
      heritage: Tiller
      release: helm-operator
    name: helm-operator
    namespace: flux
    resourceVersion: "5853"
    selfLink: /apis/extensions/v1beta1/namespaces/flux/deployments/helm-operator
    uid: 65e40937-b4c5-4a00-84b4-0fac04440b1e
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: helm-operator
        release: helm-operator
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          checksum/repositories: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
        creationTimestamp: null
        labels:
          app: helm-operator
          release: helm-operator
      spec:
        containers:
        - args:
          - --enabled-helm-versions=v2,v3
          - --log-format=fmt
          - --git-timeout=20s
          - --git-poll-interval=5m
          - --charts-sync-interval=2m
          - --update-chart-deps=true
          - --log-release-diffs=false
          - --workers=2
          - --tiller-namespace=kube-system
          image: docker.io/fluxcd/helm-operator:1.0.0-rc7
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: flux-helm-operator
          ports:
          - containerPort: 3030
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 3030
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/fluxd/ssh
            name: git-key
            readOnly: true
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: helm-operator
        serviceAccountName: helm-operator
        terminationGracePeriodSeconds: 30
        volumes:
        - name: git-key
          secret:
            defaultMode: 256
            secretName: flux-git-deploy
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2020-01-23T00:48:27Z"
      lastUpdateTime: "2020-01-23T00:48:27Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-23T00:48:11Z"
      lastUpdateTime: "2020-01-23T00:48:27Z"
      message: ReplicaSet "helm-operator-6b647dd74" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Great, thanks.
I was after --registry-exclude-image=docker.io/*,index.docker.io/*,quay.io/*,k8s.gcr.io/* which is correct.
I need to look into this more deeply.
Did you try cleaning/restarting memcached? Did the errors from those registries go away after that?
(the logs you sent don't show any warmer errors)
NP. Honestly, I've purged and reinstalled Flux a dozen times already. I believe memcached gets deleted during that process. I also reinstalled once today and did not see any errors. Appears to be fine. I'll let you know if they come back for some reason.
Alright, closing for now. Please add a comment if you get errors again.
I'm having what I think is this same issue. I set
helm upgrade -i qa-flux fluxcd/flux --wait \
--namespace qa-flux \
--set git.url=git@github.com:myorg/cloud-services-helm.git,git.branch=flux-test,registry.excludeImage="docker.io/*"
and deleted the memcached pod.
My log is then flooded with
ts=2020-02-13T13:57:06.666694218Z caller=repocachemanager.go:223 component=warmer canonical_name=index.docker.io/grafana/promtail auth={map[]} warn="manifest for tag master-4f488e7 missing in repository grafana/promtail" impact="flux will fail to auto-release workloads with matching images, ask the repository administrator to fix the inconsistency"
a bunch of times, and eventually
ts=2020-02-13T13:57:09.253442437Z caller=rate_limiter.go:71 component=ratelimiter info="reducing rate limit" host=index.docker.io limit=100.00
ts=2020-02-13T13:57:09.253699491Z caller=repocachemanager.go:215 component=warmer canonical_name=index.docker.io/grafana/promtail auth={map[]} warn="aborting image tag fetching due to rate limiting, will try again later"
ts=2020-02-13T13:57:20.993325998Z caller=warming.go:206 component=warmer updated=grafana/promtail successful=162 attempted=793
ts=2020-02-13T13:57:20.993542333Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2020-02-13T13:57:23.006262082Z caller=rate_limiter.go:71 component=ratelimiter info="reducing rate limit" host=index.docker.io limit=50.00
ts=2020-02-13T13:57:23.006642843Z caller=warming.go:180 component=warmer canonical_name=index.docker.io/grafana/grafana auth={map[]} err="requesting tags: Get https://index.docker.io/v2/grafana/grafana/tags/list: error parsing HTTP 429 response body: invalid character 'T' looking for beginning of value: \"Too Many Requests (HAP429).\\n\""
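Side note on the "reducing rate limit" lines: they suggest the warmer halves its per-host requests-per-second budget each time the registry answers 429 (limit=100.00, then 50.00 above). A toy sketch of that pattern, with the starting limit of 200 assumed purely for illustration, not taken from Flux internals:

```shell
# Toy sketch (behavior inferred from the log above, not Flux's actual code):
# halve the per-host rate limit on each HTTP 429 response.
limit=200
for ratelimited_response in 429 429; do
  limit=$(( limit / 2 ))
  echo "reducing rate limit: host=index.docker.io limit=$limit"
done
```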
It shouldn't be querying these images.
Overall, I really just want Flux to update my helm charts and images, not 3rd party ones. We can't just have new versions of infrastructure things like grafana, loki, rabbitmq, etc updating on their own. Basically, if we haven't specified a semver or glob range for an image, there is no need for Flux to ever look at it. Maybe I'm just misunderstanding how Flux works 🤷♂
We can't just have new versions of infrastructure things like grafana, loki, rabbitmq, etc updating on their own.
On this: things will only get updated if you've marked the particular workload (Deployment, DaemonSet, ..., HelmRelease) as automated with an annotation. It will in general scan all images it sees mentioned, though.
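For context, automation in Flux v1 is opted into per workload with annotations; a minimal hypothetical example (the workload and container names here are made up, and the tag filter is optional):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                               # hypothetical workload
  annotations:
    fluxcd.io/automated: "true"              # opt this workload into automation
    fluxcd.io/tag.my-container: semver:~1.0  # only consider tags matching ~1.0
spec:
  template:
    spec:
      containers:
      - name: my-container
        image: docker.io/myorg/my-app:1.0.0
```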
It will in general scan all images it sees mentioned, though.
Why is this?
I don't really see how this is needed. I'm absolutely certain it's just because I'm ignorant, but I honestly don't get it. Right now it just seems to be a source of confusion and a copious time delay when flux starts up (I asked a similar question in #2845).
Currently I'm waiting 20+ minutes before flux gets through all of this and actually deploys my repo after an initial install.
UPDATE: after waiting more than 30 minutes and then having it fail to clone the repo (context deadline exceeded), I finally deleted the flux container and re-installed with
--set registry.includeImage="docker.pkg.github.com/myorg/myrepo/*"
Everything seems to be fine now, no waiting for it to scan the entire universe of things already in my cluster (istio, grafana, etc...). I guess I still go back to my main question around why this is not just the way it works by default. Scanning images in my cluster seems completely orthogonal to the main purpose of flux, which is managing deployments for a specific repo, which likely has no bearing on other things already sitting in my cluster (meshes, metrics servers, etc).
It will in general scan all images it sees mentioned, though.
Why is this?
Two reasons, both in a sense accidents of history:
1. Flux was originally designed to sync our entire cluster, and there were very few workloads that were not automated, or at least under the control of flux and therefore potentially automated; so there was no perceptible gap between "workloads in the cluster" and "workloads in git".
2. The API used by fluxctl (list-images etc.) was, and still is really, independent of the sync machinery, so the choice of whether to represent everything in the cluster, or just the things that are in git, was less clear-cut than it might seem now. There's an argument that you may want a view of all workloads and images even if you don't automate them all yet, which makes some sense if you are also building a user interface for this stuff (e.g., Weave Cloud).
In any case, it seems obvious now that this behaviour is at least as surprising and unwelcome as it is helpful, and ought to be behind a flag or something. Hindsight!
Thank you for the explanation @squaremo.
I imagine you'd be open to taking PRs that aim to make the default behavior line up with the typical "new user" expectations?
Namely: flux is about managing my releases, and in fact right now each install of fluxd maps to one single repo. My expectation is that the only images flux would ever need to track are the images in that repo, and only those that I have told flux to automate, and the scanning would only need to capture the tags that match the applied filters (if any).
Does that seem like a fair assumption for desired default behavior? If you agree, I can make a new ticket that's about an implementation change to land us there. I can't necessarily promise I'd be able to successfully implement those changes, but would like to be part of the process.
I imagine you'd be open to taking PRs that aim to make the default behavior line up with the typical "new user" expectations?
Not the default behaviour (for that would be backward-incompatible, though probably not disastrously so), but perhaps the default installation. But yes, open to it certainly.
Your description of the different behaviour is :+1:. It would need some rewiring of internals -- the registry scanning is not party to the contents of the git repo, so there'd need to be a protocol between those two bits. Let's put it on the record -- yes please to a new issue.
...but perhaps the default installation
sorry, yes, that's what I meant. I'll get a new ticket going some time today. Currently have to get my head down on day job stuff :)
thanks for the discussion
Describe the bug
I have instructed Flux to exclude these images during Helm install:
--set registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*"
However, the flux log still shows flux trying to access excluded images, and
fluxctl list-images --k8s-fwd-ns=flux
lists a bunch of images, including ones that I'd expect to be excluded, many with "(untagged) image data not available".
Expected behavior
By setting
registry.excludeImage="docker.io/*\,index.docker.io/*\,quay.io/*\,k8s.gcr.io/*"
I'd expect that Flux would not query or list images matching the exclusion list.
Additional context
1.17.1
1.15.6