fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

flux 1.11.0 no longer syncs without ClusterRole #1830

Closed zeeZ closed 5 years ago

zeeZ commented 5 years ago

I run flux with explicit permissions, as limited as possible and with only a single namespaced Role and --k8s-namespace-whitelist set. After upgrading to 1.11.0 it no longer syncs unless it is able to list virtually everything in the cluster.

This is the ClusterRole I created from sync-loop errors before it was able to sync again. You can tell where I gave up:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    name: flux
  name: flux
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - componentstatuses
  - configmaps
  - endpoints
  - events
  - limitranges
  - namespaces
  - nodes
  - persistentvolumeclaims
  - persistentvolumes
  - pods
  - podtemplates
  - replicationcontrollers
  - "*"
  verbs:
  - list
- apiGroups:
  - apiregistration.k8s.io
  resources:
  - apiservices
  verbs:
  - list
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - ingresses
  - networkpolicies
  - podsecuritypolicies
  - "*"
  verbs:
  - list
- apiGroups:
  - apps
  - events.k8s.io
  - autoscaling
  - batch
  - "*"
  resources:
  - "*"
  verbs:
  - list

The FAQ answers "Can I restrict the namespaces that Flux can see" with "yes, experimental". Sadly, this is no longer the case.

Also name dropping https://github.com/weaveworks/flux/issues/1217 and https://github.com/weaveworks/flux/issues/1471

squaremo commented 5 years ago

Curses, I did not intend this to be the case with #1442, though I admit I wasn't very diligent about trying out this scenario.

Where exactly does it come to a halt, when it's not given a ClusterRole? (what do the logs say?)

2opremio commented 5 years ago

https://github.com/weaveworks/flux/issues/1830 , which should fix this, is complete but pending review

zeeZ commented 5 years ago

Hey, thanks for the responses.

Where exactly does it come to a halt, when it's not given a ClusterRole? (what do the logs say?)

Without ClusterRole:

ts=2019-03-14T13:39:48.868422318Z caller=main.go:165 version=1.11.0
ERROR: logging before flag.Parse: E0314 13:39:49.929945       8 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
ts=2019-03-14T13:39:49.947370986Z caller=main.go:295 component=cluster identity=/etc/fluxd/ssh/identity
ts=2019-03-14T13:39:49.947449236Z caller=main.go:296 component=cluster identity.pub="ssh-rsa ..."
ts=2019-03-14T13:39:49.947527827Z caller=main.go:297 component=cluster host=https://10.3.0.1:443 version=kubernetes-v1.12.5
ts=2019-03-14T13:39:49.947616546Z caller=main.go:309 component=cluster kubectl=/usr/local/bin/kubectl
ts=2019-03-14T13:39:49.949160458Z caller=main.go:319 component=cluster ping=true
ERROR: logging before flag.Parse: E0314 13:39:50.932939       8 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope

The last line is spammed forever after.

After adding the first set of permissions, I updated the repo and ran fluxctl sync:

ts=2019-03-14T13:44:07.898249713Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.11.0
ts=2019-03-14T13:44:31.133643198Z caller=loop.go:103 component=sync-loop event=refreshed url=... branch=... HEAD=beb4159a14847c5d0b0e5d4cbeccb7f3d4da2766
ts=2019-03-14T13:44:31.247826109Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:44:31.250451239Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: componentstatuses is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"componentstatuses\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:45:08.121177099Z caller=warming.go:268 component=warmer info="refreshing image" image=... tag_count=207 to_update=1 of_which_refresh=1 of_which_missing=0
ts=2019-03-14T13:45:08.139291505Z caller=warming.go:364 component=warmer updated=... successful=1 attempted=1
ts=2019-03-14T13:49:07.446622744Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:49:38.983850606Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=beb4159a14847c5d0b0e5d4cbeccb7f3d4da2766
ts=2019-03-14T13:54:07.629381015Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:54:44.119740704Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T13:54:44.336051836Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:54:44.338921916Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
ts=2019-03-14T13:59:07.767724146Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T13:59:49.26397648Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:04:07.889994656Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T14:04:56.89208238Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:05:23.734827732Z caller=loop.go:111 component=sync-loop jobID=1d217122-5fbe-df8e-976f-05db5f03a6f0 state=in-progress
ts=2019-03-14T14:05:31.362681374Z caller=loop.go:123 component=sync-loop jobID=1d217122-5fbe-df8e-976f-05db5f03a6f0 state=done success=true
ts=2019-03-14T14:05:36.499539849Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84
ts=2019-03-14T14:09:08.028520016Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-14T14:10:04.550939503Z caller=loop.go:103 component=sync-loop event=refreshed url=ssh://git@....git branch=... HEAD=84970b52031752ec2790c20802a0f2419f6b4c84

Always the following after a restart with the tag behind head, with varying resources.

caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"
caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: configmaps is forbidden: User \"system:serviceaccount:flux:flux\" cannot list resource \"configmaps\" in API group \"\" at the cluster scope"

Repo tag never moved and nothing was applied. I added that resource, killed the pod and repeated until I added the * to the role. No errors after and it applied and moved the tag.

1830 , which should fix this, is complete but pending review

1668 I assume?

squaremo commented 5 years ago

Brill, thanks for that @zeeZ, most helpful!

squaremo commented 5 years ago

You might have to stick to v1.10.1 for now @zeeZ -- sorry about that :-/

2opremio commented 5 years ago

1668 I assume?

Yeah, sorry

2opremio commented 5 years ago

Now I am thinking that #1668 by itself won't be enough since it doesn't prevent flux from attempting to list cluster-scoped resources.

We need to think about this.

2opremio commented 5 years ago

@zeeZ The fix will be included in the next fix release. For now, you can test whether your issue is definitely fixed by using the image quay.io/weaveworks/flux:master-5f0e9292.

Please reopen this issue if it isn't fixed.

zeeZ commented 5 years ago

@2opremio I actually checked out your branch earlier. With no config change from 1.10.1 to yours sync worked as expected, thank you.

What remains is the following, but it didn't have any impact for me, as there are no CRDs managed by flux:

ERROR: logging before flag.Parse: E0315 11:00:55.601512       9 reflector.go:205] github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "system:serviceaccount:flux:flux" cannot list resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope

This is repeated every second

2opremio commented 5 years ago

Fantastic! I will look into fixing that as well

2opremio commented 5 years ago

@zeeZ Are you getting any other errors? (even if not repeated)

zeeZ commented 5 years ago

No further errors after adding a watch/list CRD cluster role.

2opremio commented 5 years ago

Great, I will try to get a fix for that early next week

zeeZ commented 5 years ago

I've created a sample repo of some of the things I did to lock down Flux, maybe it can be of some use: https://github.com/zeeZ/locked-down-flux

I believe that's as far as I can go without Helm or GC enabled. Removing any of the rules defined will produce some kind of error during common operations, though I haven't played around with it enough to be able to tell where sync is actually affected and what is just noise.

2opremio commented 5 years ago

I've taken a look at the remaining recurring error. It's a tricky one because the client-go library swallows it and handles it internally (logging by default):

func (r *Reflector) Run(stopCh <-chan struct{}) {
    glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
    wait.Until(func() {
        if err := r.ListAndWatch(stopCh); err != nil {
            utilruntime.HandleError(err)
        }
    }, r.period, stopCh)
}

I see a bunch of options:

  1. Create a PR which passes an error-handling function to the controller and reflector (I can try, but I doubt it will succeed).
  2. Create and maintain our own implementation of the controller/reflector (which sounds awful)
  3. Modify runtime.ErrorHandlers to mute Forbidden/NotExist errors (probably a bad idea) or to do some smart error handling (probably another bad idea).

I dealt with a similar problem in Scope before, going for (2) but the error handling wasn't so deep down in the call stack.

@squaremo / @hiddeco thoughts?

squaremo commented 5 years ago

2. Create and maintain our own implementation of the controller/reflector (which sounds awful)

Yes; adapting parts of client-go is usually a quixotic enterprise. If it's much more complicated than the solution in weaveworks/scope, I'd say it's not worth it.

Can we mute glog by doing flag.Parse with some fake command-line options? I'm grasping at straws .. (it's probably better to do 3. instead)

2opremio commented 5 years ago

I went for (3) in the end

2opremio commented 5 years ago

@zeeZ It should be fixed now. I would appreciate it if you could give it a try (quay.io/weaveworks/flux:master-2d4cc4d).

zeeZ commented 5 years ago

After removing the CRD role I still get a constant stream of

ts=2019-03-18T21:05:54.062786645Z caller=main.go:175 type="internal kubernetes error" err="github.com/weaveworks/flux/cluster/kubernetes/cached_disco.go:100: Failed to list *v1beta1.CustomResourceDefinition: customresourcedefinitions.apiextensions.k8s.io is forbidden: User \"system:serviceaccount:flux-system:flux\" cannot list resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the cluster scope"

I did some digging around the IsForbidden || IsNotFound workaround you added, but it seems ReasonForError returns StatusReasonUnknown. I'm not familiar with K8S source, but I believe what we're dealing with here is no metav1 error but a more generic one: https://github.com/kubernetes/client-go/blob/7d04d0e2a0a1a4d4a1cd6baa432a2301492e4e65/tools/cache/reflector.go#L251

While it stings a bit, I can live with allowing CRD listing. My initial issue was with list access to everything in the cluster, which has been resolved thanks to you.

Perhaps documentation could be added with the minimum privileges Flux needs in order to operate properly, though I suspect that would be complicated with helm and GC. Maybe a more restricted minimal example next to deploy?

On a positive note, at least it is not silently firing a request every second that may add up for each instance you run ;)

2opremio commented 5 years ago

Crap, sorry about that. I need to do some further thinking.
