deis / controller

Deis Workflow Controller (API)
https://deis.com
MIT License

RBAC support #1292

Closed Bregor closed 7 years ago

Bregor commented 7 years ago

With this change, deis-controller can work in RBAC-only clusters.

Works with both Kubernetes 1.5 and 1.6 (see templates/_helpers.tmpl for details). Tested against 1.5.7 and 1.6.4.
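A hedged sketch of how that version switch in _helpers.tmpl could work (the actual helper contents may differ): Kubernetes 1.6 serves `rbac.authorization.k8s.io/v1beta1`, while 1.5 only serves `v1alpha1`, so a Helm capabilities check can pick the right one.

```
{{- /* Sketch only: pick the RBAC apiVersion the cluster actually serves */}}
{{- define "controller.rbacAPIVersion" -}}
{{- if .Capabilities.APIVersions.Has "rbac.authorization.k8s.io/v1beta1" -}}
rbac.authorization.k8s.io/v1beta1
{{- else -}}
rbac.authorization.k8s.io/v1alpha1
{{- end -}}
{{- end -}}
```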

ClusterRole allows deis-controller:

deis-admin commented 7 years ago

Thanks for the contribution! Please ensure your commits follow our style guide. This code will be tested once a Deis maintainer reviews it.

vdice commented 7 years ago

Jenkins, OK to test

vdice commented 7 years ago

@Bregor thanks for this feature addition!

Before I dive into testing on an RBAC-only cluster, I wanted to make sure installing charts on non-RBAC clusters (as CI does) still works. As it stands, installation fails either because the necessary API version is missing (from the CI cluster, which looks like v1.5.3):

Error: apiVersion "rbac.authorization.k8s.io/v1alpha1" in workflow/charts/controller/templates/controller-clusterrolebinding.yaml is not available

or the cluster otherwise isn't RBAC-ready (the following from my v1.6.0 mk cluster):

 $ helm install --wait workflow-pr/workflow --namespace=deis --version v2.14.1-20170508161641-sha.3d6bbf9 --set fluentd.docker_tag=git-740bd61 --name deis-workflow
Error: release deis-workflow failed: clusterroles.rbac.authorization.k8s.io "deis:deis-logger-fluentd" is forbidden: attempt to grant extra privileges: [{[list] [] [pods] [] []} {[get] [] [pods] [] []} {[watch] [] [pods] [] []}] user=&{system:serviceaccount:kube-system:default e9349b4a-311c-11e7-bc10-7665e4a656ee [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated] map[]} ownerrules=[{[create] [authorization.k8s.io] [selfsubjectaccessreviews] [] []} {[get] [] [] [] [/api /api/* /apis /apis/* /healthz /swaggerapi /swaggerapi/* /version]}] ruleResolutionErrors=[]

Therefore, I think the first priority is to wrap the template generation in a feature flag, perhaps global.use_rbac or similar, defaulting to false (or commented out). When false or unset, the templates added here should not be generated; when true, they would be.

How does that sound?
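A sketch of what such a guard might look like in the chart (the flag name `global.use_rbac` and the manifest shown are assumptions, not the actual chart code):

```
{{- /* controller-clusterrolebinding.yaml: render only when the flag is set.
       Flag name (global.use_rbac) is hypothetical, per the suggestion above. */}}
{{- if .Values.global.use_rbac }}
apiVersion: rbac.authorization.k8s.io/v1alpha1
kind: ClusterRoleBinding
# ... manifest body unchanged ...
{{- end }}
```

With the flag false (the default), the manifest renders empty, so non-RBAC clusters like the CI one would be unaffected.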

vdice commented 7 years ago

Jenkins, add to whitelist

mboersma commented 7 years ago

Looks like we need a specific delete permission on a namespace as well. At least, when I run deis destroy, I get an error when the controller tries to delete the namespace in an RBAC environment.

172.17.0.17 "POST /v2/apps/ HTTP/1.1" 201 170 "Deis Client dev-209c985"
INFO [aerial-yodeling]: deleting environment
ERROR:root:Uncaught Exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 480, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 93, in destroy
    self.perform_destroy(instance)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 97, in perform_destroy
    instance.delete()
  File "/app/api/models/app.py", line 260, in delete
    self._scheduler.ns.delete(self.id)
  File "/app/scheduler/resources/namespace.py", line 52, in delete
    raise KubeHTTPException(response, 'delete Namespace "{}"', namespace)
  File "/app/scheduler/exceptions.py", line 10, in __init__
    data = response.json()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 866, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
172.17.0.17 "DELETE /v2/apps/aerial-yodeling/ HTTP/1.1" 500 25 "Deis Client dev-209c985"
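Note that the JSONDecodeError in the traceback is a secondary failure: KubeHTTPException.__init__ calls response.json() on an error body that is not JSON (an RBAC 403 can come back as plain text), which masks the real Kubernetes error. A minimal defensive sketch (hypothetical helper, not the actual Deis code):

```python
import json

def parse_error_body(text):
    """Hypothetical helper: decode a Kubernetes error body, falling back
    to wrapping the raw text when the body is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Preserve the original error message instead of raising
        return {"message": text}
```

This way the exception would carry the real "forbidden" message rather than `Expecting value: line 1 column 1 (char 0)`.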
Bregor commented 7 years ago

@mboersma added and rebased

codecov-io commented 7 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@4b015cd). Click here to learn what that means. The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #1292   +/-   ##
=========================================
  Coverage          ?   87.04%           
=========================================
  Files             ?       45           
  Lines             ?     3929           
  Branches          ?      681           
=========================================
  Hits              ?     3420           
  Misses            ?      338           
  Partials          ?      171
Bregor commented 7 years ago

@mboersma, a quick note about namespace deletion: RBAC lets you grant a specific user the ability to create and delete namespaces, BUT you can't limit deletion to particular namespaces, so such a user would have permission to delete any namespace in the cluster. The only way to "hack" around it is to create a role (just a Role, not a ClusterRole) granting namespace deletion in every namespace that belongs to the user, and then bind each of those roles to the user. For this reason I intentionally disabled namespace deletion from the beginning. I hope we will find a proper workaround for this "issue" eventually :)
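The per-namespace workaround described would look roughly like this (all names and the namespace are hypothetical; a Role only grants rights within its own namespace, which is what keeps the deletion scoped):

```yaml
# Sketch of the per-namespace workaround, not part of this PR.
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: namespace-deleter
  namespace: some-app
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: deis-controller-namespace-deleter
  namespace: some-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: namespace-deleter
subjects:
- kind: ServiceAccount
  name: deis-controller
  namespace: deis
```

The obvious drawback is that this pair must be created for every app namespace, which is why it was not done here.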

mboersma commented 7 years ago

For this reason I intentionally disabled namespace deletion from the beginning.

Ah, OK--that makes a lot more sense after your explanation. It's too bad the namespace-delete permission is global. It was a shortcut to have the controller just delete the entire app namespace in the first place.

It would probably be a smarter implementation to delete just the objects that controller actually created, and to print a warning if that all succeeds but the namespace deletion fails. But for now this makes RBAC workable for Workflow.
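The suggested implementation could be sketched like this (hypothetical code, not the actual controller; the `scheduler` attributes are assumed for illustration): delete the controller-created objects first, then attempt the namespace delete and only warn on failure.

```python
import logging

logger = logging.getLogger(__name__)

def destroy_app(scheduler, app_id):
    """Hypothetical sketch of the suggested cleanup order: remove the
    objects the controller created, then try the namespace delete and
    only warn if that final step is forbidden."""
    # Objects the controller itself created for the app
    for resource in (scheduler.deployment, scheduler.service, scheduler.secret):
        resource.delete(app_id)
    try:
        scheduler.ns.delete(app_id)
    except Exception as exc:
        # RBAC may forbid global namespace deletion; the app is still cleaned up
        logger.warning("removed objects for %s, but namespace delete failed: %s",
                       app_id, exc)
```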

@Bregor I'll test more tomorrow. I heard you're vacationing for a week or so--enjoy! We will keep testing this and let you know if we find other issues. There should be plenty of time to get this merged for the next release. Thanks again!

kartojal commented 7 years ago

EDIT: Fixed in my case after adding the verb "get" to the "pods" resource in the deis-controller RBAC ClusterRole.

The error was a 403 at URL /api/v1/namespaces/some-namespace/pods/some-pod-13gfabcfg2


I applied the RBAC rules, and all seems to be OK except for one thing. I was trying to update a release with a second "git push deis some-app", and no matter whether it fails or succeeds, the controller logs always print the following message:

INFO [some-app]: waited 30s and 1 pods are in service
INFO:scheduler:[some-app]: waited 30s and 1 pods are in service
ERROR:root:Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/app/api/models/build.py", line 65, in create
    self.app.deploy(new_release)
  File "/app/api/models/app.py", line 572, in deploy
    release.cleanup_old()
  File "/app/api/models/release.py", line 307, in cleanup_old
    self._scheduler.pod.delete(self.app.id, pod['metadata']['name'])
  File "/app/scheduler/resources/pod.py", line 350, in delete
    pod = self.pod.get(namespace, name).json()
  File "/app/scheduler/resources/pod.py", line 32, in get
    raise KubeHTTPException(response, message, *args)
  File "/app/scheduler/exceptions.py", line 10, in __init__
    data = response.json()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 866, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 480, in dispatch
    response = handler(request, *args, **kwargs)
  File "/app/api/views.py", line 527, in create
    super(BuildHookViewSet, self).create(request, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/app/api/viewsets.py", line 21, in perform_create
    self.post_save(obj)
  File "/app/api/views.py", line 533, in post_save
    build.create(self.user)
  File "/app/api/models/build.py", line 79, in create
    raise DeisException(str(e)) from e
api.exceptions.DeisException: Expecting value: line 1 column 1 (char 0)

Sometimes it does not delete the original pod and creates an extra one, leaving 2 pods running when only 1 is required by the deployment. The output of git push deis some-app after a good deployment without pod failures:

remote: 2017/05/18 10:44:33 Error running git receive hook [The controller returned an error when publishing the release: Unknown Error (400): {"detail":"Expecting value: line 1 column 1 (char 0)"}]
To ssh://git@deis-builder.10.108.103.232.nip.io:2222/some-app.git
 ! [remote rejected] some-app -> some-app (pre-receive hook declined)
error: failed to push some refs to 'ssh://git@deis-builder.10.108.103.232.nip.io:2222/some-app.git'

So it marks a good release as a "failed" one: "release h12f1ff1df23 which failed"

My RBAC controller rules look like this, maybe something is missing:

rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - get
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - list
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - list
  - get
  - create
  - update
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  verbs:
  - list
  - get
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - extensions
  resources:
  - replicasets
  verbs:
  - list
  - delete
  - update
- apiGroups:
  - extensions
  - apps
  resources:
  - deployments
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - extensions
  resources:
  - deployments/scale
  - replicasets/scale
  verbs:
  - get
  - update
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - delete
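For reference, applying the fix from the EDIT above, the pods rule would become (only the `get` verb is new):

```yaml
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get    # added; fixes the 403 on GET .../pods/<pod> during release update
  - list
  - delete
```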
Bregor commented 7 years ago

@kartojal you mean deis-controller needs the ability to get pods in all namespaces (ClusterRole), not only in Workflow's parent namespace? And a second question: do you have access to the apiserver's audit.log? If so, could you please grep system:serviceaccount:deis:deis-controller /path/to/audit.log while trying to update a release? This way we can catch missing abilities in the Role/ClusterRole. Unfortunately I can't do it myself until May 23, because I don't have access to my clusters while on vacation. Thanks in advance!

kartojal commented 7 years ago

I could check that again tomorrow morning, but yes, allowing the controller to read pods in other namespaces fixed my problem and the 403 (Forbidden) error when updating a release. It seems that when you update a release, the controller needs to get the old release's pod information to check whether the old pod has been deleted after the deployment of the new release, if I'm not wrong. Anyway, I'll debug again tomorrow and report my findings.

Have fun!

vdice commented 7 years ago

@Bregor I believe I've reproduced the same issue mentioned by @kartojal above. Grepping /audit.log shows:

$ grep system:serviceaccount:deis:deis-controller /audit.log | grep pods
2017-05-25T00:09:34.153574773Z AUDIT: id="a5a3be49-c646-4652-b498-59c897c5afc4" ip="172.17.0.13" method="GET" user="system:serviceaccount:deis:deis-controller" groups="\"system:serviceaccounts\",\"system:serviceaccounts:deis\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="mk-edp" uri="/api/v1/namespaces/mk-edp/pods?labelSelector=heritage%3Ddeis"

which correlates to the fix mentioned in https://github.com/deis/controller/pull/1292#issuecomment-302389070, namely adding the get verb for the pods resource. Shall we add this in?

edit: here's the full audit.log output particular to the deis-controller service account, in case it helps: https://gist.github.com/vdice/5648dc621a7594273db7b5b167a534f2

Bregor commented 7 years ago

Sure, thanks for testing, guys! I'll rebase and push ASAP

Bregor commented 7 years ago

@vdice, @mboersma I carefully read all the code in https://github.com/deis/controller/tree/master/rootfs/scheduler/resources and updated the ClusterRole accordingly.

vdice commented 7 years ago

Things look good after an additional round of testing, with specific focus on items relating to updated permissions in latest commits:

Planning on making sure others see the same and hopefully merging this along w/ the rest of the RBAC PRs on Tuesday (US holiday here on Monday).