aws-samples / eks-kubeflow-workshop

Kubeflow workshop on EKS, mainly focused on AWS integration examples. Please check the Kubeflow website http://kubeflow.org for other examples.
Apache License 2.0

Editing Clusterrole kubeflow-edit doesn't persist for some reason #50

Closed dalbhanj closed 4 years ago

dalbhanj commented 4 years ago

I was following the Distributed training tutorial (https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/03_Distributed_Training/03_02_Distributed_Training_MPI_Horovod.ipynb) and ran into an issue: edits to the role don't persist, so the recommended policies can't be added.

How to reproduce:

1) Edit the kubeflow-edit ClusterRole and add the policy below in the desired namespace:

kubectl edit clusterrole kubeflow-edit -n kubeflow

# policy to add
- apiGroups:
  - kubeflow.org
  resources:
  - '*'
  verbs:
  - '*'

2) Check back to verify whether the policy exists

Workaround: Here is the solution suggested by @PatrickXYS: edit kubeflow-kubernetes-edit instead and add the recommended policy:

kubectl edit clusterrole kubeflow-kubernetes-edit -n kubeflow
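For reference, a minimal sketch of what the edited kubeflow-kubernetes-edit might look like after appending the rule; the aggregation label shown is an assumption based on the aggregation discussed later in this thread, and the existing rules are omitted:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-kubernetes-edit
  labels:
    # assumed: this label is what rolls the rules up into kubeflow-edit
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules:
# ...existing rules kept as-is...
- apiGroups:
  - kubeflow.org
  resources:
  - '*'
  verbs:
  - '*'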

Jeffwan commented 4 years ago

@PatrickXYS Can you help check why the kubeflow-edit role doesn't work now? Where does kubeflow-kubernetes-edit come from?

Jeffwan commented 4 years ago

This looks like a regression issue. In v0.7.1 at least, kubeflow-edit binds to default-edit and we don't see this problem.

PatrickXYS commented 4 years ago

@Jeffwan From this clusterrole manifest, kubeflow-kubernetes-edit aggregates into the kubeflow-edit ClusterRole, and kubeflow-edit has a RoleBinding with the default-edit service account. That's why editing kubeflow-kubernetes-edit works.

Besides that, I think editing kubeflow-edit should also work, but the problem here is that we can't directly edit the kubeflow-edit ClusterRole with:

kubectl edit clusterrole kubeflow-edit -n eksworkshop

After we edit it, the rule we added doesn't persist when we check it again.

I'll investigate why we can't directly edit the kubeflow-edit ClusterRole.

Jeffwan commented 4 years ago

@PatrickXYS

Then it makes sense.

Note: If you edit that ClusterRole, your changes will be overwritten on API server restart via auto-reconciliation. To avoid that overwriting, either do not manually edit the role, or disable auto-reconciliation.

kubeflow-edit is a ClusterRole with an aggregationRule; that's the reason the change cannot be persisted. Instead, we have to change kubeflow-kubernetes-edit to persist the changes; it will be aggregated into kubeflow-edit.
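To illustrate why the edit doesn't stick, here is a minimal sketch of what an aggregated ClusterRole such as kubeflow-edit looks like (the exact selector is an assumption; only the shape matters). The aggregation controller continuously recomputes the rules field from all ClusterRoles matching the selector, so any rule edited directly into kubeflow-edit gets overwritten:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-edit
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      # assumed selector; matching ClusterRoles contribute their rules
      rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules: []  # managed by the aggregation controller; manual edits here do not persist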

Jeffwan commented 4 years ago

There are two action items:

  1. Refresh all the workshops to update kubeflow-kubernetes-edit instead
  2. If something is missing from No.1 in upstream, file a PR to update kubeflow-kubernetes-edit.

PatrickXYS commented 4 years ago

@Jeffwan

Right, but instead of editing the ClusterRole, I think it would be more efficient to create a ClusterRoleBinding with cluster-admin:

kubectl create clusterrolebinding mpisa --clusterrole=cluster-admin --serviceaccount=$namespace:default-editor

Besides that, I'll add logic to find the current Kubernetes namespace:

def get_current_k8s_namespace():
    """Get the current Kubernetes namespace from the in-cluster service account mount."""
    with open('/var/run/secrets/kubernetes.io/serviceaccount/namespace', 'r') as f:
        return f.read().strip()

Jeffwan commented 4 years ago

> @Jeffwan
>
> Right, but instead of editing the ClusterRole, I think it would be more efficient to create a ClusterRoleBinding with cluster-admin:
>
> kubectl create clusterrolebinding mpisa --clusterrole=cluster-admin --serviceaccount=$namespace:default-editor

There's a security issue: if you bind cluster-admin, that user has full access and can even delete the kubeflow namespace. Why do you want to give such wide permissions to a non-admin user?

> Besides that, I'll add logic to find the current Kubernetes namespace:
>
> def get_current_k8s_namespace():
>     """Get the current Kubernetes namespace from the in-cluster service account mount."""
>     with open('/var/run/secrets/kubernetes.io/serviceaccount/namespace', 'r') as f:
>         return f.read().strip()

What's the motivation? If we don't specify a namespace, the resources we create will be in the user's namespace by default.

PatrickXYS commented 4 years ago

> @Jeffwan Right, but instead of editing the ClusterRole, I think it would be more efficient to create a ClusterRoleBinding with cluster-admin:
>
> kubectl create clusterrolebinding mpisa --clusterrole=cluster-admin --serviceaccount=$namespace:default-editor
>
> There's a security issue: if you bind cluster-admin, that user has full access and can even delete the kubeflow namespace. Why do you want to give such wide permissions to a non-admin user?

Okay, I was thinking this approach would save users from manually editing the ClusterRole, and I could delete the binding at the end of the notebook. Given the security issue, we shouldn't do this.

> Besides that, I'll add logic to find the current Kubernetes namespace:
>
> def get_current_k8s_namespace():
>     """Get the current Kubernetes namespace from the in-cluster service account mount."""
>     with open('/var/run/secrets/kubernetes.io/serviceaccount/namespace', 'r') as f:
>         return f.read().strip()
>
> What's the motivation? If we don't specify a namespace, the resources we create will be in the user's namespace by default.

The reason I want to find the current Kubernetes namespace is that the --serviceaccount argument requires the <namespace>:<name> format.
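For illustration only, here is the binding from the command above expressed as a manifest; it shows where the namespace value would go (cluster-admin is the approach rejected above for security, and the namespace is just an example):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mpisa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # rejected above for security; shown only for the subject format
subjects:
- kind: ServiceAccount
  name: default-editor
  namespace: eksworkshop   # example value; this is what get_current_k8s_namespace() would supply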

Let's do it by editing kubeflow-kubernetes-edit; I'll file a PR with all the required modifications.

Besides that, I'll see if I need to file a PR upstream to update kubeflow-kubernetes-edit.

Jeffwan commented 4 years ago

Yes. @PatrickXYS I understand your concern. I know limiting users' permissions is kind of painful in some cases, but this is the right way to go. We need to find the gaps on the way to a real multi-user Kubeflow cluster.

If we use something like cluster-admin, we hide all the problems. If someone asks for restricted settings, we have to start over.

That's the same as using an Administrator role to deploy Kubeflow and then having users ask for minimum IAM policies.

PatrickXYS commented 4 years ago

I investigated the upstream kubeflow/mpi-operator repo and found that the reason we need to edit kubeflow-kubernetes-edit is that the mpi-operator ClusterRole isn't aggregated into kubeflow-edit. Link

An example is tf-operator (link): it creates a ClusterRole that aggregates into the kubeflow-edit ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-tfjobs-edit
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-tfjobs-admin: "true"
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs
  - tfjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update

That's why we don't have to edit any ClusterRole to create TFJob resources in that notebook.

IMO, mpi-operator should add the same aggregation rule as tf-operator. WDYT @Jeffwan?
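A rough sketch of what such an aggregated ClusterRole for mpi-operator could look like, modeled on the tf-operator example above (the kubeflow-mpijobs-edit name and the mpijobs resource names are assumptions for illustration):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-mpijobs-edit   # assumed name, mirroring kubeflow-tfjobs-edit
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs          # assumed resource names, mirroring tfjobs and tfjobs/status
  - mpijobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - deletecollection
  - patch
  - update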

Jeffwan commented 4 years ago

@PatrickXYS There would be a few components like that. Project owners may not know this. Do you want to submit a PR to https://github.com/kubeflow/manifests?

Jeffwan commented 4 years ago

Make sure mxnet-job also has it. I can't remember whether xgboost-operator has a manifest there; if it exists, update it as well.

PatrickXYS commented 4 years ago

@Jeffwan Sure, let me check and I'll ping you after I've filed a PR.

PatrickXYS commented 4 years ago

Since the PR is merged and we've root-caused the issue, let's close this for now.

PatrickXYS commented 4 years ago

xgboost-operator doesn't have a manifest there. Thus, I'll file PRs for mxnet-operator and mpi-operator to make the same change as tf-operator.

Related kubeflow/manifest issue.