NetApp / trident

Storage orchestrator for containers
Apache License 2.0

The Trident operator fails to install via Helm on Rancher #839

Open · lindhe opened this issue 1 year ago

lindhe commented 1 year ago

Describe the bug

When installing the Trident operator from the Helm chart in a Kubernetes cluster managed by Rancher, the operator fails because it is unable to add the PSA label pod-security.kubernetes.io/enforce: privileged to its installation namespace. This happens because Rancher has a special admission webhook in place that gates changes to PSA labels, and the permission it checks for must be granted to the operator's ServiceAccount on top of all the other RBAC rules it needs.
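
For reference, this is roughly the end state the operator is trying to reach on its own namespace (a minimal sketch; the namespace name matches whatever namespace the chart is installed into):

apiVersion: v1
kind: Namespace
metadata:
  name: trident
  labels:
    # The operator wants its namespace to allow privileged pods, but Rancher's
    # rancher.cattle.io.namespaces webhook rejects the patch unless the operator's
    # ServiceAccount is also authorized through Rancher's own RBAC.
    pod-security.kubernetes.io/enforce: privileged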

Environment

To Reproduce

  1. Have a Rancher-managed RKE2 cluster (but I'm guessing it'll reproduce on any Rancher-managed cluster).
  2. helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
  3. helm install trident netapp-trident/trident-operator --version 23.04.0 --create-namespace --namespace trident
  4. Check the status of the installed CRDs, the trident TridentOrchestrator object, and the deployed pods:

    $ kubectl get crd | grep trident
    tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
    
    $ kubectl -n trident get pods
    NAME                                 READY    STATUS    RESTARTS    AGE
    trident-operator-5789cf4777-nc4vn    1/1      Running   0           7m32s
    
    $ kubectl -n trident get tridentorchestrators trident -o yaml
     […]
     status:
       message: 'Failed to install Trident; err: failed to patch Trident installation namespace
         trident; admission webhook "rancher.cattle.io.namespaces" denied the request:
         Unauthorized'
       namespace: trident
       status: Failed
       version: ""

Expected behavior

I expect the installation to succeed instead of failing. Here's an example of what it looks like when it deploys successfully:

$ kubectl -n trident get pods
NAME                                  READY   STATUS    RESTARTS   AGE
trident-controller-6d7c9c5d8c-wg8zj   6/6     Running   0          4h28m
trident-node-linux-4tk6q              2/2     Running   0          4h28m
trident-node-linux-97rgx              2/2     Running   0          4h28m
trident-node-linux-9jfbh              2/2     Running   0          4h28m
trident-node-linux-btjx6              2/2     Running   0          4h28m
trident-node-linux-n5k75              2/2     Running   0          4h28m
trident-node-linux-vpcgd              2/2     Running   0          4h28m
trident-operator-5789cf4777-66mth     1/1     Running   0          4h29m

$ kubectl get crd | grep trident
tridentbackendconfigs.trident.netapp.io                           2023-07-05T08:09:56Z
tridentbackends.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentmirrorrelationships.trident.netapp.io                      2023-07-05T08:10:00Z
tridentnodes.trident.netapp.io                                    2023-07-05T08:09:58Z
tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
tridentsnapshotinfos.trident.netapp.io                            2023-07-05T08:09:56Z
tridentsnapshots.trident.netapp.io                                2023-07-05T08:09:59Z
tridentstorageclasses.trident.netapp.io                           2023-07-05T08:09:56Z
tridenttransactions.trident.netapp.io                             2023-07-05T08:09:59Z
tridentversions.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentvolumepublications.trident.netapp.io                       2023-07-05T08:09:57Z
tridentvolumereferences.trident.netapp.io                         2023-07-05T08:10:00Z
tridentvolumes.trident.netapp.io                                  2023-07-05T08:09:57Z

Additional context

This was already reported on Rancher's GitHub as issue #41191. People (understandably) thought this was a bug in Rancher, while in my opinion it's more of a documentation issue on their part.

There's also some information available in the operator's pod logs. I don't have them easily available right now, but they basically amount to the same message as the one displayed by the TridentOrchestrator object: it fails to patch the trident namespace because the Rancher admission webhook rancher.cattle.io.namespaces denied the request (Unauthorized).
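
If you need to pull those logs yourself, something like this should work (assuming the operator Deployment keeps the chart's default name):

kubectl -n trident logs deploy/trident-operator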

Work-around

Inspired by this comment from the issue reported on Rancher's GitHub, applying the following manifest and then restarting the operator (see the commands sketched after the manifest) fixes the issue:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: trident-operator-psa
rules:
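# 'updatepsa' is the Rancher-specific verb that the rancher.cattle.io.namespaces
# webhook checks before it allows PSA labels on a project's namespaces to change.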
- apiGroups:
  - management.cattle.io
  resources:
  - projects
  verbs:
  - updatepsa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trident-operator-psa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: trident-operator-psa
subjects:
- kind: ServiceAccount
  name: trident-operator
  namespace: trident
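
For completeness, applying the manifest and bouncing the operator could look roughly like this (assuming it is saved as trident-operator-psa.yaml and the Deployment keeps the chart's default name):

kubectl apply -f trident-operator-psa.yaml
kubectl -n trident rollout restart deployment/trident-operator
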
nheinemans commented 1 year ago

We're running into the same issue after upgrading from Rancher 2.6.11 to 2.7.5. I can confirm that your workaround fixes the issue.

Philbow commented 1 year ago

@lindhe: Thanks for bringing this up and creating the corresponding pull request. I can confirm as well that this solves the issue in my cluster.

Does NetApp have a plan to merge this at some point? Applying these workarounds in automation is a bit cumbersome and unclean.

nheinemans-asml commented 11 months ago

We're still seeing the same issue with Rancher 2.7.9 and Trident 23.10.0. Can we perhaps get an update from NetApp on this issue and the pending PR?

lindhe commented 2 weeks ago

@nheinemans-asml Could you try with v24.10.0? It's apparently resolved there, but I have no idea which PR that was.

betweenclouds commented 2 weeks ago

@lindhe I tested with Rancher v2.9.2 and Trident 24.10.0, and it is still an issue. After applying the workaround it succeeds:

kubectl describe torc trident 

Events:
  Type     Reason      Age                  From                        Message
  ----     ------      ----                 ----                        -------
  Normal   Installing  16m                  trident-operator.netapp.io  Installing Trident
  Warning  Failed      3m45s (x6 over 16m)  trident-operator.netapp.io  Failed to install Trident; err: failed to patch Trident installation namespace netapp-trident; admission webhook "rancher.cattle.io.namespaces" denied the request: Unauthorized
  Normal   Installed   27s                  trident-operator.netapp.io  Trident installed
sjpeeris commented 1 week ago

Hi @betweenclouds, this should have been fixed in 24.10.0 as part of https://github.com/NetApp/trident/commit/5824103a201cb2f1be13f9435e554ad160c829b3

Can you try setting forceInstallRancherClusterRoles: true in helm/trident-operator/values.yaml?
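
If you prefer a values file over --set, that would presumably be just the single key, passed with -f (a sketch; my-values.yaml is whatever file you use, release name and namespace as in the original reproduce steps):

# my-values.yaml
forceInstallRancherClusterRoles: true

helm upgrade --install trident netapp-trident/trident-operator --namespace trident -f my-values.yaml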

betweenclouds commented 1 week ago

@sjpeeris Thank you, with forceInstallRancherClusterRoles=true the installation is successful, but only if I create a namespace named trident. Is this expected behavior?

works:

helm install netapp-trident netapp-trident/trident-operator --version 100.2410.0 --create-namespace --namespace trident --set tridentDebug=true --set forceInstallRancherClusterRoles=true

does not work:

helm install netapp-trident netapp-trident/trident-operator --version 100.2410.0 --create-namespace --namespace netapp-trident --set tridentDebug=true --set forceInstallRancherClusterRoles=true

edit:

The namespace is hard-coded here: https://github.com/NetApp/trident/blob/master/helm/trident-operator/templates/clusterrolebinding-rancher.yaml#L13

instead of being templated from a variable as it is here: https://github.com/NetApp/trident/blob/master/helm/trident-operator/templates/clusterrolebinding.yaml#L10
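
Presumably the subject in the rancher template would need to use the release namespace instead, roughly like this (a sketch based on the non-rancher template, not the actual fix):

subjects:
- kind: ServiceAccount
  name: trident-operator
  namespace: {{ .Release.Namespace }}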

jharrod commented 1 week ago

Hi @betweenclouds, you are correct. That namespace shouldn't be hard-coded. We will have this fixed in the next release. Thanks for pointing that out.