NetApp / trident

Storage orchestrator for containers
Apache License 2.0

The Trident operator fails to install via Helm on Rancher #839

Open lindhe opened 1 year ago

lindhe commented 1 year ago

Describe the bug

When installing the Trident operator from the Helm chart in a Kubernetes cluster managed by Rancher, the operator fails because it is unable to add the PSA label pod-security.kubernetes.io/enforce: privileged to its installation namespace. This is because Rancher has a special admission webhook in place that gates changes to PSA labels, so permission to set them must be granted to the operator's ServiceAccount on top of all the other RBAC rules it needs.
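For context, the label in question is the one named in the error output below. A roughly equivalent manual operation (just a sketch of what the operator attempts via the API, not the operator's actual code) would be:

# sketch: set the PSA enforce label on the operator's namespace by hand
kubectl label namespace trident pod-security.kubernetes.io/enforce=privileged --overwrite

On a Rancher-managed cluster this request is intercepted by the rancher.cattle.io.namespaces admission webhook, which rejects it unless the caller has the extra permission described in the work-around further down.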

Environment

To Reproduce

  1. Have a Rancher-managed RKE2 cluster (though I'm guessing the issue reproduces on any Rancher-managed cluster).
  2. helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
  3. helm install trident netapp-trident/trident-operator --version 23.04.0 --create-namespace --namespace trident
  4. Check the status of the installed CRDs, the trident TridentOrchestrator object, and the deployed pods:

    $ kubectl get crd | grep trident
    tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
    
    $ kubectl -n trident get pods
    NAME                                 READY    STATUS    RESTARTS    AGE
    trident-operator-5789cf4777-nc4vn    1/1      Running   0           7m32s
    
    $ kubectl -n trident get tridentorchestrators trident -o yaml
     […]
     status:
       message: 'Failed to install Trident; err: failed to patch Trident installation namespace
         trident; admission webhook "rancher.cattle.io.namespaces" denied the request:
         Unauthorized'
       namespace: trident
       status: Failed
       version: ""

Expected behavior

I expect it to deploy as it should, without failing. Here's an example of what a successful deployment looks like:

$ kubectl -n trident get pods
NAME                                  READY   STATUS    RESTARTS   AGE
trident-controller-6d7c9c5d8c-wg8zj   6/6     Running   0          4h28m
trident-node-linux-4tk6q              2/2     Running   0          4h28m
trident-node-linux-97rgx              2/2     Running   0          4h28m
trident-node-linux-9jfbh              2/2     Running   0          4h28m
trident-node-linux-btjx6              2/2     Running   0          4h28m
trident-node-linux-n5k75              2/2     Running   0          4h28m
trident-node-linux-vpcgd              2/2     Running   0          4h28m
trident-operator-5789cf4777-66mth     1/1     Running   0          4h29m

$ kubectl get crd | grep trident
tridentbackendconfigs.trident.netapp.io                           2023-07-05T08:09:56Z
tridentbackends.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentmirrorrelationships.trident.netapp.io                      2023-07-05T08:10:00Z
tridentnodes.trident.netapp.io                                    2023-07-05T08:09:58Z
tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
tridentsnapshotinfos.trident.netapp.io                            2023-07-05T08:09:56Z
tridentsnapshots.trident.netapp.io                                2023-07-05T08:09:59Z
tridentstorageclasses.trident.netapp.io                           2023-07-05T08:09:56Z
tridenttransactions.trident.netapp.io                             2023-07-05T08:09:59Z
tridentversions.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentvolumepublications.trident.netapp.io                       2023-07-05T08:09:57Z
tridentvolumereferences.trident.netapp.io                         2023-07-05T08:10:00Z
tridentvolumes.trident.netapp.io                                  2023-07-05T08:09:57Z

Additional context

This was already reported as issue #41191 on Rancher's GitHub. People (understandably) thought it was a bug in Rancher, but in my opinion it's more of a documentation issue on their part.

There's also some information in the operator's pod logs. I don't have them easily available right now, but they basically amount to the same message as the one shown on the TridentOrchestrator object: patching the trident namespace fails because the Rancher admission webhook rancher.cattle.io.namespaces denies the request (Unauthorized).
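If anyone wants to grab those logs themselves, something like this should do it (assuming the operator runs as the trident-operator Deployment shown in the pod listing above):

# fetch recent logs from the operator deployment
kubectl -n trident logs deploy/trident-operator --tail=100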

Work-around

Inspired by a comment on the Rancher issue mentioned above, applying the following manifest and then restarting the operator fixes the issue (the commands I use are sketched after the manifest):

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: trident-operator-psa
rules:
- apiGroups:
  - management.cattle.io
  resources:
  - projects
  verbs:
  - updatepsa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trident-operator-psa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: trident-operator-psa
subjects:
- kind: ServiceAccount
  name: trident-operator
  namespace: trident
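To spell out the steps (the file name here is just whatever you saved the manifest above as):

# grant the extra Rancher RBAC from the manifest above
kubectl apply -f trident-operator-psa.yaml
# restart the operator so it retries the installation
kubectl -n trident rollout restart deployment/trident-operator

After the restart, the operator retries the installation and the TridentOrchestrator gets past the Failed state.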
nheinemans commented 1 year ago

We're running into the same issue after upgrading from Rancher 2.6.11 to 2.7.5. I can confirm that your workaround fixes the issue.

Philbow commented 1 year ago

@lindhe: Thanks for bringing this up and creating the corresponding pull request. I can confirm as well that this solves the issue in my cluster.

Does NetApp have a plan to merge this at some point? Applying these workarounds in automation is a bit cumbersome and unclean.

nheinemans-asml commented 9 months ago

We're still seeing the same issue with Rancher 2.7.9 and Trident 23.10.0. Can we perhaps get an update from NetApp on this issue and the pending PR?