cockroachdb / cockroach-operator

k8s operator for CRDB
Apache License 2.0

Permission denied creating the data directory #1016

Open michel-zimmer opened 8 months ago

michel-zimmer commented 8 months ago

I'm stuck with the following error when trying to create any kind of CockroachDB cluster using the operator:

E240215 20:18:40.885312 1 1@cli/clierror/check.go:35  [-] 1  ERROR: connection lost.
E240215 20:18:40.885312 1 1@cli/clierror/check.go:35  [-] 1 +creating data directory: mkdir /cockroach/cockroach-data/auxiliary: permission denied
ERROR: connection lost.

creating data directory: mkdir /cockroach/cockroach-data/auxiliary: permission denied
Failed running "start"

The cluster manifest might look like this:

apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: primary-crdb
spec:
  cockroachDBVersion: v23.1.11
  dataStore:
    pvc:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "1Gi"
        storageClassName: primary-nfs
        volumeMode: Filesystem
  nodes: 3
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 2Gi
  tlsEnabled: true

The storage class uses csi-driver-nfs and produces the following directory tree:

$ ls -lahF /<nfs-csi-dir>/*
/<nfs-csi-dir>/pvc-40b518b5-bccc-4610-b804-0bd2175f5eed:
total 18K
drwxrwsr-x 2 root 1000581000 2 Feb 11 16:26 ./
drwxr-xr-x 6 root root       6 Feb 11 17:07 ../

The CockroachDB pod manifest (kubectl get pods primary-crdb-0 --output yaml) has the following security context:

securityContext:
  fsGroup: 1000581000
  runAsUser: 1000581000

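This would explain why the permissions don't add up: the directory is owned by root with group 1000581000, so the pod's user (1000581000) can only write through group membership, and the pod sets fsGroup but no runAsGroup.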

For comparison, using this storage setup, it is possible to create a working mount like this:

...
containers:
  - name: busybox
    image: busybox:1.28
    command: [ "sh", "-c", "sleep 1h" ]
    volumeMounts:
      - name: data
        mountPath: "/test"
securityContext:
  runAsUser: 2000
  runAsGroup: 2000
  fsGroup: 2000
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test
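
The claim named test that this pod references is not shown here. A minimal sketch of what it might look like, assuming it uses the same primary-nfs storage class and the same 1Gi request as the cluster manifest above (both values are assumptions for this test claim):

```yaml
# Hypothetical PVC backing the busybox test pod.
# Only the name "test" comes from the pod spec above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "1Gi"             # assumed size
  storageClassName: primary-nfs  # assumed: same class as the CrdbCluster
  volumeMode: Filesystem
```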

After creating a file (touch /test/file) from inside the container, the directory tree looks like this:

$ ls -lahF /<nfs-csi-dir>/*
/<nfs-csi-dir>/pvc-730e175e-af46-4e48-b4e4-5a1dd568307d:
total 19K
drwxrwsr-x 2 root 2000 3 Feb 11 17:14 ./
drwxr-xr-x 6 root root 6 Feb 11 17:07 ../
-rw-rw-r-- 1 2000 2000 0 Feb 11 17:14 file

It works because the owner and group all match: the pod runs as 2000:2000, and the directory's group is 2000.

I'm wondering whether the operator should specify runAsGroup, or whether something is unusual about my setup and this should not be necessary at all.
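
For illustration, the CockroachDB pod's security context with runAsGroup added might look like the following. This is only a sketch: it assumes the operator would reuse the fsGroup value for runAsGroup, and whether that resolves the error on this NFS setup is untested.

```yaml
# Hypothetical pod-level securityContext for primary-crdb-0.
# fsGroup and runAsUser are copied from the pod manifest above;
# runAsGroup is the only addition.
securityContext:
  fsGroup: 1000581000
  runAsUser: 1000581000
  runAsGroup: 1000581000  # assumption: same value as fsGroup; absent from the current pod manifest
```

This mirrors the busybox test pod, where runAsUser, runAsGroup, and fsGroup all share one value.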

The locations in the code would be the following:

Even though I don't have much experience in self-hosting storage for Kubernetes, I would say adding runAsGroup is the right idea and I'm happy to create a PR if wanted.

Fred-Ko commented 6 months ago

same issue