elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

elastic-internal-init-filesystem does not prepare the data directory #6156

Open dobharweim opened 2 years ago

dobharweim commented 2 years ago

Bug Report

What did you do?

I installed the quickstart Elasticsearch cluster from the docs into a namespace managed by operator version 2.5.0.

What did you expect to see?

I expected the elasticsearch pod to start with an initContainer elastic-internal-init-filesystem which would prepare the mounted PVC for the data directory (elasticsearch-data) with the correct ownership and octal permissions.

What did you see instead? Under which circumstances?

Instead, the elastic-internal-init-filesystem container does not change the ownership or permissions of the volume mount, so the data directory is unwritable. Elasticsearch fails with the following error (the logs from elastic-internal-init-filesystem are further below):

k logs quickstart-es-default-0

{"@timestamp":"2022-11-09T12:37:41.381Z", "log.level":"ERROR", "message":"fatal exception while booting Elasticsearch", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.bootstrap.Elasticsearch","elasticsearch.node.name":"quickstart-es-default-0","elasticsearch.cluster.name":"quickstart","error.type":"java.lang.IllegalStateException","error.message":"failed to obtain node locks, tried [/usr/share/elasticsearch/data]; maybe these locations are not writable or multiple nodes were started on the same data path?","error.stack_trace":"java.lang.IllegalStateException: failed to obtain node locks, tried [/usr/share/elasticsearch/data]; maybe these locations are not writable or multiple nodes were started on the same data path?\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:285)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.node.Node.<init>(Node.java:469)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.node.Node.<init>(Node.java:316)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:214)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:214)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)\nCaused by: java.io.IOException: failed to obtain lock on /usr/share/elasticsearch/data\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:230)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:198)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:277)\n\t... 5 more\nCaused by: java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/node.lock\n\tat java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)\n\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)\n\tat java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:825)\n\tat org.apache.lucene.core@9.4.1/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:94)\n\tat org.apache.lucene.core@9.4.1/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)\n\tat org.apache.lucene.core@9.4.1/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)\n\tat org.elasticsearch.server@8.5.0/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:223)\n\t... 7 more\n\tSuppressed: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/node.lock\n\t\tat java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)\n\t\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)\n\t\tat java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)\n\t\tat java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)\n\t\tat java.base/java.nio.file.Files.newByteChannel(Files.java:380)\n\t\tat java.base/java.nio.file.Files.createFile(Files.java:658)\n\t\tat org.apache.lucene.core@9.4.1/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:84)\n\t\t... 10 more\n"}
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/quickstart.log
{"timestamp": "2022-11-09T12:37:41+00:00", "message": "readiness probe failed", "curl_rc": "7"}

ERROR: Elasticsearch exited unexpectedly

k logs quickstart-es-default-0 elastic-internal-init-filesystem

Starting init script
Linking /mnt/elastic-internal/xpack-file-realm/users to /usr/share/elasticsearch/config/users
Linking /mnt/elastic-internal/xpack-file-realm/roles.yml to /usr/share/elasticsearch/config/roles.yml
Linking /mnt/elastic-internal/xpack-file-realm/users_roles to /usr/share/elasticsearch/config/users_roles
Linking /mnt/elastic-internal/elasticsearch-config/elasticsearch.yml to /usr/share/elasticsearch/config/elasticsearch.yml
Linking /mnt/elastic-internal/unicast-hosts/unicast_hosts.txt to /usr/share/elasticsearch/config/unicast_hosts.txt
Linking /mnt/elastic-internal/xpack-file-realm/service_tokens to /usr/share/elasticsearch/config/service_tokens
File linking duration: 0 sec.
Copying /usr/share/elasticsearch/config/ to /mnt/elastic-internal/elasticsearch-config-local/
'/usr/share/elasticsearch/config/elasticsearch-plugins.example.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch-plugins.example.yml'
'/usr/share/elasticsearch/config/elasticsearch.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml'
'/usr/share/elasticsearch/config/http-certs' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs'
'/usr/share/elasticsearch/config/http-certs/..2022_11_09_12_24_31.1370307674' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2022_11_09_12_24_31.1370307674'
'/usr/share/elasticsearch/config/http-certs/..2022_11_09_12_24_31.1370307674/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2022_11_09_12_24_31.1370307674/ca.crt'
'/usr/share/elasticsearch/config/http-certs/..2022_11_09_12_24_31.1370307674/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2022_11_09_12_24_31.1370307674/tls.crt'
'/usr/share/elasticsearch/config/http-certs/..2022_11_09_12_24_31.1370307674/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2022_11_09_12_24_31.1370307674/tls.key'
'/usr/share/elasticsearch/config/http-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data'
'/usr/share/elasticsearch/config/http-certs/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key'
'/usr/share/elasticsearch/config/http-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt'
'/usr/share/elasticsearch/config/http-certs/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt'
'/usr/share/elasticsearch/config/jvm.options' -> '/mnt/elastic-internal/elasticsearch-config-local/jvm.options'
'/usr/share/elasticsearch/config/jvm.options.d' -> '/mnt/elastic-internal/elasticsearch-config-local/jvm.options.d'
'/usr/share/elasticsearch/config/log4j2.file.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.file.properties'
'/usr/share/elasticsearch/config/log4j2.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.properties'
'/usr/share/elasticsearch/config/role_mapping.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/role_mapping.yml'
'/usr/share/elasticsearch/config/roles.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/roles.yml'
'/usr/share/elasticsearch/config/service_tokens' -> '/mnt/elastic-internal/elasticsearch-config-local/service_tokens'
'/usr/share/elasticsearch/config/transport-remote-certs' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs'
'/usr/share/elasticsearch/config/transport-remote-certs/..2022_11_09_12_24_31.2343229866' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..2022_11_09_12_24_31.2343229866'
'/usr/share/elasticsearch/config/transport-remote-certs/..2022_11_09_12_24_31.2343229866/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..2022_11_09_12_24_31.2343229866/ca.crt'
'/usr/share/elasticsearch/config/transport-remote-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data'
'/usr/share/elasticsearch/config/transport-remote-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt'
'/usr/share/elasticsearch/config/unicast_hosts.txt' -> '/mnt/elastic-internal/elasticsearch-config-local/unicast_hosts.txt'
'/usr/share/elasticsearch/config/users' -> '/mnt/elastic-internal/elasticsearch-config-local/users'
'/usr/share/elasticsearch/config/users_roles' -> '/mnt/elastic-internal/elasticsearch-config-local/users_roles'
Empty dir /usr/share/elasticsearch/plugins
Copying /usr/share/elasticsearch/bin/ to /mnt/elastic-internal/elasticsearch-bin-local/
'/usr/share/elasticsearch/bin/elasticsearch' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch'
'/usr/share/elasticsearch/bin/elasticsearch-certgen' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certgen'
'/usr/share/elasticsearch/bin/elasticsearch-certutil' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certutil'
'/usr/share/elasticsearch/bin/elasticsearch-cli' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-cli'
'/usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-create-enrollment-token'
'/usr/share/elasticsearch/bin/elasticsearch-croneval' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-croneval'
'/usr/share/elasticsearch/bin/elasticsearch-env' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env'
'/usr/share/elasticsearch/bin/elasticsearch-env-from-file' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env-from-file'
'/usr/share/elasticsearch/bin/elasticsearch-geoip' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-geoip'
'/usr/share/elasticsearch/bin/elasticsearch-keystore' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-keystore'
'/usr/share/elasticsearch/bin/elasticsearch-node' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-node'
'/usr/share/elasticsearch/bin/elasticsearch-plugin' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-plugin'
'/usr/share/elasticsearch/bin/elasticsearch-reconfigure-node' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-reconfigure-node'
'/usr/share/elasticsearch/bin/elasticsearch-reset-password' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-reset-password'
'/usr/share/elasticsearch/bin/elasticsearch-saml-metadata' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-saml-metadata'
'/usr/share/elasticsearch/bin/elasticsearch-service-tokens' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-service-tokens'
'/usr/share/elasticsearch/bin/elasticsearch-setup-passwords' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-setup-passwords'
'/usr/share/elasticsearch/bin/elasticsearch-shard' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-shard'
'/usr/share/elasticsearch/bin/elasticsearch-sql-cli' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli'
'/usr/share/elasticsearch/bin/elasticsearch-sql-cli-8.5.0.jar' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli-8.5.0.jar'
'/usr/share/elasticsearch/bin/elasticsearch-syskeygen' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-syskeygen'
'/usr/share/elasticsearch/bin/elasticsearch-users' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-users'
Files copy duration: 0 sec.
chown duration: 0 sec.
waiting for the transport certificates (/mnt/elastic-internal/transport-certificates/quickstart-es-default-0.tls.key)
wait duration: 1 sec.
Linking /usr/share/elasticsearch/config/transport-certs/quickstart-es-default-0.tls.crt to /mnt/elastic-internal/elasticsearch-config-local/node-transport-cert/transport.tls.crt
Linking /usr/share/elasticsearch/config/transport-certs/quickstart-es-default-0.tls.crt to /mnt/elastic-internal/elasticsearch-config-local/node-transport-cert/transport.tls.crt
Certs linking duration: 0 sec.
Init script successful
Script duration: 1 sec.

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.5.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

dobharweim commented 2 years ago

I've found a workaround for this issue. I add a securityContext that runs the initContainer as root; the init script seems to detect this and runs the chown step.

New Manifest

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.5.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000
          runAsUser: 1000
          runAsGroup: 0
        initContainers:
        - name: elastic-internal-init-filesystem
          securityContext:
            runAsUser: 0
            runAsGroup: 0
EOF

jeanfabrice commented 1 year ago

The set-default-security-context ECK parameter, which defaults to true, is responsible for automatically adding fsGroup: 1000 to the elasticsearch pod's securityContext, so that Kubernetes automatically changes ownership of the data volume (see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods).

Can you double check the value you are using?

Next, DelegateFSGroupToCSIDriver is a K8s feature gate which delegates the ownership change to the CSI driver. It was alpha / false up to Kubernetes 1.22 and has been beta / true since Kubernetes 1.23. You should check that your CSI driver doesn't have any known issues with this feature (some do, in my personal experience). On Kubernetes 1.23+, you can still force this feature gate to false on the various Kubernetes components.
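
For reference, here is a rough sketch of what forcing the gate off on the kubelet could look like. It assumes the kubelet is configured through a KubeletConfiguration file and that the cluster runs a Kubernetes version where DelegateFSGroupToCSIDriver is still a beta gate. Managed offerings may not expose this file at all, so treat it as illustrative only.

```yaml
# Excerpt of a kubelet configuration file (assumed setup; the path and rollout method vary by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Let the kubelet apply fsGroup ownership itself instead of delegating to the CSI driver
  DelegateFSGroupToCSIDriver: false
```

The kubelet has to be restarted for the change to take effect.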

dobharweim commented 1 year ago

Hi @jeanfabrice, my apologies for the delay in responding, and thank you for your input and direction.

ECK is installed with the chart's default settings, as I should have outlined in the original issue.

DelegateFSGroupToCSIDriver is enabled on my cluster, and I don't know of any known issues with the feature on my provider. Do you have any info on the usual types of issues, or on what I could search for or try to reproduce in this area? Thanks.

jeanfabrice commented 1 year ago

Hey @dobharweim! I would first check whether set-default-security-context is enabled or not from an ECK perspective. If it is, your elasticsearch pods should normally have the securityContext.fsGroup: 1000 automatically configured.
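
As an illustration of that first check, the excerpt below shows roughly what the pod spec is expected to contain when the default security context has been applied. It is a sketch based on the behaviour described above, not captured output; the pod name comes from the original report.

```yaml
# Expected excerpt of: kubectl get pod quickstart-es-default-0 -o yaml
# (illustrative only; other fields omitted)
spec:
  securityContext:
    fsGroup: 1000
```

If fsGroup is missing from the pod-level securityContext, the operator did not add the default and Kubernetes has no group to apply to the data volume.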

To determine whether or not your CSI driver has an issue with DelegateFSGroupToCSIDriver, you can spin up a busybox pod with securityContext.fsGroup: 1000 plus a mounted PVC, then see whether the PVC content gets updated with group 1000 ownership. If it does not, then the delegation is at fault. If it does, then it should work the same with elasticsearch pods.
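
A minimal sketch of the busybox check described above. The names fsgroup-test and fsgroup-test-data, the 1Gi size, and the reliance on the cluster's default StorageClass are all assumptions made for illustration; set storageClassName to the class backing your Elasticsearch volumes if it is not the default.

```yaml
# Hypothetical test resources; all names are made up for this example
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsgroup-test-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-test
spec:
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: busybox
      image: busybox
      # Print numeric owner, group and mode of the mount point, attempt a write, then idle
      command: ["sh", "-c", "ls -ldn /data; touch /data/probe && echo writable || echo not-writable; sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fsgroup-test-data
```

Once the pod is running, kubectl logs fsgroup-test shows the numeric ownership of /data and whether the write succeeded. Group 1000 and a successful write suggest fsGroup is being applied to volumes of that storage class.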

barkbay commented 1 year ago

I expected the elasticsearch pod to start with an initContainer elastic-internal-init-filesystem which would prepare the mounted PVC for the data directory (elasticsearch-data) with the correct ownership and octal permissions.

Setting permissions requires the init container to run as root, which is not the case by default. As stated in the Kubernetes documentation, setting an fsGroup in the Pod securityContext should set the expected permissions without running a container with runAsGroup: 0:

By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted.

For example:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  # uncomment the lines below to copy the specified node labels as pod annotations and use it as an environment variable in the Pods
  #annotations:
  #  eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
  name: elasticsearch-sample
spec:
  version: 8.5.0
  nodeSets:
  - name: default
    config:
      node.store.allow_mmap: false
    podTemplate:
      spec:
        securityContext:
          runAsUser: 3000
          runAsGroup: 0
          fsGroup: 3000

The set-default-security-context ECK parameter, which defaults to true, is responsible for automatically adding fsGroup: 1000 to the elasticsearch pod's securityContext,

Good point, but I think the doc is not up-to-date and the default value has been auto-detect (detection mechanism here) since 2.5.0 (see https://github.com/elastic/cloud-on-k8s/pull/5150/files). I have no idea how it behaves on IBM Kubernetes Service. Is it a "flavor" of OpenShift?
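
To rule out the auto-detection, the setting can be pinned explicitly. The fragment below is a sketch that assumes the operator was installed from the eck-operator Helm chart and that the chart exposes the flag as config.setDefaultSecurityContext; check the values.yaml of your chart version before relying on this key name.

```yaml
# values.yaml excerpt for the eck-operator Helm chart (key name is an assumption to verify)
config:
  # Always add the default pod securityContext (fsGroup: 1000) instead of auto-detecting
  setDefaultSecurityContext: "true"
```

If the pods start correctly with the value pinned to "true", the auto-detection is the likely culprit on this platform.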

usersina commented 9 months ago

I see the exact same issue on a local minikube cluster with ECK version 2.11.1. It can be easily reproduced with a PersistentVolume, as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv-1
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data/manual-pv-1"
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.12.0
  nodeSets:
    - name: default
      count: 1
      podTemplate:
        spec:
          # Uncomment to fix the issue
          #
          # securityContext:
          #   fsGroup: 1000
          #   runAsUser: 1000
          #   runAsGroup: 0
          # initContainers:
          #   - name: elastic-internal-init-filesystem
          #     securityContext:
          #       runAsUser: 0
          #       runAsGroup: 0
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: 2Gi
                  cpu: 2
                limits:
                  memory: 4Gi
                  cpu: 8
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
            storageClassName: manual