druid-io / druid-operator

Druid Kubernetes Operator

OOMKilled issue using example cluster #174

Closed. EwanValentine closed this issue 3 years ago.

EwanValentine commented 3 years ago

Hey there,

I'm attempting to use the operator on a Kubernetes cluster (EKS). I'm using a couple of m5.large EC2 instances, but I'm getting an OOMKilled error. Here's my config:

apiVersion: "druid.apache.org/v1alpha1"
kind: "Druid"
metadata:
  name: druid-cluster
spec:
  image: apache/druid:0.21.0
  startScript: /druid.sh
  serviceAccount: "druid-scaling-spike"
  nodeSelector:
    service: ewanstenant-druid
  tolerations:
    - key: 'dedicated'
      operator: 'Equal'
      value: 'ewanstenant-druid'
      effect: 'NoSchedule'
  readinessProbe:
    httpGet:
      path: /status/health
      port: 8088
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
    runAsGroup: 1000
  services:
    - spec:
        type: ClusterIP
        clusterIP: None
  commonConfigMountPath: "/opt/druid/var/conf/druid/cluster/_common"
  jvm.options: |-
    -server
    -XX:MaxDirectMemorySize=10240g
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Dlog4j.debug
    -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
    -Djava.io.tmpdir=/opt/druid/var/tmp
  log4j.config: |-
    <?xml version="1.0" encoding="UTF-8" ?>
    <Configuration status="WARN">
        <Appenders>
            <Console name="Console" target="SYSTEM_OUT">
                <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
            </Console>
        </Appenders>
        <Loggers>
            <Root level="info">
                <AppenderRef ref="Console"/>
            </Root>
        </Loggers>
    </Configuration>
  common.runtime.properties: |

    # Zookeeper
    druid.zk.service.enabled=false
    druid.zk.service.host=tiny-cluster-zk.druid.svc.cluster.local
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false

    # Metadata Store
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/druid/data/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true

    #
    # Extensions
    #
    druid.extensions.loadList=["postgresql-metadata-storage", "druid-s3-extensions", "druid-kafka-indexing-service"]

    # Deep Storage
    druid.storage.type=s3
    druid.storage.bucket=druid-scaling-spike-deepstore
    druid.storage.baseKey=druid
    druid.storage.sse.type=s3
    druid.storage.disableAcl=true

    #
    # Service discovery
    #
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator

    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/opt/druid/var/data/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  volumeMounts:
    - mountPath: /opt/druid/var
      name: data-volume
  volumes:
    - name: data-volume
      emptyDir: {}
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace

  nodes:
    brokers:
      # Optionally specify for running broker as Deployment
      kind: Deployment
      nodeType: "broker"
      # Optionally specify for broker nodes
      # imagePullSecrets:
      # - name: tutu
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/var/conf/druid/cluster/query/broker"
      replicas: 1
      runtime.properties: |
        druid.service=druid/broker

        # HTTP server threads
        druid.broker.http.numConnections=5
        druid.server.http.numThreads=10

        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=1
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=1
        druid.sql.enable=true
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M
      ingressAnnotations:
          "nginx.ingress.kubernetes.io/rewrite-target": "/"
      ingress:
        rules:
          - http:
              paths:
              - path: /brokers
                backend:
                  serviceName: druid-druid-cluster-brokers
                  servicePort: 8088
      hpAutoscaler:
        maxReplicas: 3
        minReplicas: 1
        scaleTargetRef:
          apiVersion: apps/v1
          kind: StatefulSet
          name: druid-tiny-cluster-brokers
        metrics:
        - type: Resource
          resource:
            name: cpu
            targetAverageUtilization: 50
      resources:
        requests:
          memory: 1Gi
          cpu: 256m
        limits:
          memory: 4Gi
          cpu: 512m

    coordinators:
      # Optionally specify for running coordinator as Deployment
      kind: Deployment
      nodeType: "coordinator"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/var/conf/druid/cluster/master/coordinator-overlord"
      replicas: 1
      runtime.properties: |
        druid.service=druid/coordinator

        # HTTP server threads
        druid.coordinator.startDelay=PT30S
        druid.coordinator.period=PT30S

        # Configure this coordinator to also run as Overlord
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
        druid.indexer.queue.startDelay=PT30S
        druid.indexer.runner.type=local
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 256m
          memory: 256Mi

    historicals:
      kind: Deployment
      nodeType: "historical"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/var/conf/druid/cluster/data/historical"
      replicas: 1
      runtime.properties: |
        druid.service=druid/historical
        druid.server.http.numThreads=5
        druid.processing.buffer.sizeBytes=536870912
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=1
        # Segment storage
        druid.segmentCache.locations=[{\"path\":\"/opt/druid/var/data/segments\",\"maxSize\":10737418240}]
        druid.server.maxSize=10737418240
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M
      resources:
        limits:
          cpu: 512m
          memory: 4Gi
        requests:
          cpu: 256m
          memory: 1Gi
      hpAutoscaler:
        maxReplicas: 1
        minReplicas: 1
        scaleTargetRef:
           apiVersion: apps/v1
           kind: StatefulSet
           name: druid-tiny-cluster-historicals
        metrics:
        - type: Resource
          resource:
            name: cpu
            targetAverageUtilization: 50

    routers:
      kind: Deployment
      nodeType: "router"
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
      replicas: 1
      runtime.properties: |
        druid.service=druid/router

        # HTTP proxy
        druid.router.http.numConnections=10
        druid.router.http.readTimeout=PT5M
        druid.router.http.numMaxThreads=10
        druid.server.http.numThreads=10

        # Service discovery
        druid.router.defaultBrokerServiceName=druid/broker
        druid.router.coordinatorServiceName=druid/coordinator

        # Management proxy to coordinator / overlord: required for unified web console.
        druid.router.managementProxy.enabled=true       
      extra.jvm.options: |-
        -Xmx512M
        -Xms512M
      ingressAnnotations:
          "nginx.ingress.kubernetes.io/rewrite-target": "/"
      ingress:
        rules:
          - http:
              paths:
              - path: /routers
                backend:
                  serviceName: druid-druid-cluster-routers
                  servicePort: 8088
      hpAutoscaler:
        maxReplicas: 1
        minReplicas: 1
        scaleTargetRef:
           apiVersion: apps/v1
           kind: StatefulSet
           name: druid-tiny-cluster-routers
        metrics:
        - type: Resource
          resource:
            name: cpu
            targetAverageUtilization: 50
      resources:
        limits:
          cpu: 512m
          memory: 4Gi
        requests:
          cpu: 256m
          memory: 512Mi

I'd appreciate any advice on what the correct resource limits should be. Apologies if this is a more general request, but I've not had the same problem with the Helm chart, so I wondered if I was missing something about the way the operator handles resources. Cheers!

AdheipSingh commented 3 years ago

Which nodeType is crashing out of the above nodes? Can you do a kubectl get pods <name_of_pod> -n <namespace> -o yaml and capture the exit code and error? It should be 137, I guess, for an OOM. Just to confirm: is it a K8s OOM or a Java OOME? You might need to tune your JVM -Xmx and -Xms accordingly with the resources in K8s.
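
As a rough rule of thumb (a sketch with illustrative numbers, not settings taken from this cluster): the container's memory limit needs to cover the heap (-Xmx), Druid's direct memory for processing buffers (roughly druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1)), plus some JVM/OS overhead. Note that the common jvm.options above sets -XX:MaxDirectMemorySize=10240g, so the JVM itself won't stop direct memory from growing past the container limit. Something along these lines, for any one node group under spec.nodes:

brokers:
  # ... rest of the node spec unchanged ...
  extra.jvm.options: |-
    -Xmx512M
    -Xms512M
  resources:
    requests:
      memory: 2Gi
    # limit >= heap + direct memory for buffers + overhead
    limits:
      memory: 2Gi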

EwanValentine commented 3 years ago

Here we go (I've redacted or removed a couple of bits, but hopefully this is enough info):

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-05-04T17:54:50Z"
  generateName: druid-druid-cluster-historicals-6dd648bdf5-
  labels:
    app: druid
    druid_cr: druid-cluster
    nodeSpecUniqueStr: druid-druid-cluster-historicals
    pod-template-hash: 6dd648bdf5
  name: druid-druid-cluster-historicals-6dd648bdf5-jkxj2
  namespace: ewanstenant
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: druid-druid-cluster-historicals-6dd648bdf5
    uid: c0b695fd-3d48-4e75-85f0-4fe6c18f58e5
  resourceVersion: "101032843"
spec:
  affinity: {}
  containers:
  - command:
    - /druid.sh
    - historical
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: apache/druid:0.21.0
    imagePullPolicy: IfNotPresent
    name: druid-druid-cluster-historicals
    ports:
    - containerPort: 8088
      name: druid-port
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: 8088
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/druid/var/conf/druid/cluster/_common
      name: common-config-volume
      readOnly: true
    - mountPath: /opt/druid/var/conf/druid/cluster/data/historical
      name: nodetype-config-volume
      readOnly: true
    - mountPath: /opt/druid/var
      name: data-volume
    - mountPath:xxxxxxxxxxx
      name: druid-scaling-spike-token-pwwcj
      readOnly: true
    - mountPath: xxxxxxxxxx
      name: xxxxxxx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: xxxxxxxx
  nodeSelector:
    service: ewanstenant-druid
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: druid-scaling-spike
  serviceAccountName: druid-scaling-spike
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: ewanstenant-druid
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: xxxxxx
    projected:
      defaultMode: xxx
      sources:
      - serviceAccountToken:
          audience: xxxxxxxx
          expirationSeconds: xxxxxx
          path: xxxxxx
  - configMap:
      defaultMode: 420
      name: druid-cluster-druid-common-config
    name: common-config-volume
  - configMap:
      defaultMode: 420
      name: druid-druid-cluster-historicals-config
    name: nodetype-config-volume
  - emptyDir: {}
    name: data-volume
  - name: druid-scaling-spike-token-pwwcj
    secret:
      defaultMode: 420
      secretName: druid-scaling-spike-token-pwwcj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    message: 'containers with unready status: [druid-druid-cluster-historicals]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    message: 'containers with unready status: [druid-druid-cluster-historicals]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://b177f6f11c694bc19671bfebb50f9abba23b6f80f066e8c26cad50d70c0c7461
    image: apache/druid:0.21.0
    imageID: docker-pullable://apache/druid@sha256:e4e60c6c0a0bfa2a06b9d02e753533fb1f8ecffd7958de350e1118741c4dce5c
    lastState:
      terminated:
        containerID: docker://b177f6f11c694bc19671bfebb50f9abba23b6f80f066e8c26cad50d70c0c7461
        exitCode: 137
        finishedAt: "2021-05-04T18:03:42Z"
        reason: OOMKilled
        startedAt: "2021-05-04T18:03:31Z"
    name: druid-druid-cluster-historicals
    ready: false
    restartCount: 6
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=druid-druid-cluster-historicals
          pod=druid-druid-cluster-historicals-6dd648bdf5-jkxj2_ewanstenant(51a3f4ae-f9ab-470b-89e4-2424a03ccbac)
        reason: CrashLoopBackOff
  hostIP: 172.35.186.210
  phase: Running
  podIP: 172.35.178.243
  podIPs:
  - ip: 172.35.178.243
  qosClass: BestEffort
  startTime: "2021-05-04T17:56:20Z"

They're all doing it apart from the routers. I've tried removing the resources entirely, but strangely it's still the same. Yes, it looks like a K8s OOM, though I could be wrong.

Cheers!

AdheipSingh commented 3 years ago

I'm not an expert in Druid configurations, but here's what caught my eye:

 druid.segmentCache.locations=[{\"path\":\"/opt/druid/var/data/segments\",\"maxSize\":10737418240}]
 druid.server.maxSize=10737418240

That's a pretty large number, and I don't see any PVC attached for it either. Do you mind lowering these values and checking? BTW, try referring to these configurations: https://github.com/apache/druid/blob/master/examples/conf/druid/cluster/data/historical/runtime.properties
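
For example (illustrative values, and assuming you stay on the emptyDir-backed /opt/druid/var rather than attaching a PVC), the segment cache could be shrunk to something the node can actually hold. Note that inside a YAML literal block the quotes don't need escaping:

historicals:
  # ... rest of the node spec unchanged ...
  runtime.properties: |
    druid.service=druid/historical
    # keep the cache small enough for the volume backing /opt/druid/var
    druid.segmentCache.locations=[{"path":"/opt/druid/var/data/segments","maxSize":2147483648}]
    druid.server.maxSize=2147483648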

EwanValentine commented 3 years ago

I've decreased this, set the memory limit to 4Gi, and the requests to 1Gi. Unfortunately I just get an immediate OOM from K8s. This is the last thing I see in the logs:

2021-05-05T12:11:03,426 INFO [main] org.apache.druid.server.initialization.jetty.JettyServerModule - Creating http connector with port [8083]
2021-05-05T12:11:03,828 WARN [main] org.eclipse.jetty.server.handler.gzip.GzipHandler - minGzipSize of 0 is inefficient for short content, break even is size 23
2021-05-05T12:11:04,130 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[0] of size[524,288,000]
2021-05-05T12:11:04,810 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[1] of size[524,288,000]
2021-05-05T12:11:05,499 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[2] of size[524,288,000]
2021-05-05T12:11:05,972 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[3] of size[524,288,000]
2021-05-05T12:11:06,272 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[4] of size[524,288,000]
2021-05-05T12:11:06,574 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[5] of size[524,288,000]
2021-05-05T12:11:06,874 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[6] of size[524,288,000]
2021-05-05T12:11:07,176 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[7] of size[524,288,000]

We're attempting to use S3 deep storage, as we unfortunately had permission issues with EFS on K8s/EKS, and S3 deep storage seemed to get around that.
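
For what it's worth, the log above shows at least eight intermediate processing buffers of 524,288,000 bytes each, i.e. roughly 3.9 GiB of direct (off-heap) memory, which together with the 512 MB heap is already past a 4Gi limit before anything else is counted. That also doesn't match the single 512 MiB buffer the historicals' runtime.properties ask for, which suggests the node-level config may not be the one actually in effect. A budget that stays inside a 4Gi limit might look something like this (illustrative values only):

historicals:
  # ... rest of the node spec unchanged ...
  extra.jvm.options: |-
    -Xmx512M
    -Xms512M
  runtime.properties: |
    druid.processing.numThreads=1
    druid.processing.numMergeBuffers=1
    # direct memory needed: (1 + 1 + 1) x 256 MiB = 768 MiB
    druid.processing.buffer.sizeBytes=268435456
  resources:
    limits:
      # 512 MiB heap + ~768 MiB direct memory + headroom for overhead
      memory: 4Gi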

EwanValentine commented 3 years ago

[UPDATE] - If I set the memory limit on historicals to 16Gi, then it seems to work, which is okay for now! I'll try and figure out a way to bring it down.

EwanValentine commented 3 years ago

I'm not sure if this is related, but I also get:

2021-05-05T15:59:01,999 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-05-05T15:59:01,999 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Socket error occurred: localhost/127.0.0.1:2181: Connection refused

This is even though I've configured Zookeeper in the runtime properties:

# Zookeeper
druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
druid.zk.paths.base=/druid
druid.zk.service.compress=false

AdheipSingh commented 3 years ago

If it helps, you can refer to some configurations over here: https://gist.github.com/AdheipSingh/2d3ed8fcd3b57a5b3e1a01f0bda3ba27

Regarding Zookeeper, you can exec into any of your Druid pods and try to telnet to the svc to debug :)

EwanValentine commented 3 years ago

Ahhh this is awesome, thank you so much!

I got a little further: I realised it wasn't picking up the Zookeeper host because it wasn't picking up the shared runtime properties, which led me back to the permissions error. I then found another issue where someone set runAsUser to 0 as a temporary workaround, which seems to have done the trick! But I'll go through your example as well. I appreciate all your help :)

AdheipSingh commented 3 years ago

@EwanValentine Druid runs as user 1000 (https://github.com/apache/druid/blob/master/distribution/docker/Dockerfile#L48), so make sure you configure the securityContext accordingly for that UID. The operator supports both a pod securityContext and a container securityContext. BTW, you can take this as a reference too: https://github.com/druid-io/druid-operator/issues/12#issuecomment-585939544
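
A minimal pod-level sketch for that UID (it mirrors the securityContext already in the posted CR; fsGroup is what makes mounted volumes such as the data-volume emptyDir group-writable for the druid user):

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000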

EwanValentine commented 3 years ago

If I set those to anything other than 0, even if I copy and paste your example above, I get:

mkdir: can't create directory 'var/tmp': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied

I don't plan on leaving it as 0, it's just whilst I'm trying to get something working. I spotted a separate issue raised around this, and a hotfix as part of version 0.21.0, so I think once that's released I can get around this properly. For context, I had to update to version 0.21.0 because the previous version of Druid wasn't honouring our service account IAM role, which 0.21.0 fixes. I believe it was this: https://github.com/apache/druid/pull/11167/files | https://github.com/druid-io/druid-operator/issues/18