Which nodeType is crashing out of the above nodes? Can you do a kubectl get pods <name_of_pod> -n <namespace> -o yaml
and capture the exit code and error? It should be 137, I'd guess, for an OOM.
Just to confirm, is it a K8s OOM or a Java OOME?
You might need to tune your JVM -Xmx and -Xms to line up with the resources you set in K8s.
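For example, a rough sketch of keeping the heap in step with the pod's memory limit in the operator CR (field names as in druid-operator's tiny-cluster example; all values are illustrative, not from this cluster):

nodes:
  historicals:
    nodeType: "historical"
    druid.port: 8088
    replicas: 1
    extra.jvm.options: |-
      -Xms512m
      -Xmx512m
      -XX:MaxDirectMemorySize=1g
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 2Gi   # roughly heap + direct memory + some headroom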
Here we go (I've redacted or removed a couple of bits, but hopefully this is enough info):
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-05-04T17:54:50Z"
  generateName: druid-druid-cluster-historicals-6dd648bdf5-
  labels:
    app: druid
    druid_cr: druid-cluster
    nodeSpecUniqueStr: druid-druid-cluster-historicals
    pod-template-hash: 6dd648bdf5
  name: druid-druid-cluster-historicals-6dd648bdf5-jkxj2
  namespace: ewanstenant
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: druid-druid-cluster-historicals-6dd648bdf5
    uid: c0b695fd-3d48-4e75-85f0-4fe6c18f58e5
  resourceVersion: "101032843"
spec:
  affinity: {}
  containers:
  - command:
    - /druid.sh
    - historical
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: apache/druid:0.21.0
    imagePullPolicy: IfNotPresent
    name: druid-druid-cluster-historicals
    ports:
    - containerPort: 8088
      name: druid-port
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: 8088
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/druid/var/conf/druid/cluster/_common
      name: common-config-volume
      readOnly: true
    - mountPath: /opt/druid/var/conf/druid/cluster/data/historical
      name: nodetype-config-volume
      readOnly: true
    - mountPath: /opt/druid/var
      name: data-volume
    - mountPath: xxxxxxxxxxx
      name: druid-scaling-spike-token-pwwcj
      readOnly: true
    - mountPath: xxxxxxxxxx
      name: xxxxxxx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: xxxxxxxx
  nodeSelector:
    service: ewanstenant-druid
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: druid-scaling-spike
  serviceAccountName: druid-scaling-spike
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: ewanstenant-druid
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: xxxxxx
    projected:
      defaultMode: xxx
      sources:
      - serviceAccountToken:
          audience: xxxxxxxx
          expirationSeconds: xxxxxx
          path: xxxxxx
  - configMap:
      defaultMode: 420
      name: druid-cluster-druid-common-config
    name: common-config-volume
  - configMap:
      defaultMode: 420
      name: druid-druid-cluster-historicals-config
    name: nodetype-config-volume
  - emptyDir: {}
    name: data-volume
  - name: druid-scaling-spike-token-pwwcj
    secret:
      defaultMode: 420
      secretName: druid-scaling-spike-token-pwwcj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    message: 'containers with unready status: [druid-druid-cluster-historicals]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    message: 'containers with unready status: [druid-druid-cluster-historicals]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-05-04T17:56:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://b177f6f11c694bc19671bfebb50f9abba23b6f80f066e8c26cad50d70c0c7461
    image: apache/druid:0.21.0
    imageID: docker-pullable://apache/druid@sha256:e4e60c6c0a0bfa2a06b9d02e753533fb1f8ecffd7958de350e1118741c4dce5c
    lastState:
      terminated:
        containerID: docker://b177f6f11c694bc19671bfebb50f9abba23b6f80f066e8c26cad50d70c0c7461
        exitCode: 137
        finishedAt: "2021-05-04T18:03:42Z"
        reason: OOMKilled
        startedAt: "2021-05-04T18:03:31Z"
    name: druid-druid-cluster-historicals
    ready: false
    restartCount: 6
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=druid-druid-cluster-historicals
          pod=druid-druid-cluster-historicals-6dd648bdf5-jkxj2_ewanstenant(51a3f4ae-f9ab-470b-89e4-2424a03ccbac)
        reason: CrashLoopBackOff
  hostIP: 172.35.186.210
  phase: Running
  podIP: 172.35.178.243
  podIPs:
  - ip: 172.35.178.243
  qosClass: BestEffort
  startTime: "2021-05-04T17:56:20Z"
They're all doing it apart from the routers. I've tried removing the resources entirely, but strangely it's still the same. It looks like a K8s OOM, though I could be wrong.
Cheers!
I'm not an expert in Druid configurations, but here's what caught my eye:
druid.segmentCache.locations=[{\"path\":\"/opt/druid/var/data/segments\",\"maxSize\":10737418240}]
druid.server.maxSize=10737418240
That's a pretty large number, and I don't see any PVC attached for it either. Do you mind lowering these values and checking? BTW, try referring to these configurations: https://github.com/apache/druid/blob/master/examples/conf/druid/cluster/data/historical/runtime.properties
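For example, something like this in the historicals node spec (a sketch only; 2 GiB is picked arbitrarily to fit the emptyDir-backed /opt/druid/var and is not a recommended value):

historicals:
  runtime.properties: |
    # segment cache sized well below the volume backing /opt/druid/var
    druid.server.maxSize=2147483648
    druid.segmentCache.locations=[{"path":"/opt/druid/var/data/segments","maxSize":2147483648}]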
I've decreased this, set the memory limit to 4Gi and the requests to 1Gi, but I still get OOMKilled immediately from K8s, unfortunately. This is the last thing I see in the logs:
2021-05-05T12:11:03,426 INFO [main] org.apache.druid.server.initialization.jetty.JettyServerModule - Creating http connector with port [8083]
2021-05-05T12:11:03,828 WARN [main] org.eclipse.jetty.server.handler.gzip.GzipHandler - minGzipSize of 0 is inefficient for short content, break even is size 23
2021-05-05T12:11:04,130 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[0] of size[524,288,000]
2021-05-05T12:11:04,810 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[1] of size[524,288,000]
2021-05-05T12:11:05,499 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[2] of size[524,288,000]
2021-05-05T12:11:05,972 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[3] of size[524,288,000]
2021-05-05T12:11:06,272 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[4] of size[524,288,000]
2021-05-05T12:11:06,574 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[5] of size[524,288,000]
2021-05-05T12:11:06,874 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[6] of size[524,288,000]
2021-05-05T12:11:07,176 INFO [main] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[7] of size[524,288,000]
We're attempting to use S3 deep storage, as we unfortunately had permission issues with EFS on K8s/EKS, and S3 deep storage seemed to get around that.
[UPDATE] - If I set the memory limit on historicals to 16Gi, then it seems to work, which is okay for now! I'll try and figure out a way to bring it down.
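For reference, Druid allocates those intermediate processing buffers off-heap: a historical needs roughly druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) of direct memory, so the eight ~500 MB allocations in the log above already account for about 4 GiB before the heap is even counted, which would explain why a 4Gi limit dies and 16Gi survives. A rough sketch of how the historicals' node spec could be trimmed to fit a smaller limit (numbers are illustrative, not tuned for this cluster):

historicals:
  extra.jvm.options: |-
    -Xms1g
    -Xmx1g
    -XX:MaxDirectMemorySize=1g
  runtime.properties: |
    # direct memory needed ~= (numThreads + numMergeBuffers + 1) * buffer.sizeBytes
    # here: (2 + 2 + 1) * 100 MB ~= 500 MB, well under MaxDirectMemorySize
    druid.processing.buffer.sizeBytes=100000000
    druid.processing.numThreads=2
    druid.processing.numMergeBuffers=2
  resources:
    limits:
      memory: 3Gi   # heap + direct memory + headroom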
I'm not sure if this is related, but I also get:
2021-05-05T15:59:01,999 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-05-05T15:59:01,999 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Socket error occurred: localhost/127.0.0.1:2181: Connection refused
Although I've configured Zookeeper in the runtime properties:
# Zookeeper
druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
druid.zk.paths.base=/druid
druid.zk.service.compress=false
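For what it's worth, in the operator CR these ZooKeeper settings usually sit in the cluster-wide common.runtime.properties (mounted under .../_common for every node) rather than in a single node's runtime.properties; a minimal sketch, reusing only the host name quoted above and assuming the field name used in the operator's examples:

spec:
  common.runtime.properties: |
    # Zookeeper
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false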
If it helps, you can refer to some configurations over here: https://gist.github.com/AdheipSingh/2d3ed8fcd3b57a5b3e1a01f0bda3ba27
Regarding ZooKeeper, you can exec into any of your Druid pods and try to telnet to the svc to debug :)
Ahhh this is awesome, thank you so much!
I got a little further: I realised it wasn't picking up the ZooKeeper host because it wasn't picking up the shared runtime properties, which led me back to the permissions error. Then I found another issue where someone set the runAsUser to 0 as a temporary workaround, which seems to have done the trick! But I'll go through your example as well. Appreciate all your help :)
@EwanValentine Druid runs as user 1000 (https://github.com/apache/druid/blob/master/distribution/docker/Dockerfile#L48), so make sure you configure the securityContext accordingly for that UID. The operator supports both a pod security context and a container security context. BTW, you can take this as a reference too: https://github.com/druid-io/druid-operator/issues/12#issuecomment-585939544
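A minimal sketch of what that could look like at the top level of the Druid CR (field name as used in the operator's examples; adjust to the operator version in use):

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000   # lets the druid user (UID/GID 1000) write to mounted volumes such as /opt/druid/var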
If I set those to anything other than 0, including if I copy and paste your example above, I get:
mkdir: can't create directory 'var/tmp': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
mkdir: can't create directory 'var/druid/': Permission denied
I don't plan on leaving it as 0, just whilst I'm trying to get something working. I spotted a separate issue raised around this, and a hotfix as part of version 0.21.0, so I think once that's released I can get around this properly. For context, I had to update to version 0.21.0 because the previous version of Druid wasn't honouring our service account IAM role, which 0.21.0 fixes. I believe it was this: https://github.com/apache/druid/pull/11167/files
Hey there,
I'm attempting to use the operator on a Kubernetes cluster (EKS). I'm using a couple of m5.large EC2 instances, but I'm getting an OOMKilled error. Here's my config:
Appreciate any advice as to what the correct resource limits should be. Also, apologies if this is a more general request; I've not had the same problem with the Helm chart, though, so I wondered if I was missing something about the way the operator handles resources. Cheers!