Operator pod fails to run when following steps in getting_started.md

druid-io / druid-operator

Druid Kubernetes Operator

Operator pod fails to run when following steps in getting_started.md #281

Open jameskelleher opened 2 years ago

jameskelleher commented 2 years ago

I have a fresh Fargate cluster that I spun up through EKS, and I followed the steps for installing the operator in getting_started.md. The operator pod fails to run: it restarts 5 times and ends in CrashLoopBackOff. Based on the kubectl describe output, it looks like the readiness and liveness probes are failing. I've attached the log and describe output; let me know if there are additional resources I could share. I am on OSX, and I'm sure I used the correct sed command.

describe.txt

log.txt

jameskelleher commented 2 years ago

I noticed that when I tried to install the operator via Helm, I got the following warning in the describe output:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  2m48s (x3 over 3m13s)  default-scheduler  0/2 nodes are available: 2 node(s) had taint {eks.amazonaws.com/compute-type: fargate}, that the pod didn't tolerate.

Does the Druid operator not work properly with Fargate nodes? That would explain why my first approach failed as well.

jameskelleher commented 2 years ago

Additional update: I tried to install Apache's Helm chart and got the same error message about the Fargate taint not being tolerated. I guess Druid simply does not work on Fargate clusters?

pleszczy commented 2 years ago

Taints and tolerations are a basic Kubernetes concept, and what you are seeing has nothing to do with the Druid operator. You have to add a toleration, e.g.:

tolerations:
  - key: eks.amazonaws.com/compute-type
    operator: Equal
    value: fargate
    effect: NoSchedule
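
For context, a toleration like this belongs in the pod spec, which for a Deployment means under spec.template.spec. A minimal sketch of where it would sit, assuming the operator runs as a Deployment named druid-operator (the name is a guess for illustration, and this is a fragment rather than a complete manifest):

```yaml
# Fragment only: shows where the tolerations block goes in a Deployment.
# The metadata name is an assumption; use the name from your own manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: druid-operator   # hypothetical name
spec:
  template:
    spec:
      # Lets the pod be scheduled onto nodes carrying the Fargate taint
      # reported in the FailedScheduling event.
      tolerations:
        - key: eks.amazonaws.com/compute-type
          operator: Equal
          value: fargate
          effect: NoSchedule
```

With a Helm install, the same block would normally be supplied through the chart's values, provided the chart exposes a tolerations setting.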

jameskelleher commented 2 years ago

Hi @pleszczy, thanks for responding! I managed to figure that out. I'm learning Druid, k8s, and EKS, which is a lot to take in all at once.

I was still unable to install the Druid operator, even on a non-Fargate cluster. The pod would fail with a CrashLoopBackOff error. As I'm new to all this, I wasn't sure where to even start looking (especially because this error is so open-ended), so I tried out Druid's sample Helm chart and got a cluster up based on that.

Still, I'm curious why the druid-operator pod would crash, even on a fresh, non-Fargate EKS cluster. Is there another setup step that I'm unaware of? Should I be able to just plug and play, i.e. spin up a new cluster and kubectl apply?