druid-io / druid-operator

Druid Kubernetes Operator

Coordinator is unbalanced #249

Closed sneerin closed 2 years ago

sneerin commented 2 years ago

I have 4-6 coordinators, but usually only one of them is heavily loaded; the rest show almost no CPU activity. Sample config:

      nodeType: "coordinator"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-nodepool
                    operator: In
                    values:
                      - druid-master
      tolerations:
      - key: "druid-master"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"       
      druid.port: 8088
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      replicas: 6
      resources:
        limits:
          cpu: 7
          memory: 12Gi
        requests:
          cpu: 2
          memory: 6Gi        
      runtime.properties: |
        druid.service=druid/coordinator

        # Coordinator start delay and run period
        druid.coordinator.startDelay=PT30S
        druid.coordinator.period=PT30S

        # Configure this coordinator to also run as Overlord
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
        druid.indexer.queue.startDelay=PT30S
        druid.indexer.runner.type=local

        # HTTP server threads
        druid.server.http.numThreads=3
      extra.jvm.options: |-
        -Xmx6G
        -Xms512M        
        -Daws.region=us-east-1

kubectl top po | grep druid

druid-analytics-brokers-0           120m     3905Mi
druid-analytics-brokers-1           107m     3922Mi
druid-analytics-brokers-2           135m     3881Mi
druid-analytics-brokers-3           136m     3905Mi
druid-analytics-brokers-4           126m     3891Mi
druid-analytics-brokers-5           124m     3894Mi
druid-analytics-coordinators-0      6980m    11486Mi
druid-analytics-coordinators-1      70m      1093Mi
druid-analytics-coordinators-2      69m      824Mi
druid-analytics-coordinators-3      54m      688Mi
druid-analytics-coordinators-4      43m      867Mi
druid-analytics-coordinators-5      141m     853Mi
druid-analytics-historicals-0       769m     13544Mi
druid-analytics-historicals-1       1949m    13987Mi
druid-analytics-historicals-2       5207m    15657Mi
druid-analytics-middlemanagers-0    4m       453Mi
druid-analytics-middlemanagers-1    4m       462Mi
druid-analytics-middlemanagers-2    5m       458Mi
druid-analytics-middlemanagers-3    5m       450Mi
druid-analytics-routers-0           9m       322Mi
druid-analytics-routers-1           14m      321Mi

As you can see, CPU usage on one coordinator (coordinators-0) is extremely high while the others are nearly idle. Is there any way to force balancing? The coordinators are spread across a few physical nodes, but the load always goes to one of them.

harinirajendran commented 2 years ago

In Druid, for the coordinator and overlord, there is always only one leader that does the bulk of the work, so this behavior is expected. You can use the Coordinator APIs to see who the leader is.
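
For illustration, here is a minimal Python sketch of that leader lookup. It assumes the coordinator is reachable on port 8088 (the `druid.port` from the config above) and uses Druid's `GET /druid/coordinator/v1/leader` endpoint. The helper names `leader_url` and `current_leader`, the injectable `fetch` parameter, and the example host are hypothetical, and the response is assumed to be a short text body containing the leader's address.

```python
import urllib.request

# Druid Coordinator API path that reports the current leader.
LEADER_PATH = "/druid/coordinator/v1/leader"


def leader_url(base_url):
    """Build the leader-lookup URL for one coordinator (hypothetical helper)."""
    return base_url.rstrip("/") + LEADER_PATH


def current_leader(base_url, fetch=None):
    """Ask any coordinator which instance is currently the leader.

    `fetch` is an injectable callable (url -> response text) so the function
    can be exercised without a live cluster; by default it does an HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read().decode("utf-8")
    # Assumption: the body is the leader address, possibly quoted or
    # followed by a newline, so strip both.
    return fetch(leader_url(base_url)).strip().strip('"')


if __name__ == "__main__":
    # Assumed address, e.g. after
    # `kubectl port-forward druid-analytics-coordinators-1 8088:8088`.
    print(current_leader("http://localhost:8088"))
```

Any coordinator replica can be queried, not just the busy one; each should report the same leader, which in the output above would presumably be the pod using about 7 CPU cores.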

sneerin commented 2 years ago

Thanks for the answer! So based on that, it would be more efficient to give more CPU to that instance instead of spreading several coordinators across different nodes, is that correct?

harinirajendran commented 2 years ago

Yeah, you can give more CPU to the coordinators to handle this workload. For high availability, it's good practice to have at least 2 coordinator nodes, so that if the leader goes down for any reason, the other one can become the leader.

sneerin commented 2 years ago

Thanks a lot, I guess this can be closed.