apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

A new Druid MiddleManager Resource Scheduling Model Based On K8s(MOK) #10824

Open zhangyue19921010 opened 3 years ago

zhangyue19921010 commented 3 years ago

Motivation

Druid uses the MiddleManager service to launch Peons for data ingestion. Users can set druid.indexer.runner.javaOpts in the MiddleManager runtime.properties to control the JVM configuration of Peons, such as memory size. The Overlord schedules Peons onto the appropriate MiddleManager node based on task slots.

The current resource scheduling model described above has a few limitations:

  1. The resource utilization of a MiddleManager node is uncontrollable: the MiddleManager must reserve a large amount of memory in advance to provide sufficient resources for the Peons it might launch.
  2. Different types of tasks must use the same resource settings, which causes waste. For example, a low-workload batch task has to use the same resources as a Kafka ingestion task. Users can set druid.indexer.runner.javaOpts in the task context to modify the JVM parameters of a specific Peon, but because the current scheduling model is slot-based, they can only specify a smaller memory size: setting a larger memory size in the task context would over-allocate the MiddleManager's memory and cause an OOM. On the other hand, because resources are pre-allocated, setting a lower memory size for a specific Peon is meaningless.
  3. A Peon needs CPU resources to do calculations or respond to queries, and different types of tasks have different CPU requirements. The current resource scheduling model does not limit CPU resources. This can waste CPU when multiple low-CPU tasks run on the same MiddleManager node, or cause excessive CPU usage and longer query times when multiple high-CPU tasks run on the same node. Therefore, it is also necessary to limit CPU resources.

Proposed changes

A new extension-contrib, druid-kubernetes-middlemanager-extensions, would be added with an implementation of BasedRestorableTaskRunner named K8sForkingTaskRunner, a new module named K8sMiddleManagerModule, and so on. Additionally, since this is the first such extension, some changes might be needed in core as well to enable writing the extension.

The following new properties would also be added to the MiddleManager runtime.properties:

| Property | Description | Default |
|---|---|---|
| `druid.indexer.runner.mode` | The running mode of MiddleManager-Peon. If set to `k8s`, the MiddleManager will create and own Peon pods to do ingestion on K8s. | `native` |
| `druid.indexer.namespace` | The namespace of the Druid cluster on K8s. | `default` |
| `druid.indexer.serviceAccountName` | The serviceAccount for the Druid cluster to use. | `default` |
| `druid.indexer.image` | The Druid-based image. | `druid/cluster:v1` |
| `druid.indexer.default.pod.memory` | The default memory limit of a Peon pod created by the MiddleManager. | `2G` |
| `druid.indexer.default.pod.cpu` | The default CPU limit of a Peon pod created by the MiddleManager. | `1` |
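Under this proposal, a MiddleManager's runtime.properties might look like the following. This is a hypothetical sketch: the namespace, service account, and image values are illustrative, not part of the proposal.

```properties
# Enable the K8s-based MiddleManager-Peon mode (MOK)
druid.indexer.runner.mode=k8s

# K8s namespace and service account the Druid cluster runs under
# (illustrative values)
druid.indexer.namespace=druid
druid.indexer.serviceAccountName=druid-sa

# Image used for the Peon pods
druid.indexer.image=druid/cluster:v1

# Default resource limits for Peon pods created by the MiddleManager
druid.indexer.default.pod.memory=2G
druid.indexer.default.pod.cpu=1
```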

The following new properties would be added to the task context:

| Property | Description | Default |
|---|---|---|
| `druid.peon.javaOpts` | The JVM configs of a specific Peon pod. | JVM configs in MiddleManager runtime.properties |
| `druid.peon.pod.memory` | The memory limit of a specific Peon pod created by the MiddleManager. | JVM configs in MiddleManager runtime.properties |
| `druid.peon.pod.cpu` | The CPU limit of a specific Peon pod created by the MiddleManager. | JVM configs in MiddleManager runtime.properties |

As you can see, the priority of the properties mentioned above is: task context > runtime.properties > coded default values.
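For example, a task could override the defaults through its context. This is a hypothetical sketch: the task type and the specific values are illustrative, and the spec body is elided.

```json
{
  "type": "index_parallel",
  "spec": { "...": "..." },
  "context": {
    "druid.peon.javaOpts": "-Xms512m -Xmx512m",
    "druid.peon.pod.memory": "1G",
    "druid.peon.pod.cpu": "0.5"
  }
}
```

With slot-based scheduling this kind of per-task override was unsafe; with MOK, K8s enforces the per-pod limits, so a smaller or larger request simply changes what the scheduler reserves for that one Peon pod.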

"druid-kubernetes-middlemanager-extensions" needs to be added to druid.extensions.loadList in the MiddleManager runtime.properties only.
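That is, only the MiddleManager's runtime.properties would carry an entry like the following (sketch; any other extensions already in the list would of course be kept):

```properties
druid.extensions.loadList=["druid-kubernetes-middlemanager-extensions"]
```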

Rationale

Based on ForkingTaskRunner, a new runner named K8sForkingTaskRunner would be created.

Instead of using ProcessBuilder.start() to create a new child process as ForkingTaskRunner does, we use kubernetes-java-client to create Peon pods and run tasks in them, and also handle stop, trace, log collection, and garbage collection through K8s.

  1. Use a ConfigMap to pass task.json from the MiddleManager to the Peon pod. There is a conflict between the local directory layout and the ConfigMap mountPath: a mountPath does not allow ":" in the path, so this hand-off has to be done carefully.
  2. Use ownerReference to do garbage collection, so that when the Peon is done, everything related to it, such as the ConfigMap, is deleted automatically.
  3. Communication between the MiddleManager and the Peon pod is needed for log collection and lifecycle control.
  4. Use kubernetes-java-client to create the pod, wait for the pod to be running, wait for the pod to finish, and so on.
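To make steps 1 and 2 concrete, the objects created per task might look roughly like this. This is a hypothetical sketch: all names, namespaces, and paths are assumptions for illustration, not part of the proposal.

```yaml
# The ConfigMap carrying task.json lists the Peon pod as its owner, so
# K8s garbage-collects it automatically once the pod is deleted (step 2).
apiVersion: v1
kind: ConfigMap
metadata:
  name: peon-task-abc-config        # hypothetical name
  namespace: druid
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: peon-task-abc           # the Peon pod below
      uid: <pod-uid>                # filled in after the pod is created
data:
  task.json: |
    { "...": "..." }
---
apiVersion: v1
kind: Pod
metadata:
  name: peon-task-abc
  namespace: druid
spec:
  serviceAccountName: druid-sa
  containers:
    - name: peon
      image: druid/cluster:v1
      resources:
        limits:
          memory: "2G"
          cpu: "1"
      volumeMounts:
        # mountPath may not contain ":" (step 1), so the Peon's task
        # directory, which can embed ":" in task ids, must be mapped
        # to a sanitized path like this one
        - name: task-json
          mountPath: /druid/task-config
  volumes:
    - name: task-json
      configMap:
        name: peon-task-abc-config
```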

Advantage

Cost savings and improved resource utilization.

We use Peon pods only to do data ingestion and let the K8s cluster do the resource scheduling work that K8s is good at. When a Druid cluster enables MOK, users can set different CPU/memory resources for different tasks, and K8s will schedule and run the Peon pods with high resource utilization.

Also, if we combine pods with something like AWS Fargate (https://aws.amazon.com/fargate/), resource usage and cost can improve further. The MiddleManager can temporarily request the appropriate resources (you pay only for the resources requested), run the Peon pod, and release those resources after the task finishes.

In short, there is no need to let the MiddleManager take up a lot of resources in advance; it requests resources only when it will use them, and different kinds of tasks can use different configurations, including CPU resources.

Operational impact

None

Test plan (optional)

I would test the extension on dev Druid clusters deployed in K8s, covering both data ingestion and data query.

egor-ryashin commented 3 years ago

Just wondering if you've checked this one https://github.com/apache/druid/issues/8801

zhangyue19921010 commented 3 years ago

Sorry I didn't notice this issue before.

zhangyue19921010 commented 3 years ago

After chatting with nishantmonu51, I will re-open this issue and raise a PR to contribute my work.