apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

A new Druid MiddleManager Resource Scheduling Model Based On K8s(MOK) #10824

Open zhangyue19921010 opened 3 years ago

zhangyue19921010 commented 3 years ago

Motivation

Druid uses the MiddleManager service to launch Peons for data ingestion. Users can set druid.indexer.runner.javaOpts in the MiddleManager runtime.properties to control the JVM configuration of Peons, such as memory size. The Overlord schedules Peons onto the appropriate MiddleManager node based on task slots.

The current resource scheduling model described above has a few limitations:

  1. The resource utilization of a MiddleManager node is uncontrollable: the MiddleManager must reserve a large amount of memory in advance to provide sufficient resources for the Peons it might launch.
  2. Different types of tasks must use the same resource settings, which causes waste. For example, a low-workload batch task has to use the same resources as a Kafka ingestion task. Users can set druid.indexer.runner.javaOpts in the task context to modify the JVM parameters of a specific Peon, but because the current scheduling model is slot-based, they can only specify a smaller memory size: setting a larger memory size in the task context would over-allocate the MiddleManager's memory and cause an OOM. On the other hand, because resources are pre-allocated, setting a lower memory size for a specific Peon is meaningless.
  3. A Peon needs CPU resources to do calculations or respond to queries, and different types of tasks have different CPU requirements. The current resource scheduling model does not limit CPU resources. This can waste CPU when multiple low-CPU tasks run on the same MiddleManager node, or cause excessive CPU usage and longer query times when multiple high-CPU tasks run on the same node. Therefore, it is also necessary to limit CPU resources.

Proposed changes

A new extension-contrib, druid-kubernetes-middlemanager-extensions, would be added with an implementation of BasedRestorableTaskRunner named K8sForkingTaskRunner, a new module named K8sMiddleManagerModule, and so on. Additionally, since this is the first such extension, some changes might be needed in core as well to enable writing the extension.

The following new properties would also be added to the MiddleManager runtime.properties:

| Property | Description | Default |
|---|---|---|
| `druid.indexer.runner.mode` | The running mode of MiddleManager-Peon. If set to `k8s`, the MiddleManager will create and own Peon pods to do ingestion on K8s. | `native` |
| `druid.indexer.namespace` | The namespace of the Druid cluster on K8s. | `default` |
| `druid.indexer.serviceAccountName` | The serviceAccount for the Druid cluster to use. | `default` |
| `druid.indexer.image` | The Druid-based image. | `druid/cluster:v1` |
| `druid.indexer.default.pod.memory` | The default memory limit of a Peon pod created by the MiddleManager. | `2G` |
| `druid.indexer.default.pod.cpu` | The default CPU limit of a Peon pod created by the MiddleManager. | `1` |
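Under this proposal, a MiddleManager's runtime.properties might look like the following. This is a hypothetical sketch: the namespace, service account, and image values are illustrative, not part of the proposal.

```properties
# Enable the K8s-based MiddleManager-Peon mode (MOK)
druid.indexer.runner.mode=k8s

# K8s namespace and service account the Druid cluster runs under
# (illustrative values)
druid.indexer.namespace=druid
druid.indexer.serviceAccountName=druid-sa

# Image used for the Peon pods
druid.indexer.image=druid/cluster:v1

# Default resource limits for Peon pods created by the MiddleManager
druid.indexer.default.pod.memory=2G
druid.indexer.default.pod.cpu=1
```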

The following new properties would be added to the task context:

| Property | Description | Default |
|---|---|---|
| `druid.peon.javaOpts` | The JVM configs of a specific Peon pod. | JVM configs in MiddleManager runtime.properties |
| `druid.peon.pod.memory` | The memory limit of a specific Peon pod created by the MiddleManager. | JVM configs in MiddleManager runtime.properties |
| `druid.peon.pod.cpu` | The CPU limit of a specific Peon pod created by the MiddleManager. | JVM configs in MiddleManager runtime.properties |

As you can see, the priority of the properties mentioned above is: task context > runtime.properties > coded default values.
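For example, a task could override the defaults through its context. This is a hypothetical sketch: the task type and the specific values are illustrative, and the spec body is elided.

```json
{
  "type": "index_parallel",
  "spec": { "...": "..." },
  "context": {
    "druid.peon.javaOpts": "-Xms512m -Xmx512m",
    "druid.peon.pod.memory": "1G",
    "druid.peon.pod.cpu": "0.5"
  }
}
```

With slot-based scheduling this kind of per-task override was unsafe; with MOK, K8s enforces the per-pod limits, so a smaller or larger request simply changes what the scheduler reserves for that one Peon pod.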

"druid-kubernetes-middlemanager-extensions" needs to be added to druid.extensions.loadList in the MiddleManager runtime.properties only.
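That is, only the MiddleManager's runtime.properties would carry an entry like the following (sketch; any other extensions already in the list would of course be kept):

```properties
druid.extensions.loadList=["druid-kubernetes-middlemanager-extensions"]
```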

Rationale

Based on ForkingTaskRunner, a new runner named K8sForkingTaskRunner would be created.

Instead of using ProcessBuilder.start() to create a new child process as ForkingTaskRunner does, we use kubernetes-java-client to create Peon pods and run tasks in them, and also handle stop, trace, log collection, and garbage collection through K8s.

  1. Use a ConfigMap to pass task.json from the MiddleManager to the Peon pod. There is a conflict between the local directory layout and the ConfigMap mountPath: a mountPath does not allow ":" in the path, so this hand-off has to be done carefully.
  2. Use ownerReference to do garbage collection, so that when the Peon is done, everything related to it, such as the ConfigMap, is deleted automatically.
  3. Communication between the MiddleManager and the Peon pod is needed for log collection and lifecycle control.
  4. Use kubernetes-java-client to create the pod, wait for the pod to be running, wait for the pod to finish, and so on.
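To make steps 1 and 2 concrete, the objects created per task might look roughly like this. This is a hypothetical sketch: all names, namespaces, and paths are assumptions for illustration, not part of the proposal.

```yaml
# The ConfigMap carrying task.json lists the Peon pod as its owner, so
# K8s garbage-collects it automatically once the pod is deleted (step 2).
apiVersion: v1
kind: ConfigMap
metadata:
  name: peon-task-abc-config        # hypothetical name
  namespace: druid
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: peon-task-abc           # the Peon pod below
      uid: <pod-uid>                # filled in after the pod is created
data:
  task.json: |
    { "...": "..." }
---
apiVersion: v1
kind: Pod
metadata:
  name: peon-task-abc
  namespace: druid
spec:
  serviceAccountName: druid-sa
  containers:
    - name: peon
      image: druid/cluster:v1
      resources:
        limits:
          memory: "2G"
          cpu: "1"
      volumeMounts:
        # mountPath may not contain ":" (step 1), so the Peon's task
        # directory, which can embed ":" in task ids, must be mapped
        # to a sanitized path like this one
        - name: task-json
          mountPath: /druid/task-config
  volumes:
    - name: task-json
      configMap:
        name: peon-task-abc-config
```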

Advantage

Cost savings and improved resource utilization.

We use Peon pods only to do data ingestion and let the K8s cluster do the resource scheduling work that K8s is good at. When a Druid cluster enables MOK, users can set different CPU/memory resources for different tasks, and K8s will schedule and run the Peon pods with high resource utilization.

Also, if we combine pods with something like AWS Fargate (https://aws.amazon.com/fargate/), resource usage and cost can improve further. The MiddleManager can temporarily request the appropriate resources (you pay only for the resources requested), run the Peon pod, and release those resources after the task finishes.

In short, there is no need to let the MiddleManager take up a lot of resources in advance; it requests resources only when it will use them, and different kinds of tasks can use different configurations, including CPU resources.

Operational impact

None

Test plan (optional)

I would test the extension on dev Druid clusters deployed in K8s, covering both data ingestion and data query.

egor-ryashin commented 3 years ago

Just wondering if you've checked this one https://github.com/apache/druid/issues/8801

zhangyue19921010 commented 3 years ago

Sorry I didn't notice this issue before.

zhangyue19921010 commented 3 years ago

After chatting with nishantmonu51, I will re-open this issue and raise a PR to contribute my work.