druid-io / druid-operator

Druid Kubernetes Operator

Proposal: Ingestion Spec controller #313

Open AdheipSingh opened 1 year ago

AdheipSingh commented 1 year ago

Current State of Druid Operator

The Druid operator supports installing, upgrading, and maintaining a Druid cluster. Internally, it runs a Druid controller that talks to the k8s API for these operations. Most of the intelligence built in so far is from a k8s installation perspective, and the CRD spec is very flexible. The current reconcile loop is stable and battle tested. The current CRDs belong to the group druid.apache.org at version v1alpha1, with Druid as the only supported kind. The manager hooks in a single controller, i.e. druid_controller.

Goals

IMHO the Druid operator (the operator framework) is powerful enough to leverage Kubernetes as a control plane for running Druid. All operations and specs can be handled as CRD definitions. The goal is to automate the handling of supervisor configs for ingestion on a Druid cluster by adding a new CRD to the group druid.apache.org.

Design

Separation of concerns: a new CRD + controller

Authentication with Druid API

Design of the CRD, CR Spec and Reconciliation

Reconciliation and State Changes

Controllers are a combination of level driven and event driven. An update to the druidingestion CR can be reconciled as an update event. Still, there is a possibility of an outage in which an event is missed, so the reconcile loop also triggers every N seconds to cover that case. In the current controller, all configs in the CR are converged to first-class Kubernetes objects. The supervisor spec can be created as a configmap; this configmap helps in case an event is missed, because we can trigger an update if the current state is not the same as the desired state. The Druid operator adds an objectHash to the cm (same flow as the current controller).
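The hash comparison described above could look roughly like the following. This is only a sketch under stated assumptions: the `ConfigMap` struct is a simplified stand-in for the real Kubernetes type, and the annotation key `objectHash`, the SHA-256 choice, and the function names are illustrative, not taken from the operator's code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashAnnotation is a hypothetical annotation key; the proposal only says
// that an objectHash is added to the configmap.
const hashAnnotation = "objectHash"

// ConfigMap is a simplified stand-in for the Kubernetes ConfigMap type.
type ConfigMap struct {
	Annotations map[string]string
	Data        map[string]string
}

// specHash returns a hex SHA-256 digest of the JSON-encoded spec.
// encoding/json sorts map keys, so equal specs produce equal hashes.
func specHash(spec map[string]interface{}) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsUpdate reports whether the stored configmap is missing or carries a
// hash that differs from the hash of the desired supervisor spec.
func needsUpdate(current *ConfigMap, desired map[string]interface{}) (bool, error) {
	want, err := specHash(desired)
	if err != nil {
		return false, err
	}
	if current == nil {
		return true, nil
	}
	return current.Annotations[hashAnnotation] != want, nil
}

func main() {
	desired := map[string]interface{}{"type": "kafka", "suspended": false}
	hash, _ := specHash(desired)
	stored := &ConfigMap{Annotations: map[string]string{hashAnnotation: hash}}

	changed, _ := needsUpdate(stored, desired)
	fmt.Println("unchanged spec needs update:", changed)

	desired["suspended"] = true
	changed, _ = needsUpdate(stored, desired)
	fmt.Println("edited spec needs update:", changed)
}
```

Because the check compares state rather than relying on an event, it works the same way whether the reconcile was triggered by an update event or by the periodic loop.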

The CRD shall define a status. The status shall be patched with fields from the HTTP response of the Druid API, and shall include the supervisor id. To suspend the supervisor spec, the controller shall get the id from the status and send a POST request to the overlord API to suspend the supervisor: /druid/indexer/v1/supervisor//suspend
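As a rough illustration of the suspend call, the sketch below builds the POST request from a supervisor id read out of the CR status. It assumes the id goes in the path segment before `suspend`; the function name `suspendRequest`, the overlord address, and the example id `wikipedia` are all hypothetical. Authentication and actually sending the request are left out.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// suspendRequest builds the POST request that asks the overlord to suspend
// a supervisor. supervisorID is assumed to come from the CR status.
func suspendRequest(overlordURL, supervisorID string) (*http.Request, error) {
	if supervisorID == "" {
		// Without an id the path would be malformed, so fail early.
		return nil, errors.New("supervisor id missing from CR status")
	}
	endpoint := strings.TrimRight(overlordURL, "/") +
		"/druid/indexer/v1/supervisor/" + url.PathEscape(supervisorID) + "/suspend"
	return http.NewRequest(http.MethodPost, endpoint, nil)
}

func main() {
	// Hypothetical overlord service address and supervisor id.
	req, err := suspendRequest("http://druid-overlord:8090", "wikipedia")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Keeping request construction separate from sending makes the path logic easy to test without a running Druid cluster; a controller would pass the result to an `http.Client` and patch the CR status from the response.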

Updating suspend to false in the ingestion CR shall cause a reset of the supervisor spec. The operator shall emit events using the events API for each operation handled and update the status of the CR. Deletion of the druid ingestion config shall be controlled by finalizers. Before deletion, the controller makes an HTTP call to delete the supervisor spec; at this point the CR is marked as terminating. Once the requests have completed, the CR is removed.
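The finalizer-controlled deletion can be sketched as below. This is a simplified model, not controller-runtime code: the `Ingestion` struct, the finalizer name, and the `supervisorDeleter` callback (standing in for the HTTP call to Druid) are hypothetical, and `Deleting` stands in for a set deletion timestamp on the CR.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a hypothetical finalizer key for the ingestion CR.
const finalizerName = "druidingestion.druid.apache.org/finalizer"

// errDruidUnavailable simulates a failed call to the Druid API.
var errDruidUnavailable = errors.New("druid overlord unavailable")

// Ingestion is a simplified stand-in for the druidingestion CR.
type Ingestion struct {
	Finalizers   []string
	Deleting     bool   // stands in for a set metadata.deletionTimestamp
	SupervisorID string // read from the CR status
}

// supervisorDeleter abstracts the HTTP call that removes the supervisor.
type supervisorDeleter func(id string) error

func hasFinalizer(cr *Ingestion) bool {
	for _, f := range cr.Finalizers {
		if f == finalizerName {
			return true
		}
	}
	return false
}

func removeFinalizer(cr *Ingestion) {
	kept := make([]string, 0, len(cr.Finalizers))
	for _, f := range cr.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	cr.Finalizers = kept
}

// reconcileDelete returns true once the CR may be removed by Kubernetes.
func reconcileDelete(cr *Ingestion, del supervisorDeleter) (bool, error) {
	if !cr.Deleting {
		// Live object: make sure the finalizer is registered.
		if !hasFinalizer(cr) {
			cr.Finalizers = append(cr.Finalizers, finalizerName)
		}
		return false, nil
	}
	if !hasFinalizer(cr) {
		return true, nil // cleanup already done
	}
	// CR is terminating: clean up in Druid before releasing it.
	if err := del(cr.SupervisorID); err != nil {
		return false, err // keep the finalizer so the next reconcile retries
	}
	removeFinalizer(cr)
	return true, nil
}

func main() {
	cr := &Ingestion{SupervisorID: "wikipedia"}
	reconcileDelete(cr, nil) // registers the finalizer; del is not called

	cr.Deleting = true
	done, err := reconcileDelete(cr, func(string) error { return errDruidUnavailable })
	fmt.Println("after failed cleanup:", done, err)

	done, err = reconcileDelete(cr, func(string) error { return nil })
	fmt.Println("after successful cleanup:", done, err)
}
```

The point of the sketch is the ordering: the finalizer is only removed after the Druid call succeeds, so a failed call leaves the CR in the terminating state until a later reconcile completes the cleanup.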


This proposal might have missed some Druid-specific details of the API. The original issue: https://github.com/druid-io/druid-operator/issues/251

cintoSunny commented 1 year ago

What do you think are the advantages of having this in operator/crd, instead of having them in druid? If I understand this correctly, instead of directly submitting a job to Druid, users have to deploy a CRD. Correct me if I am wrong here. My concerns are that deploying a crd every time may not be feasible for everyone. Not sure what everyone else thinks.

AdheipSingh commented 1 year ago

Do you see Kubernetes as an orchestration platform for running Druid, or as a control plane for running Druid? If you consider the latter, you can leverage CRDs for handling supervisor specs etc.

It is just like the Kafka operator, where you can deploy Kafka and manage topics and ACLs via CRDs. Of course, you can create Kafka topics using CLI clients, the same way we can create Druid supervisors from the console, but if you want full control from k8s, want to enhance GitOps, and want to build the operator as a control plane, this can be a way.