Open elanv opened 4 years ago
When creating this operator, I considered defining two CRDs, one for clusters and one for jobs, but then decided to use a single CRD for both session clusters and job clusters. The reasoning was: 1) I wanted to keep the usage simple; 2) I thought the job cluster was the preferred way of running Flink jobs, and that the session cluster might eventually be deprecated.
But what you said above sounds reasonable to me. Having 2 CRDs will be more flexible. In addition to the reasons you listed, it allows us to add more job types (e.g., Beam) easily.
But for the existing job cluster CRD, do you plan to split it into 2 CRDs or keep it untouched? I feel we should split it, which means we unify the model for session cluster and job cluster.
@functicons Sounds good. I also feel it's better to unify the model.
It would be nice to introduce the Flink job CRD as an experimental feature and deprecate the `job` field of `FlinkCluster` gradually or in the next API version.
I think this feature is a major change, so you can start a new API version v1beta2. It doesn't block anything for v1beta1.
There are two motivations for the introduction of a new CRD for Flink job submission.
The first is operating an application that consists of a number of light Flink jobs. A per-job cluster has the advantage of isolating the cluster and its resources for each job, which protects jobs from failures of other jobs and allows resources to be allocated cleanly in a multi-tenant environment. However, it uses resources inefficiently when an application is composed of multiple small Flink jobs. Each job needs its own dedicated job manager and task managers, and the overhead of the JVM and the Flink framework itself consumes much more resources than session mode does. Isolation at the level of a cluster shared between closely related jobs also seems reasonable.
There are several alternatives to this, but each has its own weakness.
Application mode was introduced in Flink 1.11, but it has the disadvantage that deployment cannot be controlled individually for each job that makes up the Flink application.
With the Flink operator's session cluster mode, jobs cannot use the advantages of per-job mode, such as the various management automation features, declarative configuration, and a Kubernetes API for the Flink job.
Second, Flink's native Kubernetes support introduced dynamic task manager deployment. Unlike before, the session cluster is scaled on demand when a Flink job is submitted, so session cluster mode becomes much more useful on Kubernetes.
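
To make the proposal a bit more concrete, here is a rough sketch in Go of what a separate job resource could look like. It is not the operator's actual API: the `FlinkJob` and `FlinkJobSpec` types, the `ClusterName` field, and the `GroupByCluster` helper are all hypothetical names chosen for illustration. The only point is that several light jobs can reference the same session cluster by name while still being declared as individual resources.

```go
package main

import (
	"fmt"
	"sort"
)

// FlinkJobSpec is a hypothetical spec for a standalone job resource.
// All field names here are assumptions, not the operator's real schema.
type FlinkJobSpec struct {
	ClusterName string // session cluster the job is submitted to (assumed field)
	JarFile     string // job artifact to run
	Parallelism int32  // requested job parallelism
}

// FlinkJob is a hypothetical job resource, decoupled from the cluster resource.
type FlinkJob struct {
	Name string
	Spec FlinkJobSpec
}

// GroupByCluster returns the job names targeting each cluster, sorted by name.
// A controller could use a grouping like this to see which jobs share a cluster.
func GroupByCluster(jobs []FlinkJob) map[string][]string {
	out := make(map[string][]string)
	for _, j := range jobs {
		out[j.Spec.ClusterName] = append(out[j.Spec.ClusterName], j.Name)
	}
	for cluster := range out {
		sort.Strings(out[cluster])
	}
	return out
}

func main() {
	jobs := []FlinkJob{
		{Name: "enrich", Spec: FlinkJobSpec{ClusterName: "shared-session", JarFile: "enrich.jar", Parallelism: 1}},
		{Name: "dedupe", Spec: FlinkJobSpec{ClusterName: "shared-session", JarFile: "dedupe.jar", Parallelism: 1}},
		{Name: "heavy-etl", Spec: FlinkJobSpec{ClusterName: "dedicated", JarFile: "etl.jar", Parallelism: 8}},
	}
	for cluster, names := range GroupByCluster(jobs) {
		fmt.Println(cluster, names)
	}
}
```

With a shape like this, closely related light jobs share one cluster's job manager and task managers, while a heavy job can still point at its own dedicated cluster.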