GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

Use StatefulSet instead of Deployment for TMs and JM #353

Closed shashken closed 3 years ago

shashken commented 3 years ago

Hey @functicons, when using RocksDB as the state backend, an SSD is optimal for performance. When I tried to map a suitable StorageClass volume in the current deployment, I failed; after an investigation, it seems there is currently no way to map a custom volume without migrating from Deployment to StatefulSet. The best way to do so appears to be volumeClaimTemplates (https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).

I made a branch where I changed the TMs and JMs to use StatefulSet and added volumeClaimTemplates as a possible configuration. With that change I was able to map a custom StorageClass volume, and everything seems to be working well. StatefulSet has a parallel pod management mode that deploys the pods in parallel (like a Deployment). Other than that, I don't see a difference, besides the name being more intuitive (no more random string at the end of the pod's name).

I'll create a PR to show the needed changes. Do you see any implications of using StatefulSet rather than Deployment? The PR is #354.
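For illustration, a minimal sketch of the kind of TaskManager spec this moves to; the `fast-ssd` StorageClass and all resource names here are placeholders, not taken from the PR:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-tm                    # placeholder name
spec:
  serviceName: flink-tm             # headless Service for stable pod DNS names
  replicas: 3
  podManagementPolicy: Parallel     # start pods in parallel, like a Deployment
  selector:
    matchLabels:
      app: flink-tm
  template:
    metadata:
      labels:
        app: flink-tm
    spec:
      containers:
        - name: taskmanager
          image: flink:1.11         # placeholder image
          volumeMounts:
            - name: rocksdb-state
              mountPath: /flink-state
  volumeClaimTemplates:             # one PVC is created per pod automatically
    - metadata:
        name: rocksdb-state
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd  # placeholder SSD-backed StorageClass
        resources:
          requests:
            storage: 100Gi
```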

functicons commented 3 years ago

@shashken, thanks for the proposal, but could you provide more detail about the problem you hit with the current Deployment-based approach? Why isn't it possible to create a PVC first and reference it in the volumes field of the FlinkCluster CR?

shashken commented 3 years ago

Hey @functicons, here are some further details. With Deployments, we can either create a single PVC for all TMs (bad, since all TMs would write their state to a single volume and "share" its performance), or manually create X PVCs for X TMs and map them accordingly (which imitates the behavior of volumeClaimTemplates and opens us up to all sorts of problems because of the manual approach).

With a StatefulSet, when a pod is started, a claim is created from the template and the pod's name, achieving the needed result without any fancy management by the operator.

An example of the problem k8s faces with a Deployment: let's say you have 30 TMs and need 30 PVCs for them (an SSD volume for each TM, where the local RocksDB is written). How will k8s know to map a PVC to a TM? Without the template mechanism of StatefulSet, you have to specify a name in each PVC and map it to each TM manually. And when a pod crashes and is recreated by k8s, how can we reuse the previous PVC for it, or delete the old one and map a new one to it? These things seem to be impossible with a Deployment and require us to use StatefulSet.
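To make the naming concrete: for the sketch above (StatefulSet `flink-tm`, claim template `rocksdb-state`, 3 replicas), Kubernetes derives one claim per pod as follows (names illustrative):

```yaml
# PVC name = <claim template name>-<pod name>:
#   rocksdb-state-flink-tm-0  <->  pod flink-tm-0
#   rocksdb-state-flink-tm-1  <->  pod flink-tm-1
#   rocksdb-state-flink-tm-2  <->  pod flink-tm-2
# If flink-tm-1 crashes and is recreated, it reattaches to
# rocksdb-state-flink-tm-1; a Deployment offers no such stable mapping.
```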

I think we can't ignore the need to use an external volume for each TM separately for both:

elanv commented 3 years ago

Have you ever tried emptyDir? If the local disk attached to the Kubernetes node meets the performance requirements, it seems to be an available option.
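For reference, the emptyDir option described here looks roughly like this (pod and volume names are placeholders; in practice this would be the TM pod template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flink-tm-local              # placeholder name
spec:
  containers:
    - name: taskmanager
      image: flink:1.11             # placeholder image
      volumeMounts:
        - name: rocksdb-state
          mountPath: /flink-state
  volumes:
    - name: rocksdb-state
      emptyDir: {}                  # backed by the node's local disk; deleted with the pod
```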

And perhaps the Flink operator should move towards supporting Flink's native Kubernetes features (#303, #168, #227), so it would be good to consider how to align this requirement with Flink upstream.

shashken commented 3 years ago

Yes, but emptyDir is mapped directly to the host: if the local state needs to be bigger than what the node has to offer, it simply won't work and will fill up the disk. Moreover, it's logical to use different storage types for different volumes (the state backend needs to be very fast, but the other volumes do not), and it doesn't make sense to change the whole cluster's default disk settings just for the Flink cluster. @elanv

shashken commented 3 years ago

After another look at Flink's efforts on native k8s support, I see StatefulSet has another advantage discussed there as well: when running a single JobManager as a StatefulSet, there is a guarantee that only 1 JM is active at a time, allowing a simple HA solution such as using a filesystem (replacing a ZooKeeper deployment). Do you see any advantage of Deployment over StatefulSet? They seem to be pretty similar, but with some advantages for StatefulSet in this case. @functicons @elanv
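As a hedged illustration of that direction only (this uses Flink's own Kubernetes HA services, available from Flink 1.12, not something this operator configures): with at most one active JM, HA metadata can live on a storageDir and a ZooKeeper deployment can be dropped. The config keys are real Flink options; the paths and cluster id are placeholders:

```yaml
# flink-conf.yaml fragment (illustrative)
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///flink-ha     # e.g. a path on a mounted PVC
kubernetes.cluster-id: flink-cluster-example       # placeholder cluster id
```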

functicons commented 3 years ago

I don't see technical problems with replacing Deployment with StatefulSet, but even though it doesn't change the API/CRD, I still consider it a big change. We need to be careful not to break anything.

shashken commented 3 years ago

Agreed. The forked version (in the PR) has been running for more than a week for me (I have to use it, since we have a big state: the node storage is simply not enough, and at that size it must perform well). What kinds of tests would you like to run to make sure nothing is broken?

functicons commented 3 years ago

Glad to know that you have tested it for a while with big state.

But it is not about tests; I'm more concerned about breaking existing deployments. For example, after merging this PR, the service account used by the operator (by default, the default service account in the flink-operator-system namespace) will need permission to create StatefulSets. That would break existing deployments that use a service account without this permission.
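Concretely, the operator's RBAC rules would need something like the following added (a sketch; the actual role name and scope depend on the operator's manifests):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: flink-operator-statefulsets   # illustrative name
rules:
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```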

But since the API / CRD is in beta, we don't guarantee no breaking changes.

chethanuk commented 3 years ago

> replacing Deployment with StatefulSet

Replace it for all deployments? Sometimes we have just a normal stateless job, or some Beam batch jobs or Flink jobs; these don't need StatefulSet.

It's better to have a flag: if we want to use RocksDB, for example, use StatefulSet only in that case. Using StatefulSet for every single deployment is not right, in my opinion.

shashken commented 3 years ago

It would be better to stick with a single option if it meets all requirements.
Can you explain what cons you see in using StatefulSet? With the PodManagementPolicy flag set to Parallel, the scaling times are the same. Are you aware of problems StatefulSets face vs. Deployments? @ChethanUK

chethanuk commented 3 years ago

StatefulSets bring in extra overhead [or maybe I just have bad experience with STS on GKE].

Everything is not as simple as with Deployments: sometimes you need to intervene manually and perform operations to fix issues or restore from a backup state. Sometimes pods won't roll back from an error or crash state, requiring you to manually force-delete them, etc.

If our deployments continue to require manual intervention, then it's not an ideal way forward; cost-wise, Deployments are also better.

I think it's best to:

  1. Provide an option to use STS or Deployments (see the sketch after this list)
  2. Use STS only in the case of RocksDB
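A purely hypothetical sketch of what option 1 could look like on the FlinkCluster CR; `workloadType` is not an existing field in the CRD, and the claim template placement is illustrative:

```yaml
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: my-cluster
spec:
  taskManager:
    workloadType: StatefulSet       # hypothetical field: StatefulSet | Deployment
    volumeClaimTemplates:           # hypothetical: honored only with StatefulSet
      - metadata:
          name: rocksdb-state
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: fast-ssd   # placeholder StorageClass
          resources:
            requests:
              storage: 100Gi
```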