GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

Multiple JobManager replicas? #378

Open a-roberts opened 3 years ago

a-roberts commented 3 years ago

Hi everyone, have noticed we can have multiple TaskManager replicas, does it make sense to have multiple JM ones? What if I don't want to use something like Zookeeper for HA?

Is it:

a) a weird idea (e.g. for Flink in general)?
b) off for a reason (was it tough/pointless)?
c) introducing lots of complexity when we needn't?

At this point I'm just curious so any advice would be awesome, thank you!

wangyang0918 commented 3 years ago

Yeah, it makes sense to have multiple JobManagers running. That way we get faster recovery: a standby JobManager can take over leadership and start scheduling the Flink jobs after the old one crashes. From Flink 1.12, we have introduced a Kubernetes HA service for Flink, so you no longer need to manage a ZooKeeper cluster.

Refer here[1] for how to enable K8s HA for your Flink cluster and here[2] for more information about how it works.

[1] https://ci.apache.org/projects/flink/flink-docs-master/deployment/ha/kubernetes_ha.html
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-144:+Native+Kubernetes+HA+for+Flink
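For reference, enabling the Kubernetes HA service (Flink 1.12+) comes down to a few `flink-conf.yaml` entries, per the docs linked above. The cluster id and storage path below are placeholders; any durable storage supported by Flink (S3, GCS, HDFS, Azure Blob, etc.) works for `storageDir`:

```yaml
# flink-conf.yaml (sketch, assuming Flink 1.12+ on Kubernetes)
kubernetes.cluster-id: my-flink-cluster   # placeholder; must be unique per cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink/ha   # placeholder bucket/path
```

With this in place, Flink stores leader information and job-graph pointers in Kubernetes ConfigMaps instead of ZooKeeper, while the actual state still lives in the distributed storage.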

shashken commented 3 years ago

@wangyang0918 Awesome work, thank you! I have a question regarding the "Alternative HA implementation" part. We have just migrated from Deployment to StatefulSet (#354, merged 2 days ago). Would you recommend considering the alternative solution you suggested there, since it reduces complexity?
There was a discussion in this repo about the need for multiple JMs. In a containerized environment, the image pull time plus the time it takes a single pod to get running isn't very significant. What do you think about that? I would really like to hear your opinion, as the author of this feature and suggestion.

wangyang0918 commented 3 years ago

@shashken Since we have already finished the Kubernetes HA service, I do not suggest adding a new simple-but-imperfect solution. Actually, "StatefulSet + PV + FileSystemHAService" would not make things easier: we would still need distributed storage (S3, Azure Blob, HDFS, etc.) for checkpoint and job graph persistence. The only difference is that the pointer to the distributed storage would be stored in the PV instead of a ConfigMap.

I know that launching a new JobManager pod does not take very long. However, it is still necessary to have multiple JobManagers for some very important Flink applications: no one knows what could happen when creating a new pod (it could fail due to network or disk issues). And the alternative could not support multiple JobManagers, since it has no leader election/retrieval service.
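To make the standby idea concrete: with the Kubernetes HA service enabled, standby JobManagers are simply extra replicas of the JobManager workload, and only the elected leader schedules jobs. The sketch below uses generic Kubernetes resources with placeholder names; it is not this operator's CRD syntax:

```yaml
# Sketch: standalone Flink JobManager with one standby (names are placeholders)
apiVersion: apps/v1
kind: StatefulSet            # this operator recently moved to StatefulSet (#354)
metadata:
  name: flink-jobmanager
spec:
  serviceName: flink-jobmanager
  replicas: 2                # one leader + one standby; leadership via the K8s HA service
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.12  # any HA-capable Flink version
          args: ["jobmanager"]
```

Both pods run the same JobManager image; the Kubernetes HA service's leader election decides which one is active, and the other takes over on failure.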

I am not saying the "Kubernetes HA service" is already stable enough for every production use case; it has just been born. But if you run into issues, please file a ticket in the Flink project. I believe the community can find the root cause and fix it ASAP.