[FRS-47] Job can switch to new shuffle cluster only when the old cluster is unavailable

wsry commented 2 years ago

What is the purpose of the change

This solves #47 . Currently, a running job will switch to the new shuffle cluster if the new one has higher version. However, this may cause the shuffle data of one job is managed by multiple clusters which is complicated and easy to cause problem. By only switching to new cluster when the old cluster is unavailable. We can simplify the logic and further support upgrading without influence the running job if the old cluster is kept until all running jobs finish.

Brief change log

Job can switch to new shuffle cluster only when the old cluster is unavailable.

Verifying this change

This change added tests.

Aitozi commented 2 years ago

Hi @wsry Is this also solved by https://github.com/flink-extended/flink-remote-shuffle/pull/49 ?

wsry commented 2 years ago

Hi @wsry Is this also solved by #49 ?

This works together with #49. When there are multiple clusters, a job can select the proper one with some strategy. Currently, the strategy is always selecting the rss with higher version. In the future, we can Implement more complicated strategy, like auto switch some job to new version for grayscale test.

flink-extended / flink-remote-shuffle