[FRS-72] Fix the heartbeat issue caused by zookeeper restart

What is the purpose of the change

This resolves #72. Currently, the shuffle manager may fail to remove a lost shuffle worker if ZooKeeper restarts, because the restart changes the RPC main thread executor. This patch fixes the issue.

Brief change log

Add an e2e test to cover the scenario.
Use the cluster IO executor to perform heartbeat timeout check.
Fix a metric issue as a hotfix commit.
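
As a rough illustration of the second item, the sketch below schedules the heartbeat timeout check on a dedicated executor instead of the RPC main thread executor, so the check keeps running if the latter is replaced. It is a minimal sketch and not the code in this patch: the class `HeartbeatMonitorSketch`, its methods, and the `ioExecutor` parameter are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; these are not the project's actual classes. */
public class HeartbeatMonitorSketch {

    // Last heartbeat timestamp (in milliseconds) per worker ID.
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat received from the given worker. */
    public void receiveHeartbeat(String workerId, long nowMillis) {
        lastHeartbeatMillis.put(workerId, nowMillis);
    }

    /** Removes and returns the workers whose last heartbeat is older than the timeout. */
    public List<String> removeTimedOutWorkers(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMillis.entrySet()) {
            Long seen = entry.getValue();
            // remove(key, value) only succeeds if no newer heartbeat arrived in between.
            if (nowMillis - seen > timeoutMillis
                    && lastHeartbeatMillis.remove(entry.getKey(), seen)) {
                lost.add(entry.getKey());
            }
        }
        return lost;
    }

    /**
     * Schedules the periodic check on the supplied executor (assumed here to be a
     * long-lived IO executor that is independent of the RPC main thread executor).
     */
    public void start(ScheduledExecutorService ioExecutor, long intervalMillis) {
        ioExecutor.scheduleWithFixedDelay(
                () -> removeTimedOutWorkers(System.currentTimeMillis()),
                intervalMillis,
                intervalMillis,
                TimeUnit.MILLISECONDS);
    }
}
```

In this sketch the timeout logic takes the current time as a parameter, so it can be tested without waiting on a real clock; only `start` ties it to an executor.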

Verifying this change

This change added tests.

flink-extended / flink-remote-shuffle

[FRS-72] Fix the heartbeat issue caused by zookeeper restart #73
