What is the purpose of the change
This resolves #72. Currently, the shuffle manager may fail to remove a lost shuffle worker if ZooKeeper restarts, because the restart changes the RPC main thread executor. This patch fixes the issue.
Brief change log
Use the cluster IO executor to perform the heartbeat timeout check.
Add an e2e test to cover the scenario.
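To illustrate the idea behind this change, here is a minimal sketch of a heartbeat timeout check that runs on a separately owned scheduled executor, so it does not depend on the RPC main thread executor. It is not the code in this patch: the class `HeartbeatTimeoutSketch`, its methods, and the `ioExecutor` parameter are hypothetical names chosen for illustration, and the real shuffle manager likely tracks more state per worker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; these names are not taken from the project. */
class HeartbeatTimeoutSketch {
    // Last heartbeat timestamp (milliseconds) per worker id.
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat; may be called from any thread. */
    void receiveHeartbeat(String workerId, long nowMillis) {
        lastHeartbeatMillis.put(workerId, nowMillis);
    }

    boolean isRegistered(String workerId) {
        return lastHeartbeatMillis.containsKey(workerId);
    }

    /** Removes workers whose last heartbeat is older than the timeout; returns how many were removed. */
    int removeTimedOutWorkers(long nowMillis) {
        int removed = 0;
        for (Map.Entry<String, Long> entry : lastHeartbeatMillis.entrySet()) {
            Long last = entry.getValue();
            // remove(key, value) only succeeds if no newer heartbeat arrived in the meantime.
            if (nowMillis - last > timeoutMillis
                    && lastHeartbeatMillis.remove(entry.getKey(), last)) {
                removed++;
            }
        }
        return removed;
    }

    /**
     * Schedules the periodic check on the given executor (assumed here to be a
     * long-lived IO executor that is not replaced when the main thread executor changes).
     */
    ScheduledFuture<?> start(ScheduledExecutorService ioExecutor, long intervalMillis) {
        return ioExecutor.scheduleWithFixedDelay(
                () -> removeTimedOutWorkers(System.currentTimeMillis()),
                intervalMillis,
                intervalMillis,
                TimeUnit.MILLISECONDS);
    }
}
```

In this sketch any `ScheduledExecutorService` can be passed to `start`, for example one created with `Executors.newSingleThreadScheduledExecutor()`. The state is kept in a concurrent map because the check no longer runs on the same thread that receives heartbeats.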
Verifying this change
This change is covered by the newly added e2e test.