bytedance / CloudShuffleService

Cloud Shuffle Service(CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Apache License 2.0
247 stars 57 forks source link

Can zk support high-frequency operations, and whether zk will be a bottleneck? #2

Open long1208 opened 2 years ago

bdyx123 commented 2 years ago

the operations for zk are not very frequently, one is that css workers update status, another is the create/delete operations when register/unregister shuffle. Currently we have 7 zk nodes for a CSS cluster which has hundreds of workers. The zk pressure(memory) mainly comes from lots of zk watches, which we used for tracking the shuffleId lifetime to clean data on css workers. We are doing the optimization for this.

a140262 commented 1 year ago

@bdyx123 , have you seen the following exceptions from Spark application logs? It seems CSS worker has deleted the shuffleID before it tries to update it. Is this behavior normal?

ERROR ZookeeperExternalShuffleMeta: Update zk shuffle node spark-00000003119c3rhae9v-117 failed.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /css/mycss/shuffles/spark-00000003119c3rhae9v-117