flink-extended / flink-remote-shuffle

Remote Shuffle Service for Flink
Apache License 2.0
191 stars 57 forks source link

Load balance improvement #14

Closed wsry closed 2 years ago

wsry commented 2 years ago

Motivation

Currently, the load balance strategy is pretty simple, DataPartitions are evenly distributed to all ShuffleWorkers. A better way is to allocate shuffle resources based on the actual capacity of ShuffleWorker.

Changes

Allocate shuffle resources based on the actual capacity of ShuffleWorker. Each ShuffleWorker can also select disk based on disk throughput.

Test

TanYuxin-tyx commented 2 years ago

About the load balance improvement issue, I implement a simple version. This PR mainly balances the load according to the worker storage, but not considering the worker throughput load. This change also abstracts the logic of worker selection, which makes it more convenient to add more elegant load balance strategies.

@wsry, @gaoyunhaii Could you please help review the code when having time? Thanks.