apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0

[Improvement] Optimize local disk selection strategy #373

Open zuston opened 1 year ago

zuston commented 1 year ago

What would you like to be improved?

I want to raise this issue to improve stability when using the MEMORY_LOCALFILE storage type. Some related issues may become sub-tasks of this improvement.

The first improvement is to avoid failing all apps when a single disk's capacity reaches the high watermark. We could make the following optimizations.

  1. Introduce metrics for the top 10 apps by number of bytes written (#333).
  2. Introduce free-space and total-space metrics for every local disk.
  3. Introduce a pluggable disk selection strategy. Currently the disk is selected based on a hash. A free-capacity-based strategy should also be supported.
  4. Allow an app to write data to another disk when its assigned disk reaches the high watermark (#306).
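
To make items 3 and 4 concrete, here is a minimal sketch of what a pluggable strategy could look like. Everything in it is an assumption for illustration: the names `DiskSelectionStrategy`, `Disk`, `HashStrategy` and `FreeCapacityStrategy` are hypothetical and are not taken from the Uniffle codebase, and the watermark is modelled as a simple used-space ratio.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical abstraction; not an existing Uniffle interface.
interface DiskSelectionStrategy {
    // Picks one disk for the given key (e.g. an app or shuffle id).
    Disk select(String key, List<Disk> candidates);
}

// Minimal stand-in for a local disk with a snapshot of its space usage.
final class Disk {
    final String path;
    final long freeBytes;
    final long totalBytes;

    Disk(String path, long freeBytes, long totalBytes) {
        this.path = path;
        this.freeBytes = freeBytes;
        this.totalBytes = totalBytes;
    }

    // Fraction of the disk that is in use, between 0.0 and 1.0.
    double usedRatio() {
        return 1.0 - (double) freeBytes / totalBytes;
    }
}

// Hash-based selection: the same key always maps to the same disk,
// so no per-app placement metadata has to be kept in memory.
final class HashStrategy implements DiskSelectionStrategy {
    @Override
    public Disk select(String key, List<Disk> candidates) {
        int index = Math.floorMod(key.hashCode(), candidates.size());
        return candidates.get(index);
    }
}

// Free-capacity-based selection: skips disks at or above the high
// watermark and picks the one with the most free bytes.
final class FreeCapacityStrategy implements DiskSelectionStrategy {
    private final double highWatermark;

    FreeCapacityStrategy(double highWatermark) {
        this.highWatermark = highWatermark;
    }

    @Override
    public Disk select(String key, List<Disk> candidates) {
        return candidates.stream()
            .filter(d -> d.usedRatio() < highWatermark)
            .max(Comparator.comparingLong((Disk d) -> d.freeBytes))
            .orElseThrow(() -> new IllegalStateException(
                "all disks are at or above the high watermark"));
    }
}
```

With a watermark of 0.8 and three disks that are 90%, 40% and 60% full, `FreeCapacityStrategy` would skip the first and return the second. The trade-off is that a capacity-based choice is not reproducible from the key alone, so the chosen disk would have to be remembered somewhere.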

zuston commented 1 year ago

PTAL @jerqi @xianjingfeng @leixm @smallzhongfeng @kaijchen

jerqi commented 1 year ago

  1. We chose the hash selection strategy because we want to reduce the amount of metadata that we need to maintain in memory.

xianjingfeng commented 1 year ago

  1. Can we use consistent hashing?

advancedxy commented 1 year ago

Introduce the pluggable disk selection strategy. Currently the disk will be selected based on the hash. Free-capacity based strategy should be supported.

Agreed. Currently the hash-based strategy may cause unbalanced disk I/O across disks, as apps' shuffle patterns may vary dramatically. A capacity- and disk-stats-based strategy would be very nice to have.

advancedxy commented 1 year ago

Introduce the free space & total space metrics of every local disk

@zuston how do you plan to collect these metrics? By using df, or any other fancy ways?
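
One option that avoids shelling out to df is the JDK's `java.nio.file.FileStore` API, which reports total and usable bytes for the filesystem backing a path. The following is only a sketch of that idea: the `DiskSpace` class is a hypothetical name, and nothing in this thread confirms which approach Uniffle ends up using.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical holder for a point-in-time snapshot of one disk's space.
final class DiskSpace {
    final long totalBytes;
    final long usableBytes;

    private DiskSpace(long totalBytes, long usableBytes) {
        this.totalBytes = totalBytes;
        this.usableBytes = usableBytes;
    }

    // Reads the space figures of the filesystem that contains `dir`.
    static DiskSpace of(Path dir) {
        try {
            FileStore store = Files.getFileStore(dir);
            // getUsableSpace() is the space available to this JVM,
            // which may be less than the unallocated space.
            return new DiskSpace(store.getTotalSpace(), store.getUsableSpace());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `DiskSpace.of(...)` once per configured local directory on a timer would be enough to feed two gauges per disk. Note that the values are only a hint: the JDK documentation states that the usable-space figure is not a guarantee that the space can actually be used.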