apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Implement 3-phase consistent hash based task assignment policy #833

yahoNanJing closed 1 year ago

yahoNanJing commented 1 year ago

Which issue does this PR close?

Closes #831.

Rationale for this change

What changes are included in this PR?

The three rounds of cache-aware task scheduling are as follows:

  1. Assign non-map stage tasks (which do not scan files) in a round-robin fashion.
  2. Assign map stage tasks (which scan files) using the consistent hashing policy, based on the hash value of the file name and the executor topology.
  3. Assign file-scanning tasks using the consistent hashing policy, based on the hash value of the file name and the executor topology, with a tolerance of N. These tasks will not trigger data caching.
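
The three rounds above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Task`, `Assignment`, `HashRing`, `take_slot`, and `assign` are hypothetical names, the ring uses the standard library's `DefaultHasher` with virtual nodes, and "N tolerance" is assumed to mean that a task may fall back to at most N further executors along the ring when its preferred executor has no free slot.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Hypothetical task model: a task either scans one file or none.
pub struct Task {
    pub id: usize,
    pub scan_file: Option<String>,
}

/// Where a task landed and whether it may populate the executor's data cache.
#[derive(Debug, Clone, PartialEq)]
pub struct Assignment {
    pub task_id: usize,
    pub executor: String,
    pub cache_data: bool,
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Consistent hash ring with `replicas` virtual nodes per executor.
pub struct HashRing {
    ring: BTreeMap<u64, String>,
}

impl HashRing {
    pub fn new(executors: &[&str], replicas: usize) -> Self {
        let mut ring = BTreeMap::new();
        for executor in executors {
            for replica in 0..replicas {
                ring.insert(hash_of(&(executor, replica)), executor.to_string());
            }
        }
        Self { ring }
    }

    /// Distinct executors met walking clockwise from the key's position,
    /// at most `tolerance + 1` of them. Index 0 is the preferred executor.
    pub fn candidates(&self, key: &str, tolerance: usize) -> Vec<String> {
        let start = hash_of(&key);
        let mut found: Vec<String> = Vec::new();
        for (_, executor) in self.ring.range(start..).chain(self.ring.range(..start)) {
            if !found.contains(executor) {
                found.push(executor.clone());
                if found.len() > tolerance {
                    break;
                }
            }
        }
        found
    }
}

/// Consume one free slot on `executor`, if it has any left.
fn take_slot(free: &mut HashMap<String, usize>, executor: &str) -> bool {
    if let Some(slots) = free.get_mut(executor) {
        if *slots > 0 {
            *slots -= 1;
            return true;
        }
    }
    false
}

pub fn assign(tasks: &[Task], ring: &HashRing, executors: &[&str],
              slots: usize, tolerance: usize) -> Vec<Assignment> {
    let mut free: HashMap<String, usize> =
        executors.iter().map(|e| (e.to_string(), slots)).collect();
    let mut out = Vec::new();

    // Round 1: tasks that scan no file go round robin over the executors.
    let mut cursor = 0;
    for task in tasks.iter().filter(|t| t.scan_file.is_none()) {
        for _ in 0..executors.len() {
            let executor = executors[cursor % executors.len()];
            cursor += 1;
            if take_slot(&mut free, executor) {
                out.push(Assignment { task_id: task.id, executor: executor.to_string(), cache_data: false });
                break;
            }
        }
    }

    // Round 2: file-scanning tasks go to the executor the ring prefers for the file.
    let mut pending: Vec<(usize, &str)> = Vec::new();
    for task in tasks {
        if let Some(file) = &task.scan_file {
            let preferred = ring.candidates(file, 0);
            match preferred.first() {
                Some(executor) if take_slot(&mut free, executor) => out.push(Assignment {
                    task_id: task.id, executor: executor.clone(), cache_data: true }),
                _ => pending.push((task.id, file.as_str())),
            }
        }
    }

    // Round 3 (assumed semantics): fall back along the ring, without caching.
    for (task_id, file) in pending {
        for executor in ring.candidates(file, tolerance) {
            if take_slot(&mut free, &executor) {
                out.push(Assignment { task_id, executor, cache_data: false });
                break;
            }
        }
    }
    out
}
```

In this sketch, a task that finds no free slot within its tolerance is left unassigned, on the assumption that it would be retried in a later scheduling pass. Only round 2 marks a task as cache-eligible, so a file is cached only on the executor that the ring prefers for it.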

Are there any user-facing changes?

yahoNanJing commented 1 year ago

Hi @collimarco, with this PR, the data cache feature becomes feasible. If this is urgent for you, you can try running this PR. Let's consider merging this PR after https://github.com/apache/arrow-ballista/pull/830 is merged.