
Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Remove ExecutorReservation and change the task assignment philosophy from executor first to task first #823

Closed · yahoNanJing closed this 1 year ago

yahoNanJing commented 1 year ago

Which issue does this PR close?

Closes #708.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

yahoNanJing commented 1 year ago

Hi @thinkharderdev and @Dandandan, could you help review this PR?

collimarco commented 1 year ago

+1 for merging this, as it unblocks the implementation of this major feature: https://github.com/apache/arrow-ballista/issues/645

yahoNanJing commented 1 year ago

Thanks @thinkharderdev for your comments.

> I'm still a little confused as to why this is required to enable caching.

For consistent-hashing-based task assignment, we should assign a task based on the files it scans, if any. The details are described in #833. This means it's necessary to assign a specific executor to a task, rather than a random task to an executor. To achieve good data-cache-aware task scheduling, the scheduler needs a global view of the cluster's executor slots.
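For illustration, a file-keyed hash ring could look like the following minimal sketch. `HashRing`, `executor_for`, and the virtual-node count are hypothetical, not Ballista's actual API, and the design in #833 may differ:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

struct HashRing {
    // Hash value -> executor id; a BTreeMap gives ordered lookup on the ring.
    ring: BTreeMap<u64, String>,
}

impl HashRing {
    fn new(executor_ids: &[String]) -> Self {
        let mut ring = BTreeMap::new();
        for id in executor_ids {
            // A few virtual nodes per executor smooth out the distribution.
            for replica in 0..16 {
                ring.insert(hash(&(id, replica)), id.clone());
            }
        }
        Self { ring }
    }

    // Tasks scanning the same file always map to the same executor,
    // so that executor's local data cache keeps being reused.
    fn executor_for(&self, scan_file: &str) -> Option<&String> {
        let h = hash(&scan_file);
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next()) // wrap around the ring
            .map(|(_, id)| id)
    }
}

fn hash<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Under this scheme the scheduler picks an executor per task, which is only possible if it sees all executors' slots at once, hence the move away from per-executor reservations.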

> The original goal of the ExecutorReservation was to minimize contention on the task slots state.

I totally understand the purpose of ExecutorReservation. However, in the current implementation it does not actually reduce contention much: https://github.com/apache/arrow-ballista/blob/b65464e4b73590470fa69aad5b6954300ad243a0/ballista/scheduler/src/state/mod.rs#L190-L228

As the above code shows, if there are still pending tasks, the scheduler still goes on to invoke reserve_slots.
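In outline, that path looks roughly like this sketch; it is a hypothetical condensation of the linked code, with simplified stand-in types and a stubbed `reserve_slots`, not the actual implementation:

```rust
// Simplified stand-in for the scheduler's reservation type.
struct ExecutorReservation {
    executor_id: String,
}

fn reserve_slots(n: u32) -> Vec<ExecutorReservation> {
    // In the real scheduler this path locks the shared slot state --
    // exactly the contention the reservation mechanism hoped to avoid.
    (0..n)
        .map(|i| ExecutorReservation { executor_id: format!("executor-{i}") })
        .collect()
}

fn offer_reservations(
    free_reservations: Vec<ExecutorReservation>,
    pending_tasks: usize,
) -> Vec<ExecutorReservation> {
    let mut reservations = free_reservations;
    // Reservations handed back by completed tasks are used first, but any
    // remaining pending tasks still trigger a reserve_slots call.
    if pending_tasks > reservations.len() {
        let missing = (pending_tasks - reservations.len()) as u32;
        reservations.extend(reserve_slots(missing));
    }
    reservations
}
```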

To reduce the resource contention, i.e. lock contention, I'll raise another PR on top of this one to refactor the event processing and introduce batch event processing: for example, combining ten task status update events into one so that the shared state is only contended once. Sample code can be found here. With this new implementation, throughput improved by around 50% in our load testing.
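As a sketch of the batching idea, assuming status updates arrive on a channel; `TaskStatus`, `SlotState`, and `MAX_BATCH` are hypothetical stand-ins for the scheduler's actual types and tuning, not the sample code referenced above:

```rust
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Mutex};

struct TaskStatus {
    task_id: usize,
    succeeded: bool,
}

struct SlotState {
    free_slots: usize,
}

const MAX_BATCH: usize = 10;

fn process_status_events(rx: &Receiver<TaskStatus>, state: &Arc<Mutex<SlotState>>) {
    // Block for the first event, then drain whatever else is already
    // queued, up to MAX_BATCH, without blocking again.
    let mut batch = vec![match rx.recv() {
        Ok(event) => event,
        Err(_) => return, // channel closed
    }];
    while batch.len() < MAX_BATCH {
        match rx.try_recv() {
            Ok(event) => batch.push(event),
            Err(_) => break,
        }
    }

    // One lock acquisition covers the whole batch, so ten status updates
    // cost one round of contention instead of ten.
    let mut slots = state.lock().unwrap();
    for event in &batch {
        if event.succeeded {
            slots.free_slots += 1;
        }
    }
}
```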

And in a cluster with multiple active schedulers, the reservation mechanism may leave some schedulers starved of task slots.

yahoNanJing commented 1 year ago

Since this PR has been under review for half a month, if there are no objections I'll merge it in the next few days so that the data-cache-related PRs can proceed.