Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Worker selection algorithms in the High Throughput Executor Interchange #3323

Open benclifford opened 2 months ago

benclifford commented 2 months ago

~Parsl will hopefully supervise one intern with Outreachy in the middle of 2024 on~ a project to add extra worker selection algorithms to the High Throughput Executor. This issue should track activity on this project.

The short project description from the outreachy website is:

===

Project Description

More Efficient Worker Selection in Parsl's High Throughput Executor

Parsl's High Throughput Executor distributes tasks given to it by the core of Parsl to be executed on worker nodes. Right now, it picks an available worker randomly. This has worked well over the last 5 years, but there are some situations where a different algorithm could be used to pick a worker, and we think that we would get better performance if so.
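As a rough baseline to compare the alternatives against, the current behaviour amounts to something like the sketch below (illustrative Python only; `managers` and `slots_free` are placeholder names, not the interchange's real data structures):

```python
import random

def pick_manager_random(managers, slots_free):
    """Baseline: pick uniformly at random among managers that still
    have free worker slots (placeholder structures, not Parsl's
    actual interchange state)."""
    candidates = [m for m in managers if slots_free.get(m, 0) > 0]
    return random.choice(candidates) if candidates else None
```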

We have four different ideas for algorithms. We dive further into these algorithms in the “Intern Tasks” section.

The goal of this internship is to implement some of those algorithms, to understand why they might (or might not) be helpful, and to measure how well they do or do not work.

Internship Tasks

Getting Started

- Get Parsl running and a development environment set up
- Run Parsl on a bigger computer system, which we'll get you access to
- Make a Pull Request – we'll walk you through making one to gain confidence
- Learn about Parsl's task distribution system (the part of Parsl that you'll be working on the most) and Parsl's monitoring system (to measure how efficiently things are working). This involves reading the source code, experimenting to understand the flow, and having hack sessions with mentors.

The First Big Change: After understanding how to implement different selection algorithms, where to make them, and why, you will:

- Make those changes!
- Test them out with your mentor
- Write tests to check your changes work (so that when other people change the code, they don't break your stuff)
- Write documentation so users understand when and why they would want to use what you have implemented
- Make a pull request and experience the review process
- Get your changes merged and ready for release!

There are four algorithms we would like to try (descriptions of each will be provided):

1: Scaling down worker nodes more effectively

We would like to be able to "scale in" some nodes so that they can be released for other users and so that our users do not need to pay for them. We already have code that can release a node when it isn't being used any more, BUT a node can only be released when it has no tasks on it (otherwise those tasks will fail), and only when it is released at the same time as all the other nodes in the block it was allocated with. So in order to get nodes empty like that, we need to choose, ahead of time, how we will place tasks so that all the nodes in some block end up empty, by choosing to place tasks on some other set of nodes instead (random selection does not do this).

To measure the outcome of this, one idea is to show a plot of tasks overlaid on/alongside a plot of how many nodes are allocated: using this algorithm, we should see the node count drop further when there are not many tasks, compared to the same plots for the same workload using the random algorithm.
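One way this could work, sketched very roughly below: prefer managers whose block already has the most running work, so that quiet blocks drain naturally and become eligible for scale-in. The data structures here (`slots_free`, `tasks_running`, `block_of`) are made up for illustration and are not the interchange's actual state:

```python
def pick_manager_for_scale_in(managers, slots_free, tasks_running, block_of):
    """Illustrative sketch: concentrate work onto the busiest blocks so
    lightly loaded blocks drain and can be scaled in whole.

    managers      - iterable of manager ids
    slots_free    - manager id -> free worker slots
    tasks_running - block id   -> tasks currently running in that block
    block_of      - manager id -> block id
    (all hypothetical names, not Parsl internals)
    """
    candidates = [m for m in managers if slots_free.get(m, 0) > 0]
    if not candidates:
        return None
    # Prefer a manager whose block already has the most running work,
    # so idle/quiet blocks stay empty and become candidates for scale-in.
    return max(candidates, key=lambda m: tasks_running.get(block_of.get(m), 0))
```

A real implementation would presumably also need to cooperate with the scaling strategy, so that a block being drained stops receiving new tasks at all.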

2: Avoiding nodes that will run out of execution time

In most execution environments, nodes are allocated only for a specific amount of time. We would like to avoid (for example) placing a task that we know will take about 10 minutes onto a node that we know has only about 5 minutes of time left before it is forcibly cancelled.

The measurable outcome of this: given some workload, with the random algorithm we'd expect to see a certain percentage of tasks fail because the node they are on is cancelled; with this new algorithm we'd expect a smaller percentage of tasks to fail - ideally 0%! To present this to users, what we probably need is something that calculates that percentage (or raw task counts).
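A minimal sketch of this idea, under the assumption that we have (or can estimate) a per-manager expiry time and a runtime estimate for each task - neither of which the interchange currently tracks, so these are hypothetical inputs:

```python
import time

def pick_manager_with_enough_time(managers, slots_free, expiry_time,
                                  task_runtime_estimate):
    """Illustrative sketch: skip managers whose batch job will end before
    the task would finish.

    expiry_time           - manager id -> wall-clock time the node is killed
    task_runtime_estimate - estimated task duration in seconds
    (hypothetical inputs, not currently available in Parsl)
    """
    now = time.time()
    candidates = [
        m for m in managers
        if slots_free.get(m, 0) > 0
        and expiry_time.get(m, float("inf")) - now > task_runtime_estimate
    ]
    if not candidates:
        return None
    # Best-fit flavour: among nodes with enough time left, prefer the one
    # closest to expiry, keeping long-lived nodes free for longer tasks.
    return min(candidates, key=lambda m: expiry_time.get(m, float("inf")))
```

Whether the best-fit tie-break is worth it (versus just filtering and then picking randomly) is the kind of question the monitoring system could help answer.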

3: Multi-block MPI

PR #3016 introduced a prototype for running MPI tasks in the HighThroughputExecutor. It does not work properly with multiple batch jobs because, at the moment, the interchange cannot properly understand which batch jobs have free space (in an MPI-ranks sense).
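The missing piece is essentially per-batch-job rank accounting. A hypothetical sketch (the `blocks_free_ranks` structure is made up for illustration and is not anything from PR #3016):

```python
def pick_block_for_mpi_task(blocks_free_ranks, ranks_needed):
    """Illustrative sketch: track free MPI ranks per batch job (block) and
    only dispatch a task to a block that can hold all of its ranks.

    blocks_free_ranks - block id -> number of currently free ranks
    ranks_needed      - ranks requested by the MPI task
    (hypothetical bookkeeping, not the actual interchange state)
    """
    fitting = {b: free for b, free in blocks_free_ranks.items()
               if free >= ranks_needed}
    if not fitting:
        return None  # task has to wait until some block frees enough ranks
    # Best fit: the block with the least leftover space, to limit fragmentation.
    return min(fitting, key=lambda b: fitting[b] - ranks_needed)
```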

4: Spreading the load evenly

Currently, for regular workloads, the interchange allocates tasks to different workers randomly. This is intended to spread the load evenly. Are there better "spread the load evenly" strategies here? For example, sending tasks to the managers with the fewest tasks on them?
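For example, a least-loaded choice could look like this sketch (placeholder names, not Parsl internals):

```python
def pick_least_loaded_manager(managers, slots_free, outstanding_tasks):
    """Illustrative sketch of a 'spread the load' policy: send the task to
    the manager with the fewest outstanding tasks instead of a random one.

    slots_free        - manager id -> free worker slots
    outstanding_tasks - manager id -> tasks currently assigned to that manager
    (hypothetical structures, not Parsl internals)
    """
    candidates = [m for m in managers if slots_free.get(m, 0) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda m: outstanding_tasks.get(m, 0))
```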

You don't have to do them all – even one would be a great addition to our codebase and open the way for other people to continue. And don't worry if you don't understand everything written here right now – you will learn more about what's going on during the internship.

Intern Benefits

We hope you'll get some good experience writing code that will be used by real users. We'd like you to get experience with taking a new feature all the way to being released for users to use, so as well as writing code, you'll get some experience writing documentation, making sure things are tested, and explaining how your code works to other developers. In addition, you will make valuable connections and gain professional references.

===

benclifford commented 2 months ago

@danielskatz provided this link, which sounds relevant to algorithm 2 (avoiding nodes that will run out of execution time). It's a link into a pay-walled site, but maybe there is some related public content we can find.

https://ieeexplore.ieee.org/document/5699433

danielskatz commented 2 months ago

There's a PDF of the paper on https://www.researchgate.net/publication/209198810_Scheduling_Many-Task_Workloads_on_Supercomputers_Dealing_with_Trailing_Tasks and slides from the talk are on http://datasys.cs.iit.edu/events/MTAGS10/paper10-slides.pdf

benclifford commented 2 months ago

I think that paper provides enough to build a synthetic workload like the one in the paper, as well as some example graphs that can be recreated and some results to compare against in practice. It's not quite the same as what I'm expecting to be implemented for this project, because I'm expecting to be able to scale down at a more granular level than "all the workers", and I specifically want to avoid killing work. But generally this is a good base on which to start trying to imagine an implementation.

danielskatz commented 1 week ago

https://github.com/applicationskeleton/Skeleton is a code that can generate synthetic workloads (or at least, it used to be...)