cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Vine: Main Loop events order #3108

Open BarrySlyDelgado opened 1 year ago

BarrySlyDelgado commented 1 year ago

From previous discussions we have noted the ranks of tasks generated by the manager. Most likely this is due to the ordering of events in the main loop. Here is a brief overview of what is happening in the loop:

Since many tasks are likely completing at the same time when the manager polls the workers, the manager performs many synchronous sends and receives. After this, only one task is retrieved, and it is subsequently marked as done (as we go back to the top of the loop). Essentially, after each iteration of the loop, more than one task may be set to WAITING_RETRIEVAL, but at most one task will be set to RETRIEVED and DONE per loop. Thus, if N tasks are set to WAITING_RETRIEVAL in one iteration, there will be at least N more loops before any new tasks are sent to workers. N can grow after each loop up to M, where M is the number of tasks currently dispatched. So, if the manager is able to drain the N waiting tasks before N == M, tasks will be dispatched again. But once N == M, all tasks will have to be retrieved and marked done before tasks are sent again.
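To make that ordering concrete, here is a minimal sketch of the loop as described. Only get_available_results(q,w), which comes up later in this thread, is a real name; every other type and function here is a hypothetical stand-in, not the actual cctools API.

```c
/* Hypothetical sketch of the main loop event order described above.
 * All names except get_available_results are illustrative stand-ins. */
struct manager;
struct worker;
struct task;

struct worker *next_worker_with_results(struct manager *q);
struct task   *next_task_waiting_retrieval(struct manager *q);
void get_available_results(struct manager *q, struct worker *w);
void fetch_output_files(struct manager *q, struct task *t);
void mark_retrieved_and_done(struct manager *q, struct task *t);
void send_tasks(struct manager *q);
int  tasks_remaining(struct manager *q);

void main_loop(struct manager *q)
{
	struct worker *w;
	struct task *t;

	while (tasks_remaining(q)) {
		/* Synchronous send/receive with every worker that has results:
		 * many tasks may enter WAITING_RETRIEVAL in a single pass. */
		while ((w = next_worker_with_results(q)))
			get_available_results(q, w);

		/* But at most ONE task is retrieved and marked DONE per pass,
		 * after which we go straight back to the top of the loop. */
		if ((t = next_task_waiting_retrieval(q))) {
			fetch_output_files(q, t);
			mark_retrieved_and_done(q, t);
			continue;
		}

		/* New tasks are dispatched only once retrievals are drained. */
		send_tasks(q);
	}
}
```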

From the graphs shown, the manager actually does a few retrievals before the big wall of retrievals. Each green dot is essentially one loop, so N is the number of red dots before one green dot. At the next green dot, N = N + (number of red dots since the last green dot) - 1. Eventually N == M (first graph). So, in scenarios where N = 1 on the first loop and N = 0 on the second loop, tasks will be sent again.

[Screenshot from 2023-02-08 18-23-21]

BarrySlyDelgado commented 1 year ago

@dthain @btovar

dthain commented 1 year ago

Thanks, this is a nice summary. My take is that the main loop is functioning as intended: we prefer to do certain operations before others. But the unexpected result comes from the separation of getting the result of a task (exit status and stdout) from the retrieval of that task's output files. Because the manager goes to all the workers to get results, and then returns to all the tasks (in some order) to get files, we end up with unnecessary dead time in between. And if the task has no output files, then it is a total waste of time.

So, I propose the following experiment: immediately after each call to get_available_results(q,w), also retrieve the output files of each completed task from that worker, instead of waiting until later.

I hypothesize that this will reduce the "dead time" between task completion and dispatch that we see as a convoy effect in this graph.

Thoughts?

BarrySlyDelgado commented 1 year ago

My initial hypothesis is that this would decrease the space between the red and green markers in the graph above. However, I'm not sure whether it would decrease the time between tasks, since the manager would have to do this for each worker with available results. If the manager also sent a task before going back to the top, it might produce a convoy effect.

dthain commented 1 year ago

I think you are correct that the manager will still prefer to retrieve results rather than send tasks (which we deliberately chose). But I think this will help to reduce the size of the gap between each rank of tasks. And it will accelerate tasks for which no additional output files are needed.

btovar commented 1 year ago

I am not sure the gap will be reduced, as there would be no change if there were only one task per worker, or if the task at the worker had no outputs. Further, any change that reduces the gap for just some workers will make the problem reappear at a slightly larger scale, and worse, because, as Barry mentions, once we hit the convoy, no work is done. The more we delay the convoy, the larger it gets and the worse the performance we are likely to get.

I think we need to consider that this might be a theoretical limit of this setting, since one way to reduce the gap is simply to reduce the number of workers: the manager can't sustain that many workers without making some of them idle. Thus, I'd focus efforts on reducing the time workers are idle by finding ways to give them tasks to do, rather than on making the manager loop more efficient, as that can buy us only so much.

Some things to try:

  1. Send one task every time we retrieve a task (a mini version of retrieve-many); see the sketch after this list.
  2. Send a task to a worker as soon as it reports that a previous task is done, without waiting for that previous task to be retrieved. We only need to manage the disk, as all other resources were freed by the previous task.
  3. See why retrieve-many is not giving the expected performance.
  4. Use overcommit (likely to create larger convoys if used just by itself).
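As a sketch of suggestion 1, reusing the hypothetical stand-in declarations from the loop sketch earlier in this thread (send_one_task is likewise a hypothetical helper, not the real cctools API):

```c
/* Suggestion 1 (a mini retrieve-many): pair every task retrieval
 * with the dispatch of one ready task. */
void send_one_task(struct manager *q);

void retrieve_one_send_one(struct manager *q)
{
	struct task *t = next_task_waiting_retrieval(q);
	if (t) {
		fetch_output_files(q, t);
		mark_retrieved_and_done(q, t);
		send_one_task(q); /* keep the worker pool from idling */
	}
}
```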
dthain commented 1 year ago

Ok, Ben and I had a long and rambling talk about this. :)

@BarrySlyDelgado please go ahead and modify the main loop so that, immediately after each call to get_available_results(q,w), we proceed to get the output files of each task from that worker (instead of waiting until later). Then let's see how that affects this workflow.
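For clarity, a minimal sketch of that modification, again reusing the hypothetical stand-in declarations from the earlier sketch (next_task_waiting_retrieval_on is likewise hypothetical; only get_available_results is from the real code):

```c
/* Proposed change: fetch output files from a worker immediately after
 * polling its results, instead of in a separate later pass. */
struct task *next_task_waiting_retrieval_on(struct manager *q, struct worker *w);

void poll_and_fetch_all(struct manager *q)
{
	struct worker *w;
	struct task *t;

	while ((w = next_worker_with_results(q))) {
		get_available_results(q, w);

		/* New: drain this worker's completed tasks right away,
		 * removing the dead time between getting a task's result
		 * and getting its output files. */
		while ((t = next_task_waiting_retrieval_on(q, w))) {
			fetch_output_files(q, t);
			mark_retrieved_and_done(q, t);
		}
	}
}
```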

BarrySlyDelgado commented 1 year ago

I'll do that. In the meantime, I have some results for different configurations. In this first graph, the main loop was changed so that each time one task was received, one task was also sent.

[Screenshot from 2023-02-09 14-20-46] [Screenshot from 2023-02-09 14-21-03]

BarrySlyDelgado commented 1 year ago

This graph shows the results of the 1st suggestion. The green and red dots are on top of each other, so it looks like only red dots. [Screenshot from 2023-02-09 14-25-15]

dthain commented 1 year ago

Yes, we talked about that too! When exactly does the send happen -- immediately after the send_results, or after the get_files? In any case, that's going to have some interesting scheduling consequences, because the task will have no choice in which worker it is assigned to.

BarrySlyDelgado commented 1 year ago

This last graph is the result of merging both of the changes from the graphs above. So, we only do a sync send/recv with one worker and then we send that worker a task. [Screenshot from 2023-02-09 14-27-23]

BarrySlyDelgado commented 1 year ago

The send happens after get_files.

dthain commented 1 year ago

Wow, that is a big change!

But... don't combine two things at once.

What is the impact of just the get_results change?

BarrySlyDelgado commented 1 year ago

This one:

> This graph shows the results of the 1st suggestion. The green and red dots are on top of each other, so it looks like only red dots. [Screenshot from 2023-02-09 14-25-15]

BarrySlyDelgado commented 1 year ago

This is an overview of some of the knobs we can adjust without changing the actual order of events. The changes are reflected in the graphs below. As described earlier, the current main loop does a synchronous send_results exchange with every worker that has results available, then retrieves at most one completed task before continuing back to the top of the loop.

These are the things we can change, encoded in the notation below: SR_x is how many workers we exchange results with per pass (1, or N for all of them); RT_y is how many completed tasks we retrieve (1, or a1/aN for all tasks from 1 or N workers); and ST_z is how many tasks we send immediately afterward.

So the current iteration of the main loop would be SR_N_RT_1, as we continue after retrieving one task. This notation gives a general idea of what is going on in each graph.

This is a run with the original main loop, for reference: [Screenshot from 2023-02-10 15-00-08]

Next we have the graph for receiving one task, not continuing, and sending one task (SR_N_RT_1_ST1). Found here. [Screenshot from 2023-02-10 15-14-39]

Next we have the graph for doing send results for N workers and retrieving all tasks from N workers (SR_N_RT_aN). This introduces a new function, receive_all_tasks_from_worker(q,w). NOTE: these runs are for the blast example fitting one task to a worker, so there is only one task on a worker. Found here. [Screenshot from 2023-02-10 15-26-52]

Next we have the graph for doing send results with 1 worker and receiving all tasks from that worker (SR_1_RT_a1). Found here. Notably, this had a strange long result. I thought it could be due to where I implemented receive_all_tasks_from_worker, so I did a slightly different version, found here; however, it came out worse. So look at this one skeptically. [Screenshot from 2023-02-10 15-35-03] [Screenshot from 2023-02-10 16-24-41]

Lastly, we have the graph for doing send results with 1 worker, receiving all tasks from that worker, and sending one task (SR_1_RT_a1_ST_1). Found here. [Screenshot from 2023-02-10 15-37-39]

In the last graph, all cached executions end before 15 seconds, which is the best so far. However, retrieving tasks pushes the overall execution time past 40 seconds.
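For quick reference, the configurations above and their qualitative outcomes:

  1. SR_N_RT_1 (original): send results for N workers, retrieve one task, continue.
  2. SR_N_RT_1_ST1: retrieve one task, do not continue, send one task.
  3. SR_N_RT_aN: send results for N workers, retrieve all tasks from those workers.
  4. SR_1_RT_a1: send results with 1 worker, retrieve all of its tasks; showed a strange long result, to be viewed skeptically.
  5. SR_1_RT_a1_ST_1: as SR_1_RT_a1, plus send one task; cached executions finish before 15 seconds, but overall execution passes 40 seconds.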

dthain commented 1 year ago

Barry, this is a great writeup. Will take me a bit to digest what's going on. SR_1_RT_a1 is very puzzling. Can you explain it?

BarrySlyDelgado commented 1 year ago

Not currently. It's possible I may have messed something up on that branch. I'll take a deeper look to see if I can come up with anything.

dthain commented 1 year ago

Ok, I had a chance to look through the configurations. The final one makes sense to me, except I note that after polling all asynchronous updates, it then only retrieves the complete tasks from one worker. That might explain all of the delayed results in that graph.

Try one more modification: take SR_1_RT_a1_ST_1 and remove the break from line 4198.
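In terms of the sketches above, the suggestion is roughly the following (receive_all_tasks_from_worker is named earlier in this thread; send_one_task and the rest remain hypothetical stand-ins, and line 4198 refers to the actual source, which is not reproduced here):

```c
/* Removing the break turns "handle one worker per pass" (SR_1_...)
 * into "drain every worker with results per pass" (SR_N_...). */
void receive_all_tasks_from_worker(struct manager *q, struct worker *w);

void drain_workers(struct manager *q)
{
	struct worker *w;
	while ((w = next_worker_with_results(q))) {
		get_available_results(q, w);
		receive_all_tasks_from_worker(q, w);
		send_one_task(q);
		/* break;   <-- removed: continue on to the other workers */
	}
}
```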

BarrySlyDelgado commented 1 year ago

Here is the resulting graph, generated from the loop here. [Screenshot from 2023-02-14 11-02-12]

BarrySlyDelgado commented 1 year ago

One more. This is the result of receiving all the tasks from N workers and sending N of them: here. [Screenshot from 2023-02-14 13-56-16]

I think all the graphs show that there is definitely some balancing to be done between sending tasks and receiving them, in order to reduce the dead time between tasks on workers.

BarrySlyDelgado commented 1 year ago

Here is the fix for SR_1_RT_a1_ST_1 (here). Essentially, we check whether there are any more tasks that need to be sent. If there are no more tasks to send, SR_1_RT_a1_ST_1 becomes SR_N_RT_aN. The bulk of the tasks (cached workers) finishes in roughly 13s, compared to the original, which finishes in roughly 17s. There are side-by-side comparisons in the slide deck for tomorrow. [Screenshot from 2023-02-14 16-01-34]
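A minimal sketch of that check, again with the hypothetical stand-ins from the earlier sketches (tasks_left_to_send is likewise hypothetical):

```c
/* The fix: behave as SR_1_RT_a1_ST_1 (stop after one worker and send
 * a task) only while tasks remain to be dispatched; once nothing is
 * left to send, keep draining every worker, i.e. SR_N_RT_aN. */
int tasks_left_to_send(struct manager *q);

void poll_results_with_fallback(struct manager *q)
{
	struct worker *w;
	while ((w = next_worker_with_results(q))) {
		get_available_results(q, w);
		receive_all_tasks_from_worker(q, w);
		if (tasks_left_to_send(q)) {
			send_one_task(q);
			break; /* one worker per pass while still sending */
		}
	}
}
```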

dthain commented 5 months ago

@BarrySlyDelgado this bug is fixed, but we still need to report it out as part of a research paper. The final bit is to gather all the data together, name the files nicely, etc., and then post here where that is located. Then we can close this out.