cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

[WQ] how does a worker determine what cores are used for which tasks? #3883

Closed svandenhaute closed 3 months ago

svandenhaute commented 3 months ago

WQ has specific functionality which assigns GPUs to tasks, and it keeps track of which GPUs are assigned to which tasks. Does something similar exist for CPUs?

For example, consider a worker with 64 cores which receives two tasks requiring 32 cores each. It knows that it has just enough resources, so it will start both tasks. How do the 64 cores get distributed between the tasks? Specifically, within a task, what would be the output of e.g.

python -c 'import os; print(os.sched_getaffinity(0))'

Suppose that these tasks benefit greatly from binding processes / threads to cores (e.g. PyTorch). How should the binding be performed in this case? It is unknown which cores actually belong to which task. For GPUs, this is conveniently handled by WQ itself, since it renumbers the available GPUs per task starting at 0: a task can assume GPU 0 is available if one is requested, GPUs 0 and 1 if two are requested, and so on. (If I read the code correctly, this is mostly implemented in work_queue_gpus.c.)
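For reference, this is roughly what a task could print to see what it actually got. A sketch only: I'm assuming the worker exposes the renumbered GPUs through CUDA_VISIBLE_DEVICES; the affinity call is just the standard Linux one.

import os

# CPU affinity of the current process: WQ itself does not restrict it today,
# so this will usually show every core on the node.
cores = os.sched_getaffinity(0)
print("visible cores:", sorted(cores), "count:", len(cores))

# GPU assignment: if the worker renumbers GPUs per task as described above,
# a small list starting at 0 should appear here (my assumption).
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))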

tagging @benclifford because this came up in a discussion with him

dthain commented 3 months ago

Work Queue (and TaskVine) do not currently bind tasks to specific cores. As you note, they simply ensure that the total number of cores is respected and let the OS take care of the assignments in the usual way. This is in part because tasks are often not single processes but complex process trees.

GPUs are different because they are not really managed in any effective way by the OS. Unless otherwise instructed, applications tend to use whatever GPUs they can get their hands on. And so, we have to do some specific assignments to avoid chaos.

Is there a specific reason that you would like a specific binding to be done?

svandenhaute commented 3 months ago

I'm in the situation where a single task consists of running a number of separate Python processes in parallel in the background, each of which loads PyTorch and performs CPU and/or GPU calculations:

python client.py --device=0 &
python client.py --device=1 &
python client.py --device=2 &
...

I've noticed that the CPU performance of each client degrades severely the more clients are present, even though I make sure that the ratio (ncores / nclients) stays constant. Something inefficient is therefore going on, but there can be a number of reasons. PyTorch's performance tuning guide mentions that for efficient multithreading it's important to set environment variables such as OMP_PROC_BIND, OMP_PLACES, and KMP_AFFINITY. Even with trial and error, I cannot get consistently good performance. The best I had was with OMP_PROC_BIND=false and KMP_AFFINITY=granularity=fine,compact,1,0, but it's still not great.
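Roughly, the per-client launch looks like this (a sketch only; the client count, thread count, and binding values are placeholders for whatever combination I'm currently trying):

import os
import subprocess

NUM_CLIENTS = 3          # placeholder: clients launched by this task
THREADS_PER_CLIENT = 8   # placeholder: cores intended for each client

procs = []
for device in range(NUM_CLIENTS):
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(THREADS_PER_CLIENT)
    env["OMP_PROC_BIND"] = "close"                        # illustrative value
    env["OMP_PLACES"] = "cores"                           # illustrative value
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # illustrative value
    procs.append(subprocess.Popen(["python", "client.py", f"--device={device}"], env=env))

for p in procs:
    p.wait()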

Since the GPU performance is fine, I'm expecting that it has to do with processes / threads interfering with each other. Do you have suggestions on how to tackle this? At any given moment, it is possible that a single worker needs to execute multiple tasks. For example, if the worker is spawned on a compute node with 64 cores and 8 GPUs, it is possible that it needs to execute two tasks, which each require four clients to be spawned.

dthain commented 3 months ago

Hmm, that's a tricky one. One possible problem is this: if each task is attempting to use all the available cores on the machine, then running more clients will just slow things down.

In general, we try to force/encourage tasks to only use the resources that they have been assigned. For example, WQ automatically sets OMP_NUM_THREADS for each task to correspond to whatever was set with task.set_cores(n).
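A minimal sketch of that pattern with the Python bindings (method names as mentioned above; older bindings spell this specify_cores instead of set_cores):

import work_queue as wq

q = wq.WorkQueue(port=9123)

# Because of set_cores(32), WQ exports OMP_NUM_THREADS=32 into the task's
# environment; anything the command launches inherits that value unless
# the command overrides it itself.
t = wq.Task("python client.py --device=0")
t.set_cores(32)
q.submit(t)

while not q.empty():
    done = q.wait(5)
    if done:
        print("task finished with status", done.return_status)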

But there are a number of ways that apps and libraries may escape or ignore those controls. It may be necessary to "try harder" to get them to respect the assigned resources.

@btovar could you give us some suggestions here on how to instrument the tasks, and determine whether they are using the assigned number of cores?

benclifford commented 3 months ago

as a comparison, Parsl's high throughput executor can pin specific workers to specific cores:

https://github.com/Parsl/parsl/blob/c3df044b862bd93cd492332217a6e7d9b493a87a/parsl/executors/high_throughput/process_worker_pool.py#L732

The use case there is not about oversubscribing the CPU / controlling the number of concurrent processes/threads...

... but about pinning specific tasks to stay on specific cores, to avoid migrating processes between cores: users' work processes are pinned to a set of cores chosen by the parsl high throughput executor worker pool. Having parsl's worker pool choose the pins means that the pins don't overlap with other work, which parsl pins to different cores. (that's basically the same as GPU handling, which is always this explicit, I think)

In Parsl land, that's especially something that people running on ALCF's Polaris seem interested in (to do with locality of things that I don't know much about), but I've also personally wanted to do it on NERSC machines.

benclifford commented 3 months ago

(more generally, in my head there are fungible and non-fungible things to allocate: the fungible ones we can just count (the "don't overload the CPU" approach to CPU allocation), while the non-fungible ones we have to name and allocate individually. GPUs, for example, and cores-that-we-want-to-pin in this issue; in parsl we also do some MPI rank allocation where we need to assign particular MPI ranks to different tasks.)

btovar commented 3 months ago

Since cores are not oversubscribed, I wonder if we can do this entirely from the worker's side. The worker sets the affinity as needed and assigns cores to tasks as cores become available. I don't think the manager needs to keep track of which cores are assigned to each task? It simply asks the worker to use N cores for a task, and not let the OS move it once it is running.
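Roughly what I have in mind, purely as an illustration (not the worker's current code): the worker keeps a pool of free core IDs, hands a disjoint set to each task when it starts, pins the task's process to that set, and returns the cores when the task ends.

import os

class CorePool:
    # Illustrative sketch only: worker-side bookkeeping of which cores
    # belong to which running task.

    def __init__(self):
        self.free = sorted(os.sched_getaffinity(0))  # cores this worker may use
        self.assigned = {}                           # task_id -> list of cores

    def allocate(self, task_id, ncores):
        if len(self.free) < ncores:
            return None                              # not enough free cores right now
        cores, self.free = self.free[:ncores], self.free[ncores:]
        self.assigned[task_id] = cores
        return cores

    def pin(self, task_id, pid):
        # Restrict the task's process (and any children it forks later)
        # to the cores it was given.
        os.sched_setaffinity(pid, self.assigned[task_id])

    def release(self, task_id):
        self.free = sorted(self.free + self.assigned.pop(task_id))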

benclifford commented 3 months ago

@btovar in parsl this happens entirely in process_worker_pool.py, which is the moral equivalent of work_queue_worker - the executor process that knows about everything that's happening on this one node (this OS instance... the entire space of cores over which the OS might move user work processes).

so that means "yes"?

dthain commented 3 months ago

Hold on, I think we are moving toward a solution without first fully understanding the problem.

Sander is telling us that adding more CPU-bound processes slows the work down. That at least sounds like the CPUs are oversubscribed by accident. (And affinity doesn't solve that problem.) I think what we need to do first is measure what those tasks are actually doing. BenT, can you suggest the best way for Sander to observe how many cores are actually used by each task?

btovar commented 3 months ago

We can give something like this a try, assuming that the resource_monitor was installed alongside Work Queue from CCTools:

resource_monitor -Omon.summary  python client.py --device=0

At the end of execution mon.summary should have the resources consumed.

dthain commented 3 months ago

Also, this just occurred to me. Suppose that you define a task t and call t.set_cores(32) to allocate 32 cores for it. This will cause WQ to set OMP_NUM_THREADS to 32 in the environment of that task, under the assumption that OpenMP will read that environment variable and use 32 threads.

However, if your task consists of a shell script that runs multiple processes like you indicated:

python client.py --device=0 &
python client.py --device=1 &
python client.py --device=2 &

Then each of those three processes will read OMP_NUM_THREADS and go ahead and start using 32 threads. So then the task overall is using 3x32 threads, which perhaps is not the intended outcome.
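One way around that (a sketch; only the client count is assumed, the total comes from whatever WQ set): divide the task-wide OMP_NUM_THREADS among the clients before launching them, so each client overrides the inherited value.

import os
import subprocess

n_clients = 3
# OMP_NUM_THREADS is whatever WQ derived from set_cores(); fall back to the
# machine's core count only so the sketch runs standalone.
total = int(os.environ.get("OMP_NUM_THREADS", os.cpu_count()))
per_client = max(1, total // n_clients)

procs = []
for device in range(n_clients):
    env = dict(os.environ, OMP_NUM_THREADS=str(per_client))
    procs.append(subprocess.Popen(["python", "client.py", f"--device={device}"], env=env))

for p in procs:
    p.wait()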

benclifford commented 3 months ago

this should be pretty easy to measure - i suggested to @svandenhaute elsewhere to try a manual setup of what a node would look like with cpu pinning (not using WQ), and in that setup it should also be visible who is trying to use how much CPU

svandenhaute commented 3 months ago

Thanks for all the suggestions, am now recreating a manual example outside of WQ. I have resource_monitor installed.

Also, this just occurred to me. Suppose that you define a task t and call t.set_cores(32) to allocate 32 cores for it. This will cause WQ to set OMP_NUM_THREADS to 32 in the environment of that task, under the assumption that OpenMP will read that environment variable and use 32 threads.

However, if your task consists of a shell script that runs multiple processes like you indicated:

python client.py --device=0 &
python client.py --device=1 &
python client.py --device=2 &

Then each of those three processes will read OMP_NUM_THREADS and go ahead and start using 32 threads. So then the task overall is using 3x32 threads, which perhaps is not the intended outcome.

Very valid point. I normally print the number of threads PyTorch uses (via torch.get_num_threads()) and it never behaved the way you suggest, but I will look into why that behavior is not happening, because it does sound reasonable. My task descriptions do have a requirement of ncores = nclients x ncores_per_client, so the thread count should indeed end up 'wrongly' set to the task-wide total. I'll report back.
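Concretely, I'll have each client print something like this (a small diagnostic, nothing more):

import os
import torch

# What OpenMP was told, what PyTorch actually chose, and what the OS allows.
print("OMP_NUM_THREADS      :", os.environ.get("OMP_NUM_THREADS", "<not set>"))
print("torch.get_num_threads:", torch.get_num_threads())
print("sched_getaffinity    :", len(os.sched_getaffinity(0)), "cores")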

svandenhaute commented 3 months ago

summary_8client.txt summary_1client.txt

summary_1client.txt is what I get from resource_monitor applied to the client in a run with 1 client, N cores per client, and an amount X of work to compute per client. summary_8client.txt is resource_monitor applied to one of eight clients, still with N cores per client and an amount X of work per client (i.e. in total eight times more cores and eight times more compute work, distributed equally among the clients).

svandenhaute commented 3 months ago

I have not yet tried to properly set e.g. OMP_NUM_THREADS to N and override SLURM's default of 8N when asking for a job with 1 task and 8N CPUs per task (mimicking the actual scenario that will happen within the WQ/Parsl framework). I just want to check: is this the output you were looking for? Ideally the runtime of both processes should be pretty much identical, but currently there's a factor of two difference. Is the context-switching figure a concern? What should I be aiming for?

btovar commented 3 months ago

The wall time seems close? 102s for 1 client and 129s for 8 clients. However, the CPU time for 8 clients is much larger (~30s vs ~157s). If they are performing the same kind of work, is this the overhead you were talking about?

svandenhaute commented 3 months ago

How is the CPU time computed? Is it the cumulative time that any of the threads in the process was actually getting executed on any of the cores? In that case, yes, that overhead is nonsensical. The timing difference is indeed not super big in my debug case, but there are some differences with the 'real-world' scenario in terms of environment settings.

When printing CPU affinities, it seems that SLURM's native setup ensures that the 1-client process gets bound to 7x2=14 hyperthreads, whereas in the case of 8 clients the affinity is simply not set (i.e. it looks like it's bound to all available hyperthreads on the node)

svandenhaute commented 3 months ago

Also, this just occurred to me. Suppose that you define a task t and call t.set_cores(32) to allocate 32 cores for it. This will cause WQ to set OMP_NUM_THREADS to 32 in the environment of that task, under the assumption that OpenMP will read that environment variable and use 32 threads.

However, if your task consists of a shell script that runs multiple processes like you indicated:

python client.py --device=0 &
python client.py --device=1 &
python client.py --device=2 &

Then each of those three processes will read OMP_NUM_THREADS and go ahead and start using 32 threads. So then the task overall is using 3x32 threads, which perhaps is not the intended outcome.

This was indeed the issue -_- It was complicated by affinities being arbitrarily changed by python imports (torch), and the fact that our cluster's file system is very slow, which just made the startup time of each of the clients terrible. In the end, "optimal" performance did not require manual process binding.

Thanks for the help!

dthain commented 3 months ago

Glad you were able to work it out. I'm going to change this issue over to a "Discussion" in case others run into a similar issue.