Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Remove or clarify naming of `max_workers` #1353

Open annawoodard opened 4 years ago

annawoodard commented 4 years ago

max_workers is just an alternative way of defining cores_per_worker (except on heterogeneous queues, where you probably want cores_per_worker anyways). It is just another thing for users to have to understand, and it is confusing whether this means max workers per node, per block, or across all blocks. Misconfiguring max_workers can seriously affect scaling in unexpected ways (see #1341).

I favor removing it entirely. If there is resistance to that idea, we should at least rename to max_workers_per_manager.

annawoodard commented 4 years ago

cc @mattwelborn @Lnaden

Lnaden commented 4 years ago

If you remove the max_workers and instead use only cores_per_worker, then what is the way to limit the number of workers which can be created? If you define max_blocks, cores_per_worker, nodes_per_block, and cores_per_node (resource hint), then all that says is how to allocate the resources between the workers, and with some math you can work out the max workers as:

max_workers = max_blocks * nodes_per_block * cores_per_node / cores_per_worker

If you assume that workers cannot be oversubscribed (1 task per worker), then you can also work out the max tasks. But, Parsl provides the ability to oversubscribe workers by setting cores_per_worker < 1. From the HTEX API docs:

cores_per_worker (float) – cores to be assigned to each worker. Oversubscription is possible by setting cores_per_worker < 1.0. Default=1

How can we prevent Parsl from then trying to just overassign tasks to the workers such that each worker is only able to process 1 task? Or am I misreading that and the oversubscription is on cores (i.e. 1 core can be assigned to multiple workers)?
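Assuming max_workers were removed and each worker runs exactly one task, the ceiling on concurrent workers (and thus tasks) follows directly from the remaining knobs. A minimal sketch of that arithmetic (the function name is illustrative, not Parsl API):

```python
import math

def max_concurrent_workers(max_blocks, nodes_per_block,
                           cores_per_node, cores_per_worker):
    """Upper bound on simultaneously running workers, and hence tasks,
    assuming exactly one task per worker."""
    # cores_per_worker < 1 oversubscribes: several workers share one core.
    return math.floor(max_blocks * nodes_per_block
                      * cores_per_node / cores_per_worker)

# 2 blocks of 1 node x 24 cores, 1 core per worker:
print(max_concurrent_workers(2, 1, 24, 1.0))  # 48
# Oversubscribing with half a core per worker doubles the count:
print(max_concurrent_workers(2, 1, 24, 0.5))  # 96
```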

annawoodard commented 4 years ago

max_workers = max_blocks * nodes_per_block * cores_per_node / cores_per_worker

This is not quite right. max_workers is the max workers per manager, so it is independent of max_blocks and nodes_per_block.

Say you have 24-core nodes. Currently, you can define max_workers=4, or equivalently, you can define cores_per_worker=6. Both settings will lead to 4 workers launched per node, whether or not you define cores_per_node (because the manager, which launches the workers, is on the node and can figure out how many cores are on the node). What defining cores_per_node does is allow the scaling machinery of Parsl to make a better guess about how many workers it will get given a certain number of blocks launched.
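The per-node equivalence of the two settings can be sketched as the manager's worker-count decision. This is an illustrative reconstruction of that logic, not Parsl's actual code:

```python
import math

def workers_per_node(node_cores, cores_per_worker=1.0,
                     max_workers=float('inf')):
    """Workers a manager would launch on one node: capped both by
    max_workers and by how many cores_per_worker-sized slots fit."""
    return min(max_workers, math.floor(node_cores / cores_per_worker))

# On a 24-core node, max_workers=4 and cores_per_worker=6 are equivalent:
print(workers_per_node(24, max_workers=4))      # 4
print(workers_per_node(24, cores_per_worker=6))  # 4
```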

How can we prevent Parsl from then trying to just overassign tasks to the workers such that each worker is only able to process 1 task? Or am I misreading that and the oversubscription is on cores (i.e. 1 core can be assigned to multiple workers)?

The oversubscription is on cores-- one core can be assigned to multiple workers. Each worker always runs one task.

annawoodard commented 4 years ago

(so to clarify: if max_workers were eliminated, the way to limit the number of workers would be with cores_per_worker)

Lnaden commented 4 years ago

I should have been a bit more clear, the equation I wrote, assuming the control variable max_workers is removed, was more meant to be "Calculate the maximum number of concurrent workers, and thus tasks, which this Parsl Config can run." In that context, is the equation correct?

annawoodard commented 4 years ago

Ah yes, sorry I misunderstood-- your equation is correct.

So, since each worker runs one task, does that clarify your original question?

Lnaden commented 4 years ago

Yes it does! And there is no way for 1 worker to run multiple tasks at the same time?

On a related question, if one does not provide the resource hints, how does Parsl determine the number of cores each node has for workers to consume? i.e., if you request a node which has 24 cores and launch 1 block on the whole node, when/how will Parsl detect the node is consumed?
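On the core-detection question: since the manager process runs on the allocated node itself, it can query the core count locally when no cores_per_node hint is given. A minimal sketch of that kind of detection, not Parsl's exact implementation:

```python
import math
import multiprocessing

def detected_workers(cores_per_worker=1.0):
    """Detect the node's core count from inside the node and derive
    how many workers fit, given cores_per_worker."""
    node_cores = multiprocessing.cpu_count()
    return math.floor(node_cores / cores_per_worker)

# On a 24-core node with the default cores_per_worker=1.0,
# this would report 24 available worker slots.
print(detected_workers())
```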