Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

don't set --exclusive by default in slurm provider #2615

Open rueberger opened 1 year ago

rueberger commented 1 year ago

When --exclusive is set in Slurm, which the Slurm provider does by default, the number of CPUs requested per node is ignored because Slurm assigns the entire node. This is a subtle footgun that overrides the user's attempts to allocate resources at a finer granularity than whole nodes.

Furthermore, parsl only spins up the number of workers originally requested by the user. For example, this config will spin up four workers per node, leaving 20 cores idle on a cluster with 24-core nodes:

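import os

import parsl
import parsl.launchers
import parsl.providers

config = parsl.config.Config(
    executors=[
        parsl.executors.HighThroughputExecutor(
            "htex",
            cores_per_worker=1,
            working_dir=os.path.expanduser('~/tmp/parsl/test'),
            provider=parsl.providers.SlurmProvider(
                partition="default",
                nodes_per_block=1,
                # Only 4 cores requested, but --exclusive (the provider
                # default) makes Slurm allocate the whole node anyway.
                cores_per_node=4,
                launcher=parsl.launchers.SrunLauncher(),
            ),
        )
    ]
)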
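
The idle-core figure quoted above can be checked with a few lines of arithmetic. This is only a sketch: `idle_cores` is a hypothetical helper, not part of parsl, and it assumes each worker occupies exactly `cores_per_worker` cores of an exclusively allocated node.

```python
# Hypothetical helper (not a parsl API): cores left unused on a node that
# Slurm has allocated exclusively to the job.
def idle_cores(node_cores: int, workers_per_node: int, cores_per_worker: int = 1) -> int:
    # Cores actually occupied by the workers parsl started on this node.
    used = workers_per_node * cores_per_worker
    return node_cores - used


# The case from the config above: 24-core node, four single-core workers.
print(idle_cores(node_cores=24, workers_per_node=4))  # 20
```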

As far as I can tell, the only reason to set --exclusive is in clusters where oversubscription is allowed.

exclusive should be set to False by default, or at the very least, a warning should be raised when exclusive is set to True and cores_per_node is not None.
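
For illustration, the suggested warning could look roughly like the sketch below. The function name and message are hypothetical and this is not how the Slurm provider is implemented; it only shows the condition described above (exclusive is True and cores_per_node is not None).

```python
import warnings
from typing import Optional


# Hypothetical check, not part of parsl: returns True if a warning was issued.
def warn_if_exclusive_overrides(exclusive: bool, cores_per_node: Optional[int]) -> bool:
    if exclusive and cores_per_node is not None:
        warnings.warn(
            f"exclusive=True allocates whole nodes, so cores_per_node={cores_per_node} "
            "will not limit the Slurm allocation",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A provider constructor could call such a check once with its own `exclusive` and `cores_per_node` arguments.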

benclifford commented 1 year ago

I haven't forgotten about this issue.

cores_per_node is also used in the scaling code for a second purpose: to figure out how many new workers to expect when starting (or ending) a block of workers. So it would be an expected situation to see both exclusive and cores_per_node values set at the same time.
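
As a rough illustration of that second use, the expected number of workers in a block might be estimated as below. This is a sketch under assumptions, not the actual scaling code: `expected_workers_per_block` is a hypothetical name, and it assumes each node hosts floor(cores_per_node / cores_per_worker) workers.

```python
import math


# Hypothetical estimate, not parsl's scaling code: workers expected from one block.
def expected_workers_per_block(
    nodes_per_block: int, cores_per_node: int, cores_per_worker: float = 1.0
) -> int:
    # Assumption: workers on a node are limited only by the declared core count.
    workers_per_node = math.floor(cores_per_node / cores_per_worker)
    return nodes_per_block * workers_per_node
```

Under this assumption, a scaling estimate stays accurate with exclusive set only if cores_per_node matches the real size of the node.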

The use of the cores_per_node parameter for multiple-but-similar purposes bothers me but I don't have a good feeling for what the user interface should be changed to look like.

I think, though, it's probably right to set exclusive to false by default: to inherit the default behaviour of the underlying batch system.