Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Parsl makes my MPI tasks behave unexpectedly when using SlurmProvider #3006

Open WardLT opened 9 months ago

WardLT commented 9 months ago

Is your feature request related to a problem? Please describe.
Parsl interferes with my mpiexec launcher settings when I use the Slurm provider. I did not know that the Slurm provider sets "ntasks-per-node" in the SBATCH header, so the mpiexec or srun calls in my bash_apps did not behave as I expected.

Describe the solution you'd like
Either have Parsl not set this flag when running MPI tasks, or at least warn me that it does.

Describe alternatives you've considered
Putting a warning in the documentation.

Additional context
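To make the interaction concrete, here is a sketch of the kind of bash_app where this bites (the binary name and option handling are placeholders, not my actual app, which is linked in a later comment):

from parsl import bash_app

@bash_app
def run_mpi_sim(total_ranks: int) -> str:
    # Sketch of the surprise: without an explicit --ntasks-per-node, this srun
    # picks up the value that the SlurmProvider wrote into the #SBATCH header
    # for the worker pool, so the ranks do not spread across nodes the way
    # the bare command suggests. Passing --ntasks-per-node here avoids that.
    return f'srun -n {total_ranks} ./simulate.exe'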

yadudoc commented 9 months ago

@WardLT What machine are you seeing this issue on?

WardLT commented 9 months ago

I saw it on Perlmutter CPU. Here's the app: https://github.com/exalearn/fast-finetuned-forcefields/blob/main/notebooks/hydronet/1_initial-database/run_clusters.py#L155

Andrew-S-Rosen commented 8 months ago

I agree. Parsl should not be setting parallelization flags automatically when the default would do the same thing. It's not clear to me why this flag is being set, but it's definitely counterintuitive behavior.

@yadudoc it happens on all machines to the best of my knowledge.

Andrew-S-Rosen commented 8 months ago

In practice, this line should only be set if explicitly requested by the user, not filled in as a default. It is a scheduler option after all.

benclifford commented 7 months ago

@WardLT @Andrew-S-Rosen can you comment on this issue in the context of the newly introduced htex MPI behaviour, #3016?

Andrew-S-Rosen commented 7 months ago

Personally, I'm not a huge fan of Parsl hard-coding directives that I can't change, especially when the behavior is undocumented. What if I want to run a GPU calculation and don't want ntasks-per-node set at all? Maybe something like the following:

#!/bin/bash

#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -G 4
#SBATCH -q debug
#SBATCH -J test
#SBATCH -t 00:30:00
#SBATCH -A myaccount

Edit: I guess one would only have this scenario if they are using MPI/OpenMP? How do we handle the OpenMP scenario?
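(For reference, the sort of provider config I would expect to produce a header like that, pushing the GPU flags through scheduler_options and asking for no task counts at all, is roughly the following sketch; the parameter values are made up.)

from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

# Sketch: GPU-related directives go in as raw #SBATCH lines via scheduler_options;
# note that nothing here asks for ntasks-per-node.
provider = SlurmProvider(
    account="myaccount",
    nodes_per_block=1,
    walltime="00:30:00",
    scheduler_options="#SBATCH -C gpu\n#SBATCH -G 4\n#SBATCH -q debug",
    launcher=SimpleLauncher(),
)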

benclifford commented 7 months ago

> Edit: I guess one would only have this scenario if they are using MPI/OpenMP?

There's a much-neglected, ongoing conflict between different layers that each assume they are the one in charge of allocating all the resources. It's why people set OMP_NUM_THREADS=1 when they're running htex: some entity outside of either OpenMP or htex needs to coordinate things so that the product of every component's belief about what it can allocate comes out to exactly the number of cores. (You could also leave OMP_NUM_THREADS unset and set htex max_workers=1, or use any other appropriate factorisation.)
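(For concreteness, one such factorisation, with made-up numbers for a 64-core node, might look like this sketch:)

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Sketch: htex and OpenMP split the node between them, 8 workers x 8 threads = 64 cores.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex",
            max_workers=8,
            provider=SlurmProvider(worker_init="export OMP_NUM_THREADS=8"),
        )
    ]
)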

In the context of this issue, #3006, htex "owns" the right to set MPI parameters: a task is an instance of the process worker pool, it wants one per node, so ntasks-per-node is 1.

If someone comes along later and tries to redefine, inside an already running task, what a task is, in order to use the same word to run a different number of a different kind of task... that's just a variant of the above-mentioned neglected ongoing conflict...

@WardLT suggests (I think) setting tasks-per-node on the srun (or other launcher) that launches process_worker_pool. Interestingly, that parameter is already passed into the launcher Python code, but not onwards to the srun command line... so at least in the sbatch+srun case (assuming no other launchers are used with sbatch), it's a superficially simple change to move that setting from the sbatch template to the srun template...
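(As a user-level sketch of where that flag could live instead: the SrunLauncher already accepts extra srun arguments through its overrides parameter, so the per-node worker-pool count could be expressed there, though today the provider still writes the #SBATCH line as well.)

from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

# Sketch: put the per-node count on the srun that starts process_worker_pool,
# rather than relying on the #SBATCH header written by the provider.
provider = SlurmProvider(
    nodes_per_block=2,
    launcher=SrunLauncher(overrides="--ntasks-per-node=1"),
)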

WardLT commented 7 months ago

> @WardLT suggests (I think) setting tasks-per-node on the srun (or other launcher) that launches process_worker_pool

I set "tasks-per-node" in the system call to srun executed inside my Python function. I do not use srun to launch the process_worker_pool.

(I edited the first comment to say where to put the --tasks-per-node)

benclifford commented 7 months ago

OK, so in that case, moving --tasks-per-node from the sbatch provider into the srun launcher would: i) hopefully not change behaviour for the regular Parsl case, where a 1-per-node process_worker_pool is launched via SlurmProvider + SrunLauncher; and ii) leave you at the mercy of whichever defaults sbatch gives you for SlurmProvider + a single process_worker_pool + your own srun calls, I think?

WardLT commented 7 months ago

That sounds right. I'm not sure how much the defaults vary per Slurm system, so I cannot speak to how much of a problem (ii) is.

WardLT commented 3 months ago

This issue came up again today, and I'm increasingly in favor of moving tasks-per-node to the SrunLauncher. We could be aggressive about specifying the srun launcher's options so that we are resilient against changes in system defaults.

christopherwharrop-noaa commented 3 months ago

In my case, I was using a SlurmProvider configured for an MPIExecutor as follows:

from parsl.channels import LocalChannel
from parsl.executors import MPIExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

def configure_mpi_executor(name, config):
    # Build an MPIExecutor whose pilot job is submitted via a SlurmProvider;
    # MPI Apps inside the pilot are launched with srun ($PARSL_MPI_PREFIX).
    e = MPIExecutor(
        label=name,
        mpi_launcher="srun",
        max_workers_per_block=config["max mpi apps"],
        provider=SlurmProvider(
            channel=LocalChannel(),
            exclusive=True,
            cores_per_node=config["cores per node"],
            nodes_per_block=config["nodes per block"],
            init_blocks=1,
            partition=config["partition"],
            account=config["account"],
            walltime="08:00:00",
            launcher=SimpleLauncher(),
            worker_init="",
        ),
    )
    return e

My problem was that Parsl injected a #SBATCH --cpus-per-task=8 directive into the pilot job's batch script, where the value 8 came from the cores_per_node field of the SlurmProvider defined above.

That is a problem because Slurm considers an MPI rank to be a "task", so #SBATCH --cpus-per-task=<cores per node> tells Slurm that each MPI rank gets all the cores on the node. The downstream effect is that any srun command used inside the pilot to launch MPI Apps is not allowed to run more than one rank per node, because the first rank consumes all the CPUs on the node. The resulting error from Slurm:

srun: error: Unable to create step for job 89: More processors requested than permitted

Given that $PARSL_MPI_PREFIX already contains the appropriate srun options for the total number of ranks, ranks per node, and number of nodes (obtained from parsl_resource_specification), there is no advantage or reason to inject #SBATCH --cpus-per-task into the pilot job at all; it serves no purpose and only interferes with the resources available to the Apps being executed inside the pilot job.
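For reference, the pattern inside the pilot looks roughly like the sketch below (the binary name and rank counts are placeholders); the srun geometry comes entirely from $PARSL_MPI_PREFIX and the resource specification, not from the SBATCH header.

from parsl import bash_app

@bash_app
def mpi_app(parsl_resource_specification={}):
    # $PARSL_MPI_PREFIX expands to the srun prefix built from the resource
    # specification passed at call time (nodes, total ranks, ranks per node).
    return '$PARSL_MPI_PREFIX ./mpi_app.exe'

future = mpi_app(parsl_resource_specification={
    "num_nodes": 2,
    "ranks_per_node": 8,
    "num_ranks": 16,
})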