Open WardLT opened 9 months ago
@WardLT What machine are you seeing this issue on?
I saw it on Perlmutter CPU. Here's the app: https://github.com/exalearn/fast-finetuned-forcefields/blob/main/notebooks/hydronet/1_initial-database/run_clusters.py#L155
I agree. Parsl should not be setting parallelization flags automatically when the default would do the same thing. It's not clear to me why this flag is being set, but it's definitely counterintuitive behavior.
@yadudoc it happens on all machines to the best of my knowledge.
In practice, this flag should only be set if explicitly requested by the user, not filled in as a default. It is a scheduler option, after all.
@WardLT @Andrew-S-Rosen can you comment on this issue in the context of the newly introduced htex MPI behaviour, #3016?
Personally, I'm not a huge fan of Parsl hard-coding directives that I can't change, especially when it is undocumented. What if I want to run a GPU calculation and don't want ntasks-per-node set at all? Maybe something like the following:
#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -G 4
#SBATCH -q debug
#SBATCH -J test
#SBATCH -t 00:30:00
#SBATCH -A myaccount
Edit: I guess one would only have this scenario if they are using MPI/OpenMP? How do we handle the OpenMP scenario?
There's a much-neglected ongoing conflict between different layers, each assuming it is the one in charge of allocating all the resources. It's why people set OMP_NUM_THREADS=1 when they're running htex: some entity outside of either OpenMP or htex needs to coordinate things so that the product of every component's belief about what it can allocate comes out to exactly the number of cores. (You could also leave OMP_NUM_THREADS unset and set htex max_workers=1, or use any other appropriate factorisation.)
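To make that factorisation concrete, here is a small illustrative sketch (the helper function is hypothetical, not part of Parsl or OpenMP): the point is simply that the threads-per-task and workers-per-node beliefs have to multiply out to exactly the cores on a node.

```python
# Illustrative only: check that the layers' allocation beliefs multiply
# out to exactly the cores on a node. Hypothetical helper, not a Parsl API.
def factorisation_ok(cores_per_node: int, omp_num_threads: int, htex_max_workers: int) -> bool:
    # Each htex worker runs one task at a time; each task may spawn
    # omp_num_threads OpenMP threads.
    return omp_num_threads * htex_max_workers == cores_per_node

# 64-core node: 1 thread x 64 workers, or 8 threads x 8 workers, both fit.
print(factorisation_ok(64, 1, 64))   # True
print(factorisation_ok(64, 8, 8))    # True
# Oversubscribed: OpenMP defaulting to all 64 cores under 64 workers.
print(factorisation_ok(64, 64, 64))  # False
```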
In the context of this issue #3006, htex "owns" the right to set MPI parameters: a task is an instance of process worker pool, and it wants 1 per node, so ntasks per node is 1.
If someone comes along later and tries to redefine, inside an already-running task, what a task is, in order to use the same word to run a different number of a different kind of task... that's just a variant of the neglected ongoing conflict mentioned above.
@WardLT suggests (I think) above setting tasks per node on the srun (or other launcher) that launches process_worker_pool. Interestingly, that parameter is already passed into the launcher Python code, but not onwards to the srun command line... so at least in the sbatch+srun case (assuming no other launchers are used with sbatch), it's a superficially simple change to move that option from the sbatch template to the srun template...
I set "tasks-per-node" in the system call to srun executed inside my Python function. I do not use srun to launch the process_worker_pool.
(I edited the first comment to say where to put the --tasks-per-node)
OK, so in that case, moving --tasks-per-node from the sbatch provider into the srun launcher would:
i) hopefully not change behaviour for the regular Parsl case, where a 1-per-node process_worker_pool is launched via SlurmProvider + SrunLauncher;
ii) leave you at the mercy of whichever defaults sbatch gives you for SlurmProvider + a single process_worker_pool + your own srun calls.
I think?
That sounds right. I'm not sure how much the defaults vary between Slurm systems, so I cannot speak to how much of a problem (ii) is.
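For comparison, the workaround available today is to pass the flag to the launcher yourself. A sketch, assuming current Parsl APIs (both SlurmProvider and SrunLauncher accept extra flags via their `overrides` argument; the values here are illustrative):

```python
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

# Sketch: supply ntasks-per-node to srun explicitly rather than relying
# on the value Parsl writes into the sbatch header.
provider = SlurmProvider(
    nodes_per_block=2,
    init_blocks=1,
    # SrunLauncher wraps the process_worker_pool invocation in srun;
    # extra srun flags can be supplied via `overrides`.
    launcher=SrunLauncher(overrides="--ntasks-per-node=1"),
)
```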
This issue came up again today, and I'm becoming more in favor of moving tasks-per-node to the SRunLauncher. We could be aggressive about specifying the options of the srun launcher to be resilient against changes in system defaults.
In my case, I was using a SlurmProvider configured for an MPIExecutor as follows:
from parsl.channels import LocalChannel
from parsl.executors import MPIExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

def configure_mpi_executor(name, config):
    e = MPIExecutor(
        label=name,
        mpi_launcher="srun",
        max_workers_per_block=config["max mpi apps"],
        provider=SlurmProvider(
            channel=LocalChannel(),
            exclusive=True,
            cores_per_node=config["cores per node"],
            nodes_per_block=config["nodes per block"],
            init_blocks=1,
            partition=config["partition"],
            account=config["account"],
            walltime="08:00:00",
            launcher=SimpleLauncher(),
            worker_init="",
        ),
    )
    return e
My problem was that Parsl injected a #SBATCH --cpus-per-task=8 directive into the Parsl pilot job, where the value 8 came from the cores_per_node field of the SlurmProvider defined above.
The reason that is a problem is that Slurm considers an MPI rank to be a "task", and #SBATCH --cpus-per-task=<cores per node> tells Slurm that each MPI rank gets all the cores on the node. The downstream effect is that any srun commands used inside the pilot to launch MPI apps will not be allowed to run more than one rank per node, because the first rank consumes all the CPUs on the node. The resulting error from Slurm:
srun: error: Unable to create step for job 89: More processors requested than permitted
Given that $PARSL_MPI_PREFIX already automatically contains the appropriate srun options for requesting the total ranks, ranks per node, and number of nodes (obtained from the parsl_resource_specification), there is no advantage to injecting #SBATCH --cpus-per-task into the pilot job at all; it serves no purpose and only interferes with the resources available to the apps executed inside the pilot job.
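For reference, the way $PARSL_MPI_PREFIX is used inside an app run under the MPIExecutor looks roughly like this sketch (the app body and binary name are hypothetical; the resource-specification keys follow Parsl's MPI app documentation):

```python
from parsl import bash_app

@bash_app
def run_mpi_app(parsl_resource_specification=None):
    # The MPIExecutor exports PARSL_MPI_PREFIX into the app's environment;
    # it expands to an srun/mpiexec invocation whose node and rank counts
    # come from the parsl_resource_specification passed to this app.
    return "$PARSL_MPI_PREFIX ./my_mpi_binary"  # binary name is illustrative

# Usage sketch: ask for 2 nodes with 4 ranks per node.
future = run_mpi_app(
    parsl_resource_specification={"num_nodes": 2, "ranks_per_node": 4},
)
```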
Is your feature request related to a problem? Please describe.
Parsl interferes with my mpiexec launcher settings when I use the Slurm provider. I did not know the Slurm provider sets "ntasks-per-node" in the SBATCH header, so the mpiexec or srun in my bash_apps did not behave as I expected.
Describe the solution you'd like
Either for Parsl not to set this flag when running MPI tasks, or at least to warn me that it does.
Describe alternatives you've considered
Putting a warning in the documentation.
Additional context
Workaround: set --ntasks-per-node 1 in the srun call made inside your Parsl app.