`srun` on login node

Running the following script (`srun_ansys.sh`):
```bash
export ANS_MULTIPLE_NODES=1
export HYDRA_BOOTSTRAP=slurm
echo "Starting..."
srun --export=all --job-name=srun_ansys --partition=qsmall --nodes=6 --ntasks-per-node=8 bash -c "hostname && /ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 48 -b -i /home/ubuntu/input.inp"
```
as:

```bash
./srun_ansys.sh
```

on a login node, gives MAPDL starting on each node INDIVIDUALLY.
So MAPDL should never be executed with `srun` if we want to have only one instance spawning across multiple nodes.
That makes sense.
If you want to spawn one single instance, you should do something like the above but using `sbatch`. This is all because MAPDL already uses MPI, so `srun` is not needed.
So it seems we will have to go with `sbatch` if we ever want to implement SLURM support.
- `BatchHost` node: it is the main node among the nodes allocated to the job. If your submitted script starts single-core processes, they will normally run there. For instance, in the case of MAPDL, this is the node that holds the main MAPDL process; the other nodes hold the "slave" processes. This node is also the one you need to connect to when using the gRPC server.
- … `srun` to run stuff.
- … `srun` to start MAPDL.

I envision something like:
PyMAPDL is on the login node. There we start a Python console and write:

```python
from ansys.mapdl.core import launch_mapdl

slurm_options = {
    "partition": "qsmall",
    "nodes": 6,
    "ntasks-per-node": 8,
}
mapdl = launch_mapdl(slurm_options=slurm_options)
# If no `slurm_options` are used, we should specify something like:
#     mapdl = launch_mapdl(ON_SLURM=True)
# so we do the switch, because there is no way to know whether this is a normal
# machine or a login node of an HPC cluster (check for an sbatch process??).
mapdl.prep7()
...
```
which under the hood will use an `sbatch` command to run something like:

```bash
# --partition=qsmall --job-name=sbatch_ansys --ntasks=16 --nodes=8 --ntasks-per-node=2 --export=all
sbatch %slurm_options% --wrap "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 8 -b -i /home/ubuntu/input.inp"
```
Proper handling of the options should be done (see the sketch below), but each MAPDL instance will be an independent SLURM job, so it can use all the resources allocated to that job. That simplifies things a lot.
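As a rough illustration of that option handling, the `slurm_options` dictionary could be flattened into command-line flags before being prepended to the `sbatch` call. This is only a sketch; the helper name `slurm_options_to_flags` is hypothetical and not part of PyMAPDL:

```python
def slurm_options_to_flags(slurm_options):
    """Convert a dict like {"partition": "qsmall", "nodes": 6} into sbatch flags."""
    return " ".join(f"--{key}={value}" for key, value in slurm_options.items())

slurm_options = {"partition": "qsmall", "nodes": 6, "ntasks-per-node": 8}
flags = slurm_options_to_flags(slurm_options)
# flags == "--partition=qsmall --nodes=6 --ntasks-per-node=8"
```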
One complication could be how to obtain the IP of the remote instance. We can use the `sbatch` option `--parsable` so that it returns only the job ID (and cluster, if any).
With the job ID, we can query the list of nodes, and we can query the `BatchHost` hostname using:

```bash
scontrol show job $JOBID | grep BatchHost | cut -d'=' -f2
```
We can use the `BatchHost` hostname as the `ip` argument of `launch_mapdl`.
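Putting the pieces above together, a minimal sketch of what this flow could look like (not actual PyMAPDL code; it assumes MAPDL is already up on its default gRPC port when we connect, whereas a real implementation would need to wait until the job actually starts):

```python
import subprocess

from ansys.mapdl.core import launch_mapdl

# Submit the job. `--parsable` makes sbatch print only the job ID (and cluster, if any).
cmd = (
    "sbatch --parsable --partition=qsmall --nodes=6 --ntasks-per-node=8 "
    '--wrap "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 48 -b -i /home/ubuntu/input.inp"'
)
job_id = subprocess.check_output(cmd, shell=True, text=True).strip().split(";")[0]

# Query the BatchHost (main node) of the job.
batch_host = subprocess.check_output(
    f"scontrol show job {job_id} | grep BatchHost | cut -d'=' -f2",
    shell=True,
    text=True,
).strip()

# Connect to the MAPDL instance running on the BatchHost node.
mapdl = launch_mapdl(ip=batch_host, start_instance=False)
```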
For the MAPDL pool case, there should be no problem: it works the same as with `launch_mapdl`, since each MAPDL instance will have its own SLURM job.
We just need to make sure that the SLURM options do not represent the aggregated values, but the values for each single job.
For instance:

```python
from ansys.mapdl.core import MapdlPool

slurm_options = {
    "partition": "qsmall",
    "nodes": 6,
    "ntasks-per-node": 8,
}
pool = MapdlPool(8, slurm_options=slurm_options)
# Spawns 8 MAPDL instances with 48 cores each (6 nodes and 8 tasks per node for each instance)
```
I think making the division ourselves would just complicate things.
On the login node we will do something like:

```bash
sbatch my_script.sh
```

where `my_script.sh` contains:

```bash
# Define SLURM options here
source /path/to/my/venv/bin/activate
python my_python_script
```
In this case, under the hood we will do a normal `ansysXXX` call with extra environment variables which activate the multiple-node mode (see this issue body).
In theory, the IP remains unchanged (local, `127.0.0.1`).
This case seems simple: the code is managed as in the non-HPC local case; only the env vars need to be in place (see the sketch below).
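A minimal sketch of what `my_python_script` could effectively amount to, assuming it runs inside the SLURM allocation (whether PyMAPDL sets these env vars under the hood or the user sets them manually is an open design question):

```python
import os

from ansys.mapdl.core import launch_mapdl

# Env vars from the top of this issue: enable multi-node MAPDL and
# MPI tight integration with SLURM.
os.environ["ANS_MULTIPLE_NODES"] = "1"
os.environ["HYDRA_BOOTSTRAP"] = "slurm"

# Regular local launch; the IP stays 127.0.0.1.
# The number of cores could be taken from SLURM (SLURM_NTASKS).
mapdl = launch_mapdl(nproc=int(os.environ.get("SLURM_NTASKS", "1")))
mapdl.prep7()
```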
For this case, things get a bit trickier. We have two options:

1. Do not divide the resources. We just submit an `srun` and let SLURM manage things. It will create multiple MAPDL processes which spawn across the same nodes. For instance:

   MapdlPool 0: 2 Mapdl instances, on 4 nodes

   This might not be a reasonable resource-sharing scheme. The next mode proposes something different.

2. Divide the resources ourselves. We divide the number of nodes across the number of instances. For instance:

   MapdlPool 0: 2 Mapdl instances, on 4 nodes
The first case is the simplest to implement, but I can see the performance not being as good as in the second.
I guess from the user perspective, we should keep things simple.
I can imagine a boolean option like `share_nodes`: if `True`, it uses the first method; if `False`, it uses the second.
```python
from ansys.mapdl.core import MapdlPool

slurm_options = {
    "partition": "qsmall",
    "nodes": 16,
    "ntasks-per-node": 1,
}

# All MAPDL instances spawn across all nodes,
# using MAPDL -nproc 16
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=True)

# Each MAPDL instance spawns on its own subset of nodes,
# using MAPDL -nproc 16 and:
# on the first MAPDL instance:  srun --nodes=node0-1
# on the second MAPDL instance: srun --nodes=node2-3
# and so on...
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=False)
```
But what should happen when the resources cannot be evenly split? For instance, the job has 5 nodes allocated, but we are requesting only 4 MAPDL instances (this case is sketched below).
I guess we should fall back to mode 1 (`share_nodes=True`) if no mode is specified (`share_nodes=None`)?
I guess it will depend on how much penalty there is between both schemes.
Anyway, a warning message is always appreciated.
Feedback is appreciated here too.
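To make the `share_nodes=False` splitting (and the uneven case above) concrete, here is a minimal sketch of how the allocated nodes could be divided among the pool instances. The helper `split_nodes` is hypothetical, not an existing PyMAPDL function:

```python
def split_nodes(nodes, n_instances):
    """Divide the allocated nodes into n_instances chunks, as evenly as possible."""
    base, extra = divmod(len(nodes), n_instances)
    chunks, start = [], 0
    for i in range(n_instances):
        size = base + (1 if i < extra else 0)
        chunks.append(nodes[start:start + size])
        start += size
    return chunks

# 5 allocated nodes, 4 MAPDL instances: one instance gets 2 nodes, the rest get 1.
print(split_nodes(["node0", "node1", "node2", "node3", "node4"], 4))
# [['node0', 'node1'], ['node2'], ['node3'], ['node4']]
```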
|  | `launch_mapdl` | `MapdlPool` |
|---|---|---|
| Interactive | `srun` | `srun` |
| Batch | None | None (`share_nodes=True`) or `srun` (`share_nodes=False`) |
I think this should be converted into a documentation section, "How to run PyMAPDL on HPC"... It should also include how to run MAPDL using the mentioned tight integration.
Comment moved to a new issue: https://github.com/ansys/pymapdl/issues/3467
Disclaimer: this is a very quick and dirty draft. It is not final info. Do not trust this information blindly.
To run MAPDL across multiple nodes in an HPC cluster, there are two approaches.
In both cases, the env var `ANS_MULTIPLE_NODES` is needed. It does several things; for instance, it detects the InfiniBand network interface.

Default

You have to specify the names of the compute nodes (machines) and the number of CPUs used (these can be obtained from the SLURM env vars).

Example script
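A minimal sketch of how the node names and CPU counts could be pulled from the SLURM environment. The way this machine list is passed to MAPDL (here assumed to be a `machine:cpus` specification) is an assumption and should be checked against the MAPDL parallel processing documentation:

```python
import os
import subprocess

# Expand the compact SLURM node list (e.g. "node[0-5]") into individual hostnames.
nodes = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    text=True,
).split()

cpus_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1"))

# Build a "machine:cpus" list, e.g. "node0:8:node1:8:..." (assumed format).
machines = ":".join(f"{node}:{cpus_per_node}" for node in nodes)
print(machines)
```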
Tight integration

You can rely on MPI tight integration with the scheduler by setting the `HYDRA_BOOTSTRAP` env var equal to the scheduler (for instance, `HYDRA_BOOTSTRAP=slurm` for SLURM, as in the `srun` script above).

Notes