ansys / pymapdl

Pythonic interface to MAPDL
https://mapdl.docs.pyansys.com
MIT License

Document HPC tight integration #3397

Open germa89 opened 2 months ago

germa89 commented 2 months ago

Disclaimer: this is a very quick and dirty draft. It is not final information. Do not trust it blindly.

To run MAPDL across multiple nodes in an HPC cluster, there are two approaches.

In both cases, the env var ANS_MULTIPLE_NODES is needed. It does several things; for instance, it detects the InfiniBand network interface.

Default

You have to specify the names of the compute nodes (machines) and the number of CPUs used on each (these can be obtained from the SLURM env vars; see the sketch after the example script).

Example script

#!/bin/bash
#SBATCH --job-name=ansys_job2            # Job name
#SBATCH --partition=qsmall              # Specify the queue/partition name                  
#SBATCH --nodes=6                       # Number of nodes
#SBATCH --ntasks-per-node=8             # Number of tasks (cores) per node
#SBATCH --time=04:00:00                 # Set a time limit for the job (optional)
#SBATCH --output=ansys_job2.out          # Standard output file
#SBATCH --error=ansys_job2.err           # Standard error file

export ANS_CMD_NODIAG="TRUE"
export ANS_MULTIPLE_NODES=1

# Build the machines string in the "host1:np1:host2:np2:..." format from the SLURM allocation
MACHINES=$(srun hostname | sort | uniq -c | awk '{print $2 ":" $1}' | paste -s -d ":" -)
echo "Machines: $MACHINES"

# Run the Ansys process
echo "Starting Ansys MAPDL..."

/ansys_inc/v242/ansys/bin/ansys242 -dis -machines=$MACHINES -np $SLURM_NTASKS -b -i /home/ubuntu/input.inp
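
If you prefer to build the machines string from the SLURM env vars instead of calling srun hostname (for instance when driving the launch from Python), a minimal sketch could look like this. It assumes --ntasks-per-node was set so SLURM_NTASKS_PER_NODE is defined, and it uses scontrol show hostnames to expand the compact node list:

import os
import subprocess

# SLURM_JOB_NODELIST is compact (for example "node[0-5]"); expand it with scontrol
hosts = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    text=True,
).split()

ntasks_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

# Same "host1:np1:host2:np2:..." format as the bash version above
machines = ":".join(f"{host}:{ntasks_per_node}" for host in hosts)
print(machines)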

Tight integration

You can rely on the MPI tight integration with the scheduler by setting the HYDRA_BOOTSTRAP env var to the scheduler name.

#!/bin/bash
#SBATCH --job-name=ansys_job2            # Job name
#SBATCH --partition=qsmall              # Specify the queue/partition name                  
#SBATCH --nodes=6                       # Number of nodes
#SBATCH --ntasks-per-node=8             # Number of tasks (cores) per node
#SBATCH --time=04:00:00                 # Set a time limit for the job (optional)
#SBATCH --output=ansys_job2.out          # Standard output file
#SBATCH --error=ansys_job2.err           # Standard error file

export ANS_CMD_NODIAG="TRUE"  # Remove popups
export ANS_MULTIPLE_NODES=1  # Set multiple nodes

export HYDRA_BOOTSTRAP=slurm  # Activate tight coupling
export I_MPI_HYDRA_BOOTSTRAP=slurm  # Activate tight coupling on Intel MPI (optional if the above is set)

echo "Number of tasks: $SLURM_NTASKS"

# Run the Ansys process
echo "Starting Ansys MAPDL..."
/ansys_inc/v242/ansys/bin/ansys242 -dis -np $SLURM_NTASKS -b -i /home/ubuntu/input.inp

Notes

germa89 commented 2 months ago

Testing srun on login node

Running the following script (srun_ansys.sh):

export ANS_MULTIPLE_NODES=1
export HYDRA_BOOTSTRAP=slurm

echo "Starting..."

srun --export=all --job-name=srun_ansys --partition=qsmall --nodes=6 --ntasks-per-node=8 bash -c "hostname && /ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 48 -b -i /home/ubuntu/input.inp " 

as:

./srun_ansys.sh 

on a login node, starts MAPDL on each node INDIVIDUALLY (one independent instance per node).

So MAPDL should never be executed with srun if we want to have only one instance spawning across multiple nodes. That makes sense.

If you want to spawn one single instance, you should do something like above, using sbatch. This is because MAPDL already uses MPI, so srun is not needed.

So it seems we will have to go with sbatch if we ever want to implement SLURM support.

germa89 commented 2 months ago

Terminology

SLURM support

Design

I envision something like:

Running interactively.

PyMAPDL is on the login node. There we start a Python console and write:

from ansys.mapdl.core import launch_mapdl

slurm_options = {
    "partition": "qsmall",
    "nodes": 6,
    "ntasks-per-node": 8
}
mapdl = launch_mapdl(slurm_options=slurm_options)

# if no `slurm_options` are used, we should specify something like:
# mapdl = launch_mapdl(ON_SLURM=True)
# So we do the switch. Because there is no way to know if this is a normal machine or a login node of an HPC cluster (check for sbatch process?? see the sketch below)

mapdl.prep7() ...
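
Regarding the open question in the comments above about detecting whether we are on a login node, here is a minimal sketch (the heuristic is my assumption, not existing PyMAPDL behaviour):

import os
import shutil

def looks_like_slurm_login_node() -> bool:
    """Hypothetical helper: guess whether we are on a SLURM login node."""
    # sbatch being on the PATH usually means a SLURM cluster;
    # SLURM_JOB_ID being set means we are already inside a SLURM job.
    has_sbatch = shutil.which("sbatch") is not None
    inside_job = "SLURM_JOB_ID" in os.environ
    return has_sbatch and not inside_job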

Under the hood, this launch_mapdl call will use an sbatch command to run something like:

# --partition=qsmall --job-name=sbatch_ansys --ntasks=16 --nodes=8 --ntasks-per-node=2 --export=all 
sbatch %slurm_options% --wrap "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 8 -b -i /home/ubuntu/input.inp" 

Proper handling of the options should be done, but each MAPDL instance will be an independent SLURM job, so it can use all the resources allocated for that job. That simplifies stuff a lot.

One complication could be how to obtain the IP of the remote instance. We can use the sbatch option --parsable so it returns only the job ID (and cluster, if any). With the job ID, we can query the list of nodes, and we can query the BatchHost hostname using:

scontrol show job $JOBID | grep BatchHost | cut -d'=' -f2

We can use BatchHost as the ip argument of launch_mapdl.
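
Putting it together, a minimal sketch of that flow could be the following. It assumes the job is already running when BatchHost is queried, that the default gRPC port (50052) is used, and that launch_mapdl(start_instance=False, ip=...) is used to connect to the already-running instance:

import subprocess

from ansys.mapdl.core import launch_mapdl

cmd = "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 8 -b -i /home/ubuntu/input.inp"

# --parsable makes sbatch print only "jobid[;cluster]"
job_id = subprocess.check_output(
    ["sbatch", "--parsable", "--partition=qsmall", "--wrap", cmd],
    text=True,
).strip().split(";")[0]

# Query the node running the batch script (wait for the job to leave PENDING first)
batch_host = subprocess.check_output(
    f"scontrol show job {job_id} | grep BatchHost | cut -d'=' -f2",
    shell=True,
    text=True,
).strip()

# Connect to the remote instance instead of launching a local one
mapdl = launch_mapdl(start_instance=False, ip=batch_host, port=50052)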

MapdlPool case

For the MAPDL pool case, there should be no problem using the same approach as with launch_mapdl, since each MAPDL instance will have its own SLURM job. We just need to make sure that the SLURM options do not represent the aggregated values, but the values for each single job. For instance:

from ansys.mapdl.core import MapdlPool

slurm_options = {
    "partition": "qsmall",
    "nodes": 6,
    "ntasks-per-node": 8
}
pool = MapdlPool(8, slurm_options=slurm_options)
# Spawn 8 MAPDL instances with 48 cores each (6 nodes each and 8 tasks per node)

I think making the splits ourselves would just complicate things.

Running batch mode

On the login node we will do something like:

sbatch my_script.sh

where my_script.sh contains:

#!/bin/bash
# Define SLURM options here (#SBATCH directives)

source /path/to/my/venv/bin/activate
python my_python_script.py

In this case, under the hood we will make a normal ansysXXX call with extra environment variables that activate the multiple-node support (see this issue body). In theory, the IP remains unchanged (local, 127.0.0.1).

This case seems simple. The code is managed as a non-HPC local launch; only the env vars need to be in place.
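
As a sketch, the Python script submitted through sbatch could look like this (assuming the env vars are set either in the sbatch script or in Python itself, and that the number of cores is taken from SLURM_NTASKS):

import os

from ansys.mapdl.core import launch_mapdl

# These could also be exported in the sbatch script instead
os.environ["ANS_CMD_NODIAG"] = "TRUE"      # remove popups
os.environ["ANS_MULTIPLE_NODES"] = "1"     # enable the multi-node run
os.environ["HYDRA_BOOTSTRAP"] = "slurm"    # tight integration with SLURM

# Hand all the SLURM-allocated tasks to a single local MAPDL instance
nproc = int(os.environ.get("SLURM_NTASKS", "1"))
mapdl = launch_mapdl(nproc=nproc)  # the IP stays local (127.0.0.1)

mapdl.prep7()
# ... rest of the analysis ...
mapdl.exit()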

MapdlPool case

For this case, things get a bit trickier.

We have two options: let all the MAPDL instances share all the allocated nodes, or pin each MAPDL instance to its own subset of nodes (for instance through srun).

The first option is the simplest to implement, but I expect its performance to not be as good as the second.

I guess from the user perspective, we should keep things simple.

I can imagine a boolean option like share_nodes: if True, it uses the first method; when False, it uses the second.

from ansys.mapdl.core import MapdlPool

slurm_options = {
    "partition": "qsmall",
    "nodes": 16,
    "ntasks-per-node": 1
}

# All MAPDL instances spawn across all nodes. 
# using MAPDL -nproc 16
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=True)

# Each MAPDL instance is pinned to its own subset of nodes,
# using MAPDL -nproc 16 and:
# On first MAPDL instance: srun --nodelist=node[0-1]
# On second MAPDL instance: srun --nodelist=node[2-3]
# and so on...
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=False)

But what should happen when the resources cannot be evenly split? For instance, the job has 5 nodes allocated, but we are requesting only 4 MAPDL instances.

I guess we should fall back to mode 1 (share_nodes=True) if no mode is specified (share_nodes=None)? It will depend on how much of a penalty there is between the two schemes. Anyway, a warning message is always appreciated.
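
For reference, a hypothetical helper (split_nodes is made up for illustration) would split the nodes as evenly as possible; with 5 nodes and 4 instances it gives groups of 2, 1, 1 and 1 nodes:

def split_nodes(nodes, n_instances):
    """Split the allocated nodes among the instances as evenly as possible."""
    base, extra = divmod(len(nodes), n_instances)
    groups, start = [], 0
    for i in range(n_instances):
        size = base + (1 if i < extra else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups

print(split_nodes(["node0", "node1", "node2", "node3", "node4"], 4))
# [['node0', 'node1'], ['node2'], ['node3'], ['node4']]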

Feedback is appreciated here too.

Summary

              launch_mapdl    MapdlPool
Interactive   srun            srun
Batch         None            None (share_nodes=True)
                              srun (share_nodes=False)
germa89 commented 2 months ago

I think this should be converted to a documentation section "How to run PyMAPDL on HPC"... It should also include how to run MAPDL using the mentioned tight integration.

germa89 commented 1 month ago

Comment moved to a new issue: https://github.com/ansys/pymapdl/issues/3467