Open · moravveji opened this issue 12 months ago
Hi @moravveji, BMTK itself does not directly call `srun` or `mpirun`. It uses the standard `mpi4py` library, which relies on your locally installed version of OpenMPI. We've run large BMTK simulations using both Moab/Torque and Slurm, although how to actually execute them will differ for each cluster.
One thing to try is to create a Python script and run it directly from the prompt using `mpirun` (or `mpiexec`):
$ mpirun -np 16 python my_bmtk_script.py
Unfortunately, whatever you do will no longer be interactive, and I don't think you can start up a shell using `mpirun` (or at least I've never seen it done before). If you're using Moab, I think you can use the `qsub -I` option to get an interactive shell, but I haven't tried it myself.
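A minimal smoke test for that `mpirun` invocation might look like the sketch below; `check_mpi.py` is a hypothetical file name, and the script assumes `mpi4py` is installed (falling back to a serial message when it is not):

```shell
# Write a tiny rank-reporting script, then launch it. With a working MPI
# stack, `mpirun -np 4` should print four distinct ranks; if every process
# prints "rank 0 of 1", the launcher and the MPI library are not talking
# to each other.
cat > check_mpi.py <<'EOF'
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()}")
except ImportError:
    print("rank 0 of 1 (mpi4py not installed)")
EOF

python3 check_mpi.py          # serial sanity check: "rank 0 of 1"
if command -v mpirun >/dev/null 2>&1; then
    mpirun -np 4 python3 check_mpi.py
fi
```

If the parallel run prints four distinct ranks, the MPI plumbing is fine and any remaining failure is on the BMTK/launcher side.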
Another option to try is using/compiling a different version of OpenMPI. If you have access to Anaconda, it might be worth creating a test environment and installing OpenMPI/MPICH2. I believe that when it installs it will try to find the appropriate workload-manager options on the system, and if there is a Slurm manager on your HPC, it will install with PMI support. Although in my experience it doesn't always work, especially if Slurm is installed in a non-standard way.
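For instance, a throwaway test environment could be sketched like this (the env name is made up, and the channel/package names assume conda-forge; adjust for your site):

```shell
# Sketch: disposable conda env that brings its own OpenMPI and mpi4py.
# Whether the resulting OpenMPI actually has PMI support should then be
# verified with `ompi_info` rather than assumed.
{
  if command -v conda >/dev/null 2>&1; then
    conda create -y -n mpi-test -c conda-forge openmpi mpi4py || true
    echo "attempted to create env mpi-test; verify with: ompi_info | grep -i pmi"
  else
    echo "conda not found: load your site's anaconda module first"
  fi
} 2>&1 | tee mpi-test-setup.log
```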
Thanks @kaeldai for your comments. I can already share a few thoughts based on our recent trial-and-error tests:

- `mpi4py` from the Intel channel does correctly pick up Slurm. However, the dependency requirements for the other tools distributed in the `bmtk` environment could not be fully satisfied, because not all of the necessary tools were consistently available from Intel's (ana)conda channel; hence, that was a no-go for us.
- We could import the `bmtk.analyzer.compartment` package via a batch job (i.e. using `sbatch`). This time, the OpenMPI runtime properly spawns processes, and the error above does not appear anymore. The reason for this behavior is that our build of Slurm does support PMI-2; however, our OpenMPI was not configured to make use of PMI support. As a result, interactive jobs/tasks launched via `srun` fail with the error message above.

So, the take-home message is to avoid using `bmtk` in an interactive session (when OpenMPI is not compiled with PMI-2/PMIx support).
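In case it helps others, the working batch-job route can be sketched as a minimal `sbatch` script (all directive values are placeholders, and `my_bmtk_script.py` is the hypothetical script name from earlier in the thread):

```shell
#!/bin/bash
#SBATCH --job-name=bmtk-test
#SBATCH --ntasks=16
#SBATCH --time=01:00:00

# Launch through mpirun from the OpenMPI module instead of srun, so the
# missing PMI support in this OpenMPI build never comes into play.
mpirun -np "${SLURM_NTASKS}" python my_bmtk_script.py
```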
I have `pip`-installed BMTK version 1.0.8 on our HPC cluster, which runs the Rocky Linux 8 OS on Intel Ice Lake CPUs. When I start an interactive job with 16 tasks, I fail to import the `bmtk.analyzer.compartment` package.

I have built `BMTK/1.0.8-foss-2022b` (and all its dependencies) against the `OpenMPI/4.1.4-GCC-12.2.0` module. However, this specific OpenMPI module is not built with Slurm support. That's why parallel applications launched using `srun` spit out the OPAL error message above.

I would like to ask whether there exists an environment variable to choose how the tasks are launched, so that I can use `mpirun` directly instead of `srun`.
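As a starting point for debugging this kind of mismatch, the available launch plumbing can be inspected with standard tools: `ompi_info` (which ships with OpenMPI) reports which PMI/Slurm components a build contains, and `srun --mpi=list` reports which PMI flavours Slurm itself can offer:

```shell
# Record which launch components this node actually offers.
{
  if command -v ompi_info >/dev/null 2>&1; then
    # PMI/PMIx and Slurm MCA components compiled into this OpenMPI build
    ompi_info | grep -i -e pmi -e slurm || echo "no PMI or Slurm components found"
  else
    echo "ompi_info not available on this node"
  fi
  if command -v srun >/dev/null 2>&1; then
    srun --mpi=list                       # PMI flavours srun can provide
  else
    echo "srun not available on this node"
  fi
} 2>&1 | tee launch-diagnostics.log
```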