For a normal MPI injection test:
srun -n 2 podman-hpc run -it --mpi --gpu cudaq-mpich:test python3 -m mpi4py.bench helloworld
Error: no container with name or ID "uid-105687-pid-1265444" found: no such container
If you are running on Perlmutter, the --gpu and --mpi flags do not work together. The MPI library that's loaded is not compiled with CUDA. To get CUDA-aware MPI, you should use the --cuda-mpi flag.
These options are all specific to the site modules installed at NERSC.
salloc -q interactive -C gpu -t 4:00:00 -A nintern -N 2
srun -n 2 podman-hpc run -it --cuda-mpi --gpu cudaq-mpich:test2 python3 -m mpi4py.bench helloworld
srun: Job 28628103 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=28628103.2
Error: can only create exec sessions on running containers: container state improper
Error: can only create exec sessions on running containers: container state improper
srun -l -n 2 podman-hpc run -i cudaq-mpich:test2 ls
0: =========================
0: NVIDIA CUDA-Q
0: =========================
0:
0: Version: latest
0:
0: Copyright (c) 2024 NVIDIA Corporation & Affiliates
0: All rights reserved.
0:
0: To run a command as administrator (user `root`), use `sudo <command>`.
0:
0: mpi_comm_impl.o
0: mpich-4.1.1.tar.gz
1: =========================
1: NVIDIA CUDA-Q
1: =========================
1:
1: Version: latest
1:
1: Copyright (c) 2024 NVIDIA Corporation & Affiliates
1: All rights reserved.
1:
1: To run a command as administrator (user `root`), use `sudo <command>`.
1:
1: mpi_comm_impl.o
1: mpich-4.1.1.tar.gz
srun -n 2 podman-hpc run --rm --mpi registry.nersc.gov/library/nersc/mpi4py:3.1.3 python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid200413.
Hello, World! I am process 1 of 2 on nid200421.
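For reference, python3 -m mpi4py.bench helloworld only reports each rank's index, the communicator size, and the host name. A minimal mpi4py equivalent (an illustrative sketch, not code taken from either image) looks like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank prints its index, the total number of ranks, and the node it runs on,
# matching the "Hello, World! I am process X of N on nidXXXXXX." lines above.
print("Hello, World! I am process %d of %d on %s."
      % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))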
That example uses the default setup and the NERSC registry. Do we have to upload our image to the public repo first?
Many thanks
Hi,
Since your question is NERSC-specific, please follow up in the support ticket that you already have open with NERSC.
Thank you!
Hi team,
These are our steps:
If we remove the MPI injection, we can get into the container correctly.
We do need to use CUDA-aware MPI for our program (a quick check is sketched after this message).
A couple of guesses:
Here, we also have an automation script that makes the image public to everyone at NERSC.
This still does not work.
Expected result:
The helloworld benchmark is distributed across the two nodes.
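A quick way to confirm that the injected MPI is actually CUDA-aware is to pass a GPU buffer directly between two ranks. The sketch below is only an illustration: it assumes CuPy is available inside the image (nothing in this thread shows that), and it would be launched with the same srun -n 2 podman-hpc run --cuda-mpi --gpu invocation used above. With a CUDA-aware MPICH the transfer succeeds; with a host MPI that is not CUDA-aware it typically fails or crashes.

from mpi4py import MPI
import cupy as cp  # assumption: CuPy is installed in the container image

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# mpi4py passes the device pointer straight to MPI via __cuda_array_interface__,
# so this Send/Recv only works when the underlying MPI build is CUDA-aware.
buf = cp.arange(8, dtype=cp.float64) if rank == 0 else cp.empty(8, dtype=cp.float64)
cp.cuda.runtime.deviceSynchronize()  # make sure the GPU buffer is ready before MPI reads it
if rank == 0:
    comm.Send(buf, dest=1, tag=11)
elif rank == 1:
    comm.Recv(buf, source=0, tag=11)
    print("rank 1 received GPU data:", buf)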