NERSC / podman-hpc

This adds logic to pass through the file descriptors when using openmpi #95

Closed scanon closed 8 months ago

scanon commented 8 months ago

PMI relies on passing open file descriptors through to the application. Podman supports this, but some extra steps are needed to make it work.
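
For readers unfamiliar with the mechanism, here is a minimal sketch of one way the pass-through could work. It is not the code in this PR. It assumes the launcher exports the descriptor number in `PMI_FD` (as the test output in this thread shows) and that the container is started with Podman's `--preserve-fds` option, which passes extra descriptors starting at fd 3. The helper name `remap_pmi_fd` and the `target_fd` parameter are hypothetical.

```python
import os


def remap_pmi_fd(env, target_fd=3):
    """Duplicate the PMI descriptor onto a fixed, inheritable fd number.

    Hypothetical helper, not podman-hpc's actual implementation.
    Returns (new_env, extra_args): a copy of env with PMI_FD rewritten,
    and the extra arguments to add to the `podman run` command line.
    """
    if "PMI_FD" not in env:
        # Nothing to pass through; leave the environment unchanged.
        return dict(env), []

    src_fd = int(env["PMI_FD"])
    if src_fd != target_fd:
        # dup2 silently replaces whatever was open at target_fd.
        os.dup2(src_fd, target_fd, inheritable=True)
    else:
        # dup2 would be a no-op here, so set the flag explicitly.
        os.set_inheritable(target_fd, True)

    new_env = dict(env)
    new_env["PMI_FD"] = str(target_fd)

    # --preserve-fds=N passes N descriptors beyond stdin/stdout/stderr,
    # i.e. fds 3 .. 3+N-1, so reaching target_fd needs N = target_fd - 2.
    extra_args = [f"--preserve-fds={target_fd - 2}"]
    return new_env, extra_args
```

With the default `target_fd=3` this yields `--preserve-fds=1`, and every rank sees the same `PMI_FD` value no matter which number the launcher chose on each node. The container still has to receive the rewritten `PMI_FD`, not the original one.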

lastephey commented 8 months ago

Tested on muller with an openmpi helper module. I'll open a separate MR for the helper module.

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001005.
Hello, World! I am process 1 of 2 on nid001005.
lastephey commented 8 months ago

Found out this was wrong; see the next comment.

Update: this seems OK on 1 node but fails on 2 nodes. In my test it's because the file descriptor number differs between ranks: PMI_FD is 3 for rank 0 but 12 for rank 1.

stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi ./print.sh
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=0
PMI_JOBID=488177.2
contents of /proc/self/fd
0
1
2
255
3
PMI_FD:
12
PMI_SIZE=2
PMI_FD=12
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=1
PMI_JOBID=488177.2
contents of /proc/self/fd
0
1
2
255
stephey@nid001003:/mscratch/sd/s/stephey/openmpi> 
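
The contents of print.sh are not shown in this thread. As a rough stand-in, a check like the following could report the same symptom from inside the container: whether the descriptor named by `PMI_FD` is actually open in the current process. The function names are hypothetical, and `os.fstat` is used rather than listing /proc/self/fd so that the check does not open a descriptor of its own.

```python
import os


def fd_is_open(fd):
    """Return True if fd refers to an open descriptor in this process."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def pmi_fd_report(env):
    """Collect the PMI_* variables and check that PMI_FD is usable.

    Hypothetical diagnostic helper; env is a mapping such as os.environ.
    """
    pmi_vars = {k: env[k] for k in sorted(env) if k.startswith("PMI_")}
    raw_fd = env.get("PMI_FD", "")
    is_open = raw_fd.isdigit() and fd_is_open(int(raw_fd))
    return {"pmi_vars": pmi_vars, "pmi_fd_is_open": is_open}


if __name__ == "__main__":
    report = pmi_fd_report(os.environ)
    for name, value in report["pmi_vars"].items():
        print(f"{name}={value}")
    print("PMI_FD open:", report["pmi_fd_is_open"])
```

In the failing run above, rank 1 has PMI_FD=12 while 12 is absent from its /proc/self/fd listing, which is the case this check would flag.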
lastephey commented 8 months ago

False alarm: I hadn't updated podman_hpc.py on the second node of my reservation. Sorry about that.

lastephey commented 8 months ago

stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi ./print.sh
nid001005
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=1
PMI_JOBID=488177.14
contents of /proc/self/fd
0
1
2
255
3
nid001003
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=0
PMI_JOBID=488177.14
contents of /proc/self/fd
0
1
2
255
3
stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001005.
stephey@nid001003:/mscratch/sd/s/stephey/openmpi>