I noticed that the podman backend does not always pass the correct number of file descriptors to the runtime. This causes the MPI PMI linkup to fail in cases where it relies on a higher-numbered file descriptor.
I added an ls -la /proc/self/fd command to my test job launch and observed that the passed file descriptors are cut off. This appears to happen because the value given to podman's --preserve-fds argument is calculated before the filler fds are created, so podman ultimately sees a lower value for that argument than it should.
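To illustrate the ordering problem, here is a minimal Python sketch. It is not the backend's actual code: the function names and the set-based model of open descriptors are hypothetical and chosen only for illustration. The one thing taken from podman itself is that --preserve-fds=N passes N descriptors in addition to 0, 1 and 2, so descriptors 3 through N+2 reach the container.

```python
# Hypothetical model of the bug: file descriptors are plain integers in a set.
# None of these names come from the real backend code.

def preserve_fds_count(open_fds):
    """Value for --preserve-fds: how many fds above stderr (2) must be passed
    so that the highest open fd is still included."""
    highest = max(open_fds, default=2)
    return max(highest - 2, 0)

def add_filler_fds(open_fds, required_fds):
    """Return the fd set after occupying every number from 3 up to the highest
    required fd, so the range handed to podman is contiguous."""
    top = max(required_fds, default=2)
    return set(open_fds) | set(range(3, top + 1)) | set(required_fds)

def count_before_fillers(open_fds, required_fds):
    """Problematic order: the count is taken before the fillers exist."""
    count = preserve_fds_count(open_fds)
    all_fds = add_filler_fds(open_fds, required_fds)
    return count, all_fds

def count_after_fillers(open_fds, required_fds):
    """Corrected order: create the fillers first, then take the count."""
    all_fds = add_filler_fds(open_fds, required_fds)
    count = preserve_fds_count(all_fds)
    return count, all_fds

if __name__ == "__main__":
    already_open = {0, 1, 2, 3}
    needed_by_pmi = {9}  # assumed example: PMI expects fd 9 inside the container
    print(count_before_fillers(already_open, needed_by_pmi)[0])
    print(count_after_fillers(already_open, needed_by_pmi)[0])
```

In this model, counting before the fillers exist yields a value that only covers fd 3, so fd 9 would never reach the container, while counting afterwards covers fds 3 through 9.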
This PR contains the patch I applied locally to work around this issue while keeping the debug log output unchanged.
Original:
Patched: