E4S-Project / e4s-cl

Container manager for E4S
https://e4s-cl.readthedocs.io
MIT License

Podman: Fix calculation of preserve-fds #111

Closed egreen77 closed 1 year ago

egreen77 commented 1 year ago

Hi,

I noticed that the podman backend doesn't always pass the correct number of file descriptors to the runtime. This causes the MPI PMI link-up to fail in cases where it relies on a higher-numbered file descriptor.

I added an ls -la /proc/self/fd command to my test job launch and observed that the passed file descriptors are cut off. This appears to be because the value given to podman's --preserve-fds argument is calculated before the filler fds are created, so podman ultimately sees a lower value for that argument than it should.

This PR contains the patch I applied locally to work around this issue while keeping the debug log output unchanged.
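
For readers unfamiliar with the flag, here is a rough sketch of the arithmetic behind the problem. It is not the e4s-cl code. It assumes that podman's --preserve-fds=N passes the N descriptors numbered 3 through N+2 into the container, so any gap below the highest wanted descriptor has to be plugged with a filler and counted. The function names are hypothetical.

```python
def filler_fds(wanted):
    """Descriptors that must be opened as fillers (e.g. on /dev/null) so the
    range 3..max(wanted) is contiguous. Hypothetical helper, not e4s-cl's."""
    if not wanted:
        return []
    return [fd for fd in range(3, max(wanted)) if fd not in wanted]


def preserve_fds_value(wanted):
    """Value for --preserve-fds, assuming it covers fds 3..N+2, so that the
    highest wanted descriptor is still passed through."""
    if not wanted:
        return 0
    return max(wanted) - 2


wanted = {9, 10, 15}           # the descriptors from the debug log
fillers = filler_fds(wanted)   # gaps that need a placeholder descriptor

print(len(wanted))                 # counting only the wanted fds gives 3
print(fillers)                     # the 10 fillers listed in the debug log
print(preserve_fds_value(wanted))  # wanted plus fillers: 13
```

With the values from the logs, counting before the fillers exist gives 3, which matches the original listing where only descriptors 3, 4 and 5 survive. Counting afterwards gives 13, which matches the patched listing where descriptors 3 through 15 are present.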

Original:

[Debug e4s_cl.cf.containers.podman:120] Passing 3 file descriptors: ({9, 10, 15})
[Debug e4s_cl.cf.containers.podman:89] Created 10 file descriptors: [3, 4, 5, 6, 7, 8, 11, 12, 13, 14]
total 0
dr-x------ 2 root root  0 Apr  4 16:11 .
dr-xr-xr-x 9 root root  0 Apr  4 16:11 ..
lrwx------ 1 root root 64 Apr  4 16:11 0 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:11 1 -> pipe:[5713802]
l-wx------ 1 root root 64 Apr  4 16:11 2 -> pipe:[5713803]
l-wx------ 1 root root 64 Apr  4 16:11 3 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:11 4 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:11 5 -> /dev/null
lr-x------ 1 root root 64 Apr  4 16:11 6 -> /proc/299239/fd

[cli_0]: write_line error; fd=13 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=13 buf=:cmd=get_appnum

Patched:

[Debug e4s_cl.cf.containers.podman:90] Passing 3 file descriptors: ({9, 10, 15})
[Debug e4s_cl.cf.containers.podman:91] Created 10 file descriptors: [3, 4, 5, 6, 7, 8, 11, 12, 13, 14]
total 0
dr-x------ 2 root root  0 Apr  4 16:13 .
dr-xr-xr-x 9 root root  0 Apr  4 16:13 ..
lrwx------ 1 root root 64 Apr  4 16:13 0 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 1 -> pipe:[47492848]
lr-x------ 1 root root 64 Apr  4 16:13 10 -> pipe:[47470215]
l-wx------ 1 root root 64 Apr  4 16:13 11 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 12 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 13 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 14 -> /dev/null
lrwx------ 1 root root 64 Apr  4 16:13 15 -> socket:[47470222]
lr-x------ 1 root root 64 Apr  4 16:13 16 -> /proc/3529402/fd
l-wx------ 1 root root 64 Apr  4 16:13 2 -> pipe:[47492849]
l-wx------ 1 root root 64 Apr  4 16:13 3 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 4 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 5 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 6 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 7 -> /dev/null
l-wx------ 1 root root 64 Apr  4 16:13 8 -> /dev/null
lr-x------ 1 root root 64 Apr  4 16:13 9 -> pipe:[47470214]

spoutn1k commented 1 year ago

Hi, good catch! Thank you for the PR!