NERSC / podman-hpc


Interactive job: cannot restart service after SIGINT sent to process #55

Closed swelborn closed 1 year ago

swelborn commented 1 year ago

I wrote a thread about this in #podman slack channel, copying the discussion here:

Before I debugged it...

I have a bug happening in an interactive job. I will file an issue if you all think it is related to podman. Here is the rundown:

1. Start interactive job.
2. Run my start script (start_node.sh):

```bash
#!/bin/bash

srun -N 2 -n 2 podman-hpc run --rm -v $HOME:/mnt/ --network=host -it samwelborn:stempy-streaming /mnt/utility/run_node.sh
```

3. This runs run_node.sh inside:

```bash
#!/bin/bash

rm -rf /stempy
cp -rf /mnt/gits/stempy/ /stempy/
/mnt/gits/4dstem/build/zmq/node
```

4. This works perfectly well when I run it the first time. /mnt/gits/4dstem/build/zmq/node is a service that connects to NCEM and receives data.
5. When I ctrl+C to stop this, it stops normally.
6. When I try to run it again, I get this:

```
/global/homes/s/swelborn/utility/start_node.sh
srun: error: nid004402: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=7414439.2
srun: error: nid004403: task 1: Exited with exit code 1
```

Maybe it is just a non-clean exit from the node binary? Thing is, ps -u gives me this after I shut down the previous process:

```
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
swelborn 233308  0.0  0.0  66772 10100 pts/0    Ss   09:14   0:00 /bin/bash
swelborn 235313  0.0  0.0  78836  6352 pts/0    R+   09:26   0:00 ps -u
```

It works just fine if I exit the job and start another one up.

Debugging

@tylern4 pointed me to #!/bin/bash -x, and the program was indeed hanging on

```bash
srun -N 2 -n 2 podman-hpc run --rm -v /global/homes/s/swelborn:/mnt/ --network=host -it samwelborn:stempy-streaming /mnt/utility/run_node.sh
```
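For anyone reproducing this, a minimal sketch (assuming GNU coreutils timeout is available on the node, with an arbitrary 120-second limit) that bounds the hanging step so it fails with a visible exit status instead of blocking the interactive session:

```bash
#!/bin/bash -x
# Hypothetical debugging wrapper: bound the podman-hpc step with coreutils
# `timeout` so a hang surfaces as a non-zero exit instead of sitting forever.
timeout 120 srun -N 2 -n 2 podman-hpc run --rm -v $HOME:/mnt/ --network=host -it \
    samwelborn:stempy-streaming /mnt/utility/run_node.sh
echo "srun exit status: $?"
```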

After adding --log-level=debug to my run command, I come up with the following:

```
$ podman-hpc run --rm -v $HOME:/mnt/ --log-level=debug --network=host -it samwelborn:stempy-streaming /mnt/utility/run_node.sh
...

DEBU[0000] Allocated lock 1 for container b93086ffde61a300362490bf1d6cc4da0475bf421c59489947add1d2addf0e45
DEBU[0000] parsed reference into "[overlay@/tmp/99894_hpc/storage+/tmp/99894_hpc:additionalimagestore=/pscratch/sd/s/swelborn/storage,mount_program=/usr/bin/fuse-overlayfs-wrap,ignore_chown_errors=true]@fe1a71ac19d0e25bff3eebfdb06fb1455dd2f3cc41fb7e6f6a18aa07d27456e8"
DEBU[0000] exporting opaque data as blob "sha256:fe1a71ac19d0e25bff3eebfdb06fb1455dd2f3cc41fb7e6f6a18aa07d27456e8"
DEBU[0000] Failed to create container distracted_robinson(b93086ffde61a300362490bf1d6cc4da0475bf421c59489947add1d2addf0e45): creating read-write layer with ID "f80d05c1bfe08135b9e7097d34b5e91b052c5932802260798c70d29f208f03b4": Stat /pscratch/sd/s/swelborn/storage/overlay/3b5c51d2b64b4532d9cec8183ca531f64ce5923cf8edd45dcd2aa425dcf550e4/diff: transport endpoint is not connected
Error: creating container storage: creating read-write layer with ID "f80d05c1bfe08135b9e7097d34b5e91b052c5932802260798c70d29f208f03b4": Stat /pscratch/sd/s/swelborn/storage/overlay/3b5c51d2b64b4532d9cec8183ca531f64ce5923cf8edd45dcd2aa425dcf550e4/diff: transport endpoint is not connected
```
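The "transport endpoint is not connected" error points at a stale FUSE mount under the additional image store. As a sanity check, the failing Stat can be reproduced outside podman (a sketch, reusing the exact path from the log above):

```bash
# Hypothetical check: a dead fuse-overlayfs/squashfuse mount returns
# "Transport endpoint is not connected" (ENOTCONN) to plain stat as well.
stat /pscratch/sd/s/swelborn/storage/overlay/3b5c51d2b64b4532d9cec8183ca531f64ce5923cf8edd45dcd2aa425dcf550e4/diff \
    || echo "stale mount under the additional image store"
```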

Fix (?)

I was also seeing this error:

```
time="2023-04-13T12:10:07-07:00" level=error msg="invalid internal status, try resetting the pause process with \"/usr/bin/podman system migrate\": could not find any running process: no such process"
```

So I changed my run script to this:

```bash
#!/bin/bash -x

srun -N 2 -n 2 podman-hpc system migrate
sleep 1s
srun -N 2 -n 2 podman-hpc run --network=host --log-level=debug --rm -v $HOME:/mnt -it samwelborn:stempy-streaming /mnt/utility/run_node.sh
```

and it seems to stop the previous containers. My program then starts normally:

```
+ srun -N 2 -n 2 podman-hpc system migrate
stopped 6871fad7ce5bb197f420a2ecf281b2f1b8435d780ff0b6cd5129b64cba3c86ac
stopped 8184a8e76bd6413b39661387a1bedaea37f4a5a1412c0d89d498f99af6445656
...
# normal program output
```
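A possible refinement of the same workaround (an untested sketch, using only the commands above): run the migrate reset from an EXIT trap in start_node.sh, so the pause process gets reset when the run step is interrupted with ctrl+C, rather than only before the next start.

```bash
#!/bin/bash -x
# Hypothetical variant of start_node.sh: reset podman's pause process on exit
# so the next interactive run starts from a clean state after ctrl+C.
cleanup() {
    srun -N 2 -n 2 podman-hpc system migrate
}
trap cleanup EXIT

srun -N 2 -n 2 podman-hpc run --rm -v $HOME:/mnt/ --network=host -it \
    samwelborn:stempy-streaming /mnt/utility/run_node.sh
```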
lastephey commented 1 year ago

It's possible this is related to https://github.com/NERSC/podman-hpc/issues/54 and may be addressed by https://github.com/NERSC/podman-hpc/pull/62

lastephey commented 1 year ago

Should hopefully be fixed via https://github.com/NERSC/podman-hpc/pull/62