NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Entrypoint hanging #109

Closed slurmuser closed 1 year ago

slurmuser commented 1 year ago

Hello, we are currently having an issue whereby our container hangs forever when passing in an endpoint. I have seen similar reports in old logs but wondered how we could get around this. For reference, this is our Dockerfile with its entrypoint:

`ARG base=artifactory.xxxxxx.com/prod-docker/base-image:latest
FROM ${base}
RUN yum install -y \
    rh-python36 \
    rh-python36-python-pip \
    && yum clean all
ENTRYPOINT ["/usr/bin/scl", "enable", "rh-python36", "--"]
CMD ["python", "--version"]`

slurmuser commented 1 year ago

Since all it does is hang I'm not really sure how I can work around this issue.

flx42 commented 1 year ago

You mean entrypoint?

Entrypoints should be disabled by default (see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration). If you enabled them manually with execute_entrypoint=1, try removing it.

When it seems to hang, you can also look at user processes with ps or top to check what is currently executing. Some entrypoints can take a very long time to execute.

slurmuser commented 1 year ago

Apologies, I meant entrypoint, yes. Thank you for the response. We can load a container fine without --container-entrypoint, but passing --container-entrypoint causes a hang regardless of whether execute_entrypoint is set to 0 or 1. With regard to top/ps, I see the line:

bash /usr/bin/enroot import --output /run/pyxis/0/58463113.0.squashfs ......etc

but then nothing after a couple of mins. This seems to be in line with the time it takes to import without the --container-entrypoint flag so I'm guessing it is importing the container fine with the flag but for whatever reason can't execute the entrypoint code above.

The entrypoint above is just an example but being able to execute them within slurm is quite critical to our process so the fix of just blocking them isn't really ideal.

flx42 commented 1 year ago

There might be something else going on here: an entrypoint should not cause a hang at the enroot import phase. It would only delay the enroot start phase. If there is interference between the entrypoint and the import, that would be a bug.

Can you check whether the squashfs file /run/pyxis/0/<JOBID>.<STEPID>.squashfs exists and, if so, whether it is growing? Do you have other processes still running and downloading the image? Look for tar, curl, zstd or enroot processes.
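The two checks suggested above can be scripted. The following is a minimal sketch, not part of pyxis or enroot: the helper names `is_growing` and `list_import_procs` are hypothetical, and it assumes a Linux node with `pgrep` from procps available. Because `pgrep -f` matches against the full command line, it may also list unrelated processes whose arguments happen to contain one of the patterns.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (0) if FILE grew during INTERVAL seconds,
# fails with 1 if its size did not increase, and with 2 if it does not exist.
is_growing() {
    file=$1
    interval=${2:-5}
    [ -f "$file" ] || return 2
    before=$(wc -c < "$file")
    sleep "$interval"
    after=$(wc -c < "$file")
    [ "$after" -gt "$before" ]
}

# Hypothetical helper: list the current user's processes whose command line
# mentions one of the tools named in the thread (tar, curl, zstd, enroot).
list_import_procs() {
    pgrep -u "$(id -u)" -af 'tar|curl|zstd|enroot'
}

# Example usage on the compute node (fill in the real job and step IDs):
#   is_growing "/run/pyxis/0/<JOBID>.<STEPID>.squashfs" 10 && echo "still importing"
#   list_import_procs
```

A file that exists but is no longer growing, combined with an empty process list, suggests the import phase has ended and the time is being spent elsewhere.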

slurmuser commented 1 year ago

Hi, thank you for the response. Looking at ps, enroot runs the command (bash /usr/bin/enroot import --output) for a second and then stops. A grep of ls returns empty for all required tar/curl/zstd/enroot.

Both our /run/pyxis directories for 0 and the job id are empty:

[root ]# cd /run/pyxis/
[root pyxis]# ls
0 3065989
[root pyxis]# ls -a 0/
. ..
[root pyxis]# ls -a 3065989/
. ..

flx42 commented 1 year ago

> A grep of ls returns empty for all required tar/curl/zstd/enroot.

If you don't see enroot import either, then you might be executing the entrypoint.

What is top showing? Do you see CPU utilization for /usr/bin/scl or another process from the user? Some entrypoints take a very long time, but that is not due to pyxis. You might also need to tweak your enroot.conf if the container rootfs is stored on a slow filesystem (e.g. NFS) and the entrypoint is crawling the entire container filesystem.

slurmuser commented 1 year ago

Hi, yes, it does appear to be stuck on the entrypoint:

[ ~]$ ps -aux |grep -i scl
9862 0.0 0.0 4356 580 ? S 12:20 0:00 /usr/bin/scl enable rh-python36 -- sh -c kill -STOP $$ ; exit 0

However, when this runs with podman it takes a matter of tens of seconds, whereas with pyxis I have had it hang for 30+ minutes before cancelling. With regard to the rootfs, it was migrated from a shared fs to local storage a while ago.

flx42 commented 1 year ago

I think I see what's going on here. Can you try using the --exec flag of scl? By default it will use the system function instead. Pyxis expects entrypoints to exec into the subcommand to execute, and most entrypoints do.

It should look like this (I could not test it):

ENTRYPOINT ["/usr/bin/scl", "enable", "rh-python36", "--exec", "--"]

If your scl binary is less than 9 years old (https://github.com/sclorg/scl-utils/commit/6a375fdff802e9bea41d1704bc91e00579ca8927) then this flag should be supported, but I think that excludes CentOS 7 for example.

If --exec does not work, try the entrypoint approach mentioned in https://austindewey.com/2019/03/26/enabling-software-collections-binaries-on-a-docker-image/

#!/bin/bash
source scl_source enable rh-python35
exec "$@"

This script is doing an exec so it should work with pyxis.
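The difference between a wrapper that execs into its command and one that runs the command as a child (as the system function does) can be seen with a small shell sketch. This is only an illustration using plain `sh`, not scl or pyxis; the function names `run_as_child` and `run_with_exec` are hypothetical. Each function prints the wrapper's PID followed by the PID of the command it launches.

```shell
#!/bin/sh
# Wrapper runs the command as a child process: the command gets a new PID.
# The trailing "true" keeps the shell from replacing itself with its last command.
run_as_child() {
    sh -c 'echo "$$"; sh -c "echo \$\$"; true' | tr '\n' ' '
}

# Wrapper execs into the command: the command takes over the wrapper's PID.
run_with_exec() {
    sh -c 'echo "$$"; exec sh -c "echo \$\$"' | tr '\n' ' '
}

echo "child: $(run_as_child)"   # two different PIDs
echo "exec:  $(run_with_exec)"  # the same PID twice
```

With exec, whatever launched the wrapper ends up tracking the command itself under the same PID, which appears to be what pyxis relies on. Without exec, the wrapper stays alive as the parent and the command is a separate process, which matches the `/usr/bin/scl enable rh-python36 -- sh -c ...` line seen in the ps output earlier in the thread.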

flx42 commented 1 year ago

Hopefully the approach above is working for you, closing this issue.