Closed slurmuser closed 1 year ago
Since all it does is hang I'm not really sure how I can work around this issue.
You mean `entrypoint`?
Entrypoints should be disabled by default: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
If you enabled them manually with `execute_entrypoint=1`, try removing it.
When it seems to hang, you can also look at the user's processes with `ps` or `top` to check what is currently executing. Some entrypoints can take a very long time to execute.
Apologies, I meant entrypoint, yes. Thank you for the response. We can load a container fine without --container-entrypoint, but passing --container-entrypoint, with execute_entrypoint set to either 0 or 1, causes a hang. With regard to top/ps, I see the line:
bash /usr/bin/enroot import --output /run/pyxis/0/58463113.0.squashfs ......etc
but then nothing after a couple of mins. This seems to be in line with the time it takes to import without the --container-entrypoint flag so I'm guessing it is importing the container fine with the flag but for whatever reason can't execute the entrypoint code above.
The entrypoint above is just an example but being able to execute them within slurm is quite critical to our process so the fix of just blocking them isn't really ideal.
There might be something else going on here: an entrypoint should not cause a hang at the `enroot import` phase. It would delay the `enroot start` phase only. If there is interference between the entrypoint and the `import`, it would be a bug.
Can you check if the squashfs file `/run/pyxis/0/<JOBID>.<STEPID>.squashfs` exists and, if so, whether it is growing? Do you have other processes running that are still downloading the image? Look for `tar`, `curl`, `zstd` or `enroot` processes.
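For anyone following along, the two checks above could be scripted roughly as follows. This is only a sketch: the helper names `file_size`, `is_growing` and `import_procs` are made up for this example and are not part of pyxis or enroot, and the 5-second default interval is an arbitrary choice.

```shell
# Print the size of a file in bytes (wc -c works on both GNU and BSD userlands).
file_size() { wc -c < "$1" | tr -d ' '; }

# Exit 0 if the file grew during the interval (seconds, default 5),
# exit 1 if it did not, exit 2 if the file does not exist.
is_growing() {
    file=$1
    interval=${2:-5}
    [ -f "$file" ] || return 2
    before=$(file_size "$file")
    sleep "$interval"
    after=$(file_size "$file")
    [ "$after" -gt "$before" ]
}

# List a user's processes whose command line mentions tar, curl, zstd or enroot.
# The bracketed first letters keep grep from matching its own command line.
import_procs() {
    ps -u "${1:-$(id -un)}" -o pid,etime,args | grep -E '[t]ar|[c]url|[z]std|[e]nroot'
}
```

On the compute node, `is_growing` would be pointed at the squashfs path mentioned above, and `import_procs` would be given the name of the user who submitted the job. An exit status of 2 from `is_growing` means the file was never created.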
Hi thank you for the response. So enroot runs the command for a second then stops looking at ps (the bash /usr/bin/enroot import --output command). A grep of ls returns empty for all required tar/curl/zstd/enroot.
Both our /run/pyxis directories for 0 and the job id are empty:
[root ]# cd /run/pyxis/
[root pyxis]# ls
0  3065989
[root pyxis]# ls -a 0/
.  ..
[root pyxis]# ls -a 3065989/
.  ..
> A grep of ls returns empty for all required tar/curl/zstd/enroot.
If you don't see `enroot import` either, then you might be executing the entrypoint.
What is `top` showing? Do you see CPU utilization for `/usr/bin/scl` or another process from the user? Some entrypoints take a very long time, but that's not due to pyxis. It could also be that you need to tweak your `enroot.conf` if the container rootfs is stored on a slow filesystem (e.g. NFS) and the entrypoint is crawling the whole container filesystem.
Hi so yes it does appear to be stuck on the entrypoint:
[ ~]$ ps -aux |grep -i scl
9862 0.0 0.0 4356 580 ? S 12:20 0:00 /usr/bin/scl enable rh-python36 -- sh -c kill -STOP $$ ; exit 0
However, when this runs with podman it takes a matter of tens of seconds, whereas with pyxis I have had it hang for 30+ minutes before cancelling. With regard to the root fs, it was migrated from a shared fs to local storage a while ago.
I think I see what's going on here. Can you try using the `--exec` flag of `scl`? By default it will use the `system` function instead, which runs the command as a child process. Pyxis expects entrypoints to `exec` into the subcommand to execute, and most entrypoints do.
It should look like this (I could not test it):
ENTRYPOINT ["/usr/bin/scl", "enable", "rh-python36", "--exec", "--"]
If your scl binary is less than 9 years old (https://github.com/sclorg/scl-utils/commit/6a375fdff802e9bea41d1704bc91e00579ca8927) then this flag should be supported, but I think that excludes CentOS 7 for example.
If `--exec` does not work, try the entrypoint approach mentioned in https://austindewey.com/2019/03/26/enabling-software-collections-binaries-on-a-docker-image/
#!/bin/bash
source scl_source enable rh-python35
exec "$@"
This script is doing an `exec`, so it should work with pyxis.
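To make the distinction concrete, here is a small shell sketch of a wrapper that `exec`s its command versus one that runs it as a child and waits, as `system` does. It is illustrative only and is not pyxis or scl code. The function names `pids_with_exec` and `pids_with_child` are made up for this example.

```shell
# Wrapper that execs: the inner command replaces the wrapper process,
# so both lines print the same PID.
pids_with_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Wrapper that runs the command as a child and waits for it (like system()):
# the inner command gets its own PID, and the wrapper stays alive as its parent.
pids_with_child() {
    sh -c 'echo $$; sh -c "echo \$\$"; exit 0'
}

echo "exec wrapper:";  pids_with_exec
echo "child wrapper:"; pids_with_child
```

In the second case, a signal such as the `kill -STOP $$` visible in the `ps` output above only affects the child shell. The wrapper that was started as the entrypoint keeps waiting on it, which would be consistent with the hang described in this thread.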
Hopefully the approach above is working for you, closing this issue.
Hello, we are currently having an issue whereby our container hangs forever when passing in an endpoint. I have seen similar reports in old issues but wondered how we could get around this. For reference, this is the entrypoint in our Dockerfile:
`ARG base=artifactory.xxxxxx.com/prod-docker/base-image:latest
FROM ${base}
RUN yum install -y \
    rh-python36 \
    rh-python36-python-pip \
    && yum clean all
ENTRYPOINT ["/usr/bin/scl", "enable", "rh-python36", "--"]
CMD ["python", "--version"]`