NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Hangs indefinitely when using nvcr image #40

Closed BlueCloudDev closed 3 years ago

BlueCloudDev commented 3 years ago

[opc@modest-cobra-bastion examples]$ srun --container-image=centos grep PRETTY /etc/os-release
pyxis: importing docker image ...
PRETTY_NAME="CentOS Linux 8"

[opc@modest-cobra-bastion examples]$ srun --container-image "nvcr.io/nvidia/pytorch:20.12-py3" grep PRETTY /etc/os-release
pyxis: importing docker image ...

The command sits indefinitely. I've left it for up to 20 minutes without any change. I can't find any network traffic on the node that would indicate that it is downloading, nor any CPU activity that would indicate other work related to the image. The .credentials file is set as:

[opc@inst-rmdks-modest-cobra tmp]$ cat ~/enroot/.credentials
# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password
machine authn.nvidia.com login $oauthtoken password

flx42 commented 3 years ago

Hi @BlueCloudDev,

It might not be a hang in pyxis itself; I agree the logging is confusing. It prints the start of the image import, but it doesn't print when the import is done and the container is started.

Can you check which enroot processes are running on this node when it happens? I'd like to see whether it's indeed stuck in enroot import, or perhaps in enroot start.

flx42 commented 3 years ago

The nvcr.io images also ship with an entrypoint that can take some time to execute (though it shouldn't cause a hang); if it's running, it should be visible when looking at the running processes. You can disable the entrypoint with an enroot hook:

$ cat /etc/enroot/mounts.d/90-entrypoint.fstab 
/etc/enroot/entrypoint /etc/rc.local none x-create=file,bind,ro,nosuid,nodev,noexec,nofail,silent

$ cat /etc/enroot/entrypoint 
if [ $# -gt 0 ]; then
    exec "$@"
else
    exec '/bin/bash'
fi
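
For reference, a minimal sketch of putting these two files in place on a compute node (assuming root access and the default /etc/enroot sysconf path):

$ sudo mkdir -p /etc/enroot/mounts.d
$ sudo tee /etc/enroot/entrypoint > /dev/null <<'EOF'
if [ $# -gt 0 ]; then
    exec "$@"
else
    exec '/bin/bash'
fi
EOF
$ sudo tee /etc/enroot/mounts.d/90-entrypoint.fstab > /dev/null <<'EOF'
/etc/enroot/entrypoint /etc/rc.local none x-create=file,bind,ro,nosuid,nodev,noexec,nofail,silent
EOF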

BlueCloudDev commented 3 years ago

I may need some hand-holding here since I haven't worked with pyxis/enroot before. What steps do I have to take in order to check which enroot processes are running on the node?

[opc@inst-rmdks-modest-cobra ~]$ cat /etc/enroot/mounts.d/90-entrypoint.fstab
cat: /etc/enroot/mounts.d/90-entrypoint.fstab: No such file or directory
[opc@inst-rmdks-modest-cobra ~]$ cat /etc/enroot/mounts.d/
10-system.fstab  20-config.fstab

[opc@inst-rmdks-modest-cobra ~]$ cat /etc/enroot/
enroot.conf  enroot.conf.original  environ.d/  hooks.d/  mounts.d/

I'm not seeing these folders/files.

BlueCloudDev commented 3 years ago

enroot list returns nothing on both the control and worker nodes.

flx42 commented 3 years ago

What steps do I have to take in order to check which enroot processes are running on the node?

Try ps ax | grep enroot, or check manually with htop. For example, when pyxis is importing the image:

$ ps ax | grep enroot
  36801 pts/2    S+     0:00 bash /usr/bin/enroot import --output /run/pyxis/10009/1249239.1.squashfs docker://nvcr.io#nvidia/pytorch:21.02-py3
  36826 pts/2    S+     0:00 bash /usr/bin/enroot import --output /run/pyxis/10009/1249239.1.squashfs docker://nvcr.io#nvidia/pytorch:21.02-py3
  38116 pts/2    R+     0:00 rm --one-file-system --preserve-root -rf /tmp/enroot.aYZIBfBB3y

I'm not seeing these folders/files

Yeah, I was saying you need to add these files if you want to disable entrypoints from the nvcr.io image. But let's take a look at the running processes first.

BlueCloudDev commented 3 years ago

 74612 ?        S      0:00 bash /usr/bin/enroot import --output /run/pyxis/1000/297.0.squashfs docker://nvcr.io/nvidia/pytorch:20.12-py3
 74626 ?        S      0:00 bash /usr/bin/enroot import --output /run/pyxis/1000/297.0.squashfs docker://nvcr.io/nvidia/pytorch:20.12-py3
 75216 ?        Sl     0:30 enroot-mksquashovlfs /tmp/enroot.d6X4vcQhZd/rootfs /run/pyxis/1000/297.0.squashfs -all-root -no-progress -processors 2 -comp lzo -noD
 76204 pts/2    S+     0:00 grep --color=auto enroot

Looks like it is indeed running. How long is this process expected to take? I've left it running for over an hour and the command doesn't progress past the step I highlighted. I've also noticed that it creates files in /tmp by default; does this process need to repeat every time a job is run?

flx42 commented 3 years ago

I see that it is using only 2 processors here: -processors 2. Perhaps it is running under a Slurm job allocation with only 2 cores, which would slow the import compared to running enroot import outside of pyxis. But even with only 2 cores being used, it should not take 1 hour.

I've noticed that it creates files in /tmp by default, does this process need to repeat every time a job is run?

This depends on your configuration in enroot.conf; here is what we have:

ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH /raid/enroot-cache/group-$(id -g)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)

Where /run and /tmp are tmpfs (we have enough RAM), and /raid is a RAID0 ext4 filesystem.
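
On the question of repeating the import for every job: with a persistent ENROOT_CACHE_PATH, enroot at least reuses the downloaded layers across imports, and pyxis also has a --container-name option for reusing an already-unpacked container instead of re-importing the image each time. A minimal sketch, assuming your pyxis version has this flag and both jobs land on the same node:

$ srun --container-image "nvcr.io/nvidia/pytorch:20.12-py3" --container-name=pytorch true
$ srun --container-name=pytorch grep PRETTY /etc/os-release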

Some questions to continue investigating:

- What does CPU utilization look like on the node while the import is running?
- Does a manual enroot import of the same image, outside of Slurm, complete quickly?
- Which enroot version are you using?
- Which Slurm version are you using?

BlueCloudDev commented 3 years ago

As far as CPU utilization goes, I'm not seeing much at all on the node. This is a 64-core 8x A100 system, and next to nothing shows up in htop while running through Slurm.

I modified the command you gave to account for the authentication with $oauthtoken.

[opc@inst-rmdks-modest-cobra tmp]$ sudo enroot import --output /run/pyxis/test.squashfs docker://$oauthtoken@nvcr.io#nvidia/pytorch:20.12-py3
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[INFO] Downloading 59 missing layers...

100% 59:0=0s 76c8b977361605981037d3f2290e042752b13007bcbadf39c0d6af61dd01675c

[INFO] Extracting image layers...

100% 58:0=0s 6a5697faee43339ef8e33e3839060252392ad99325a48f7c9d7e93c22db4d4cf

[INFO] Converting whiteouts...

100% 58:0=0s 6a5697faee43339ef8e33e3839060252392ad99325a48f7c9d7e93c22db4d4cf

[INFO] Creating squashfs filesystem...

Parallel mksquashfs: Using 128 processors
Creating 4.0 filesystem on /run/pyxis/test.squashfs, block size 131072.
[=========================================================================================================================|] 245848/245848 100%

Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
        uncompressed data, compressed metadata, compressed fragments, compressed xattrs
        duplicates are removed
Filesystem size 12754798.54 Kbytes (12455.86 Mbytes)
        92.89% of uncompressed filesystem size (13730493.38 Kbytes)
Inode table size 2150147 bytes (2099.75 Kbytes)
        32.17% of uncompressed inode table size (6682988 bytes)
Directory table size 2200012 bytes (2148.45 Kbytes)
        43.00% of uncompressed directory table size (5116464 bytes)
Number of duplicate files found 28404
Number of inodes 178050
Number of files 149503
Number of fragments 8510
Number of symbolic links  4261
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 24286
Number of ids (unique uids + gids) 1
Number of uids 1
        root (0)
Number of gids 1
        root (0)

The process executed quickly.

Enroot version

[opc@inst-rmdks-modest-cobra tmp]$ enroot version
3.2.0

Slurm version

[opc@inst-rmdks-modest-cobra tmp]$ sinfo --version
slurm 20.11.5

flx42 commented 3 years ago

So with enroot directly you are using more cores indeed:

Parallel mksquashfs: Using 128 processors

I would recommend giving all cores to your Slurm job, trying again with pyxis, and then checking whether the result is any different.

Also, when the container import seems to hang, can you monitor the size of the squashfs file being created? For example, in your case above, please check the size of /run/pyxis/1000/297.0.squashfs over time. @3XX0 and I are thinking it might be increasing very slowly.
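
One way to watch it (a sketch; substitute the path of your job's squashfs file):

$ watch -n 10 ls -l /run/pyxis/1000/297.0.squashfs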

You could also try with ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates in your enroot.conf, this should make the import faster, at the cost of more storage space.

flx42 commented 3 years ago

Sharing your whole enroot.conf could also be useful here.

BlueCloudDev commented 3 years ago

enroot.conf - mostly empty

[opc@inst-rmdks-modest-cobra tmp]$ cat /etc/enroot/enroot.conf
#ENROOT_LIBRARY_PATH        /usr/lib/enroot
#ENROOT_SYSCONF_PATH        /etc/enroot
ENROOT_RUNTIME_PATH        /tmp/enroot/user-$(id -u)
#ENROOT_CONFIG_PATH         ${XDG_CONFIG_HOME}/enroot
#ENROOT_CACHE_PATH          ${XDG_CACHE_HOME}/enroot
ENROOT_DATA_PATH           /tmp/enroot-data/user-$(id -u)
#ENROOT_TEMP_PATH           ${TMPDIR:-/tmp}

# Gzip program used to uncompress digest layers.
#ENROOT_GZIP_PROGRAM        gzip

# Options passed to zstd to compress digest layers.
#ENROOT_ZSTD_OPTIONS        -1

# Options passed to mksquashfs to produce container images.
#ENROOT_SQUASH_OPTIONS      -comp lzo -noD

# Make the container root filesystem writable by default.
#ENROOT_ROOTFS_WRITABLE     no

# Remap the current user to root inside containers by default.
#ENROOT_REMAP_ROOT          no

# Maximum number of processors to use for parallel tasks (0 means unlimited).
#ENROOT_MAX_PROCESSORS      $(nproc)

# Maximum number of concurrent connections (0 means unlimited).
#ENROOT_MAX_CONNECTIONS     10

# Maximum time in seconds to wait for connections establishment (0 means unlimited).
#ENROOT_CONNECT_TIMEOUT     30

# Maximum time in seconds to wait for network operations to complete (0 means unlimited).
#ENROOT_TRANSFER_TIMEOUT    0

# Number of times network operations should be retried.
#ENROOT_TRANSFER_RETRIES    0

# Use a login shell to run the container initialization.
#ENROOT_LOGIN_SHELL         yes

# Allow root to retain his superuser privileges inside containers.
#ENROOT_ALLOW_SUPERUSER     no

# Use HTTP for outgoing requests instead of HTTPS (UNSECURE!).
#ENROOT_ALLOW_HTTP          no

# Include user-specific configuration inside bundles by default.
#ENROOT_BUNDLE_ALL          no

# Generate an embedded checksum inside bundles by default.
#ENROOT_BUNDLE_CHECKSUM     no

# Mount the current user's home directory by default.
#ENROOT_MOUNT_HOME          no

# Restrict /dev inside the container to a minimal set of devices.
#ENROOT_RESTRICT_DEV        no

# Always use --force on command invocations.
#ENROOT_FORCE_OVERRIDE      no

# SSL certificates settings:
#SSL_CERT_DIR
#SSL_CERT_FILE

# Proxy settings:
#all_proxy
#no_proxy
#http_proxy
#https_proxy

[opc@modest-cobra-bastion examples]$ srun --cpus-per-task=128 --container-image "nvcr.io/nvidia/pytorch:20.12-py3" grep PRETTY /etc/os-release
pyxis: importing docker image ...
PRETTY_NAME="Ubuntu 20.04.1 LTS"

This works; it seems to have been limited by the number of CPUs being set to 2 somewhere. Where should default resource limits be configured, Slurm or pyxis?

flx42 commented 3 years ago

This works; it seems to have been limited by the number of CPUs being set to 2 somewhere. Where should default resource limits be configured, Slurm or pyxis?

It's how the Slurm job (or Slurm partition) was configured; Slurm likely created a cpuset cgroup to limit which cores your process can access. For example:

$ srun -l --ntasks-per-node=1 grep Cpus_allowed_list /proc/self/status
0: Cpus_allowed_list:   0,6

Here it's 2 logical cores (1 physical core). But if the partition is set to exclusive, you would see all cores available. Pyxis uses all the cores allocated to the Slurm job.
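
If you want to verify, requesting an exclusive allocation (assuming the partition permits it) should show the full core range:

$ srun --exclusive -l --ntasks-per-node=1 grep Cpus_allowed_list /proc/self/status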

I also recommend setting ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates in enroot.conf; it should speed up the process even if you have only 2 cores available.
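
In the enroot.conf you posted above, that means replacing the commented-out default, e.g.:

# Options passed to mksquashfs to produce container images.
#ENROOT_SQUASH_OPTIONS      -comp lzo -noD
ENROOT_SQUASH_OPTIONS       -noI -noD -noF -noX -no-duplicates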

flx42 commented 3 years ago

I think this issue is now resolved, closing. Please reopen this bug or open a new one if you have other questions on this topic.