What version of squashfs-tools do you have? Also, can you paste the output of uname -a?
[opc@inst-2a21p-distinct-cougar ~]$ rpm -qa | grep squash
squashfuse-libs-0.1.102-1.el7.x86_64
squashfuse-0.1.102-1.el7.x86_64
squashfs-tools-4.3-0.21.gitaae0aff4.el7.x86_64
[opc@inst-2a21p-distinct-cougar ~]$ uname -a
Linux inst-2a21p-distinct-cougar 5.4.17-2036.100.6.1.el7uek.x86_64 #2 SMP Thu Oct 29 17:04:48 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Can you try tuning the block size, e.g. ENROOT_SQUASH_OPTIONS='-b 262144'? Also make sure that you have a sufficient number of file descriptors available (i.e. try tuning ulimit -n).
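For reference, the block size an existing image was built with can be read from its superblock; a quick sketch, using the example image name from below:
$ unsquashfs -s nvidia+cuda+10.0-base.sqsh    # prints superblock info, including "Block size"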
I raised the limit for open files prior to this issue to avoid problems during training workloads.
[opc@inst-2a21p-distinct-cougar ~]$ ulimit -n
211238876
ENROOT_SQUASH_OPTIONS doesn't seem to have an impact with that value.
[opc@inst-2a21p-distinct-cougar ~]$ export ENROOT_SQUASH_OPTIONS='-b 262144'
[opc@inst-2a21p-distinct-cougar ~]$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The cluster is 2x A100 systems with 8x GPUs each, plus a bastion node. This issue is happening on the other node as well.
[opc@inst-bpn9t-distinct-cougar ~]$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The only custom settings in enroot.conf are:
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)
We have also tried uninstalling enroot + squashfs and reinstalling, which did not resolve the issue.
You need to import the image again after setting ENROOT_SQUASH_OPTIONS; enroot create just unpacks its content.
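A sketch of what the re-import could look like (assuming the image comes from the public nvidia/cuda registry, as the .sqsh name suggests):
$ export ENROOT_SQUASH_OPTIONS='-b 262144'
$ enroot import docker://nvidia/cuda:10.0-base    # rebuilds nvidia+cuda+10.0-base.sqsh with the new block size
$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh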
How did you install squashfs-tools? You could try a newer version to see if it changes anything.
I'm trying to build a solution that automates GPU testing through images. My impression was that I could import an image from docker, and then upload the .sqsh file to object storage for later use on other systems by downloading the artifact and creating a container through enroot create. I do not want to have a dependency on docker, private docker repositories, or any public internet access. Is that architecture not possible with enroot?
Yeah, this is fine, but it seems that you're running into a limitation of squashfs-tools, so increasing the block size when you import it the first time might help.
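For reference, a sketch of that two-phase workflow (the object-storage upload/download commands are placeholders, not a specific CLI):
# on a machine with registry access: build the artifact once
$ enroot import docker://nvidia/cuda:10.0-base
$ <upload> nvidia+cuda+10.0-base.sqsh          # placeholder: copy the .sqsh to object storage
# on the GPU nodes: no docker or registry access needed
$ <download> nvidia+cuda+10.0-base.sqsh        # placeholder: fetch the artifact
$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
$ enroot start cuda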
Setting ENROOT_SQUASH_OPTIONS as specified and reimporting doesn't work for either my custom image or the example Ubuntu and CUDA images. Do you know what limitation I'm running into specifically? squashfs-tools was installed by following the installation directions for enroot.
What seems odd to me is that this image was working fine, then suddenly both nodes started having this problem at the same time. I had been running this image with pyxis for a few weeks now.
Looking at the squashfs code, the problem seems to be that your ulimit -n might actually be too high and cause the overflow. Maybe it worked before you raised it?
I confirm that ulimit -n being too high is the issue; I was able to reproduce the problem on Ubuntu 20.04 with the same ulimit:
# ulimit -n
211238876
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The overflow happens in the queue_init function; an overflow check is done on (size + 1) * sizeof(void*):
https://github.com/plougher/squashfs-tools/blob/4.4/squashfs-tools/unsquashfs.c#L180
This function is called multiple times, and in one case it uses max_files * 2 + data_buffer_size as the input size:
https://github.com/plougher/squashfs-tools/blob/4.4/squashfs-tools/unsquashfs.c#L2360
Skipping a few steps, the following inequality must be satisfied:
(((rlimit_nofile - OPEN_FILE_MARGIN) * 2 + data_buffer_size) + 1) * sizeof(void*) <= INT_MAX
where rlimit_nofile=211238876 in your case, OPEN_FILE_MARGIN=10, data_buffer_size depends on the squashfs file but is 2048 by default, sizeof(void*)=8, and INT_MAX=2**31 - 1.
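Plugging those values in (a quick shell check of the arithmetic, assuming the default data_buffer_size of 2048):
$ echo $(( ((2**31 - 1) / 8 - 1 - 2048) / 2 + 10 ))    # largest rlimit_nofile satisfying the inequality
134216713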
So for a typical block size, 134216713 is the upper bound for ulimit -n, and we probably need to file a bug against squashfs-tools to see if this can be fixed.
# ulimit -n 134216713
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
Parallel unsquashfs: Using 8 processors
3394 inodes (3914 blocks) to write
[============================================================================================================================================================================================================================================================|] 3914/3914 100%
created 2882 files
created 652 directories
created 508 symlinks
created 0 devices
created 0 fifos
# ulimit -n 134216714
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
[opc@inst-2a21p-distinct-cougar ~]$ ulimit -n 50000
[opc@inst-2a21p-distinct-cougar ~]$ enroot create nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
Parallel unsquashfs: Using 128 processors
3394 inodes (3914 blocks) to write
[=================================================================================================/] 3914/3914 100%
created 2882 files
created 652 directories
created 508 symlinks
created 0 devices
created 0 fifos
Setting ulimit to a lower value worked. Thanks for the help! This can be closed from my perspective.
Thanks for reporting, we will keep it open and file a bug against squashfs-tools. We might also set the ulimit in the import as a workaround for now.
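A minimal sketch of that kind of workaround, applied manually for now (lowering the soft limit only for the extraction step, in a subshell so the rest of the session keeps its limit; 65536 is just an example value below the overflow threshold):
$ ( ulimit -S -n 65536; enroot create --name cuda nvidia+cuda+10.0-base.sqsh )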
[opc@inst-2a21p-distinct-cougar ~]$ enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
I can't import a test image, or the custom image I'm trying to import. Searching for the error on Google reveals only the source code for the issue.