What version of squashfs-tools do you have? Also, can you paste the output of uname -a?
[opc@inst-2a21p-distinct-cougar ~]$ rpm -qa | grep squash
squashfuse-libs-0.1.102-1.el7.x86_64
squashfuse-0.1.102-1.el7.x86_64
squashfs-tools-4.3-0.21.gitaae0aff4.el7.x86_64
[opc@inst-2a21p-distinct-cougar ~]$ uname -a
Linux inst-2a21p-distinct-cougar 5.4.17-2036.100.6.1.el7uek.x86_64 #2 SMP Thu Oct 29 17:04:48 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Can you try tuning the block size, e.g. ENROOT_SQUASH_OPTIONS='-b 262144'? Also make sure that you have a sufficient number of file descriptors available (i.e. try tuning ulimit -n).
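For reference, the block size an existing image was built with can be read from its superblock; a quick sketch, using the example image name from below:
$ unsquashfs -s nvidia+cuda+10.0-base.sqsh    # prints superblock info, including "Block size"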
I raised the limit for open files prior to this issue to avoid problems during training workloads.
[opc@inst-2a21p-distinct-cougar ~]$ ulimit -n
211238876
ENROOT_SQUASH_OPTIONS doesn't seem to have an impact with that value.
[opc@inst-2a21p-distinct-cougar ~]$ export ENROOT_SQUASH_OPTIONS='-b 262144'
[opc@inst-2a21p-distinct-cougar ~]$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The cluster is 2x A100 systems with 8x GPUs each, plus a bastion node. This issue is happening on the other node as well.
[opc@inst-bpn9t-distinct-cougar ~]$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The only custom settings in enroot.conf are:
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)
We have also tried uninstalling enroot + squashfs and reinstalling, which did not resolve the issue.
You need to import the image again after setting ENROOT_SQUASH_OPTIONS; enroot create just unpacks its content.
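A sketch of what the re-import could look like (assuming the image comes from the public nvidia/cuda registry, as the .sqsh name suggests):
$ export ENROOT_SQUASH_OPTIONS='-b 262144'
$ enroot import docker://nvidia/cuda:10.0-base    # rebuilds nvidia+cuda+10.0-base.sqsh with the new block size
$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh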
How did you install squashfs-tools? You could try a newer version to see if it changes anything.
I'm trying to build a solution that automates GPU testing through images. My impression was that I could import an image from docker, and then upload the .sqsh file to object storage for later use on other systems by downloading the artifact and creating a container through enroot create. I do not want to have a dependency on docker, private docker repositories, or any public internet access. Is that architecture not possible with enroot?
Yeah, this is fine, but it seems that you're running into a limitation of squashfs-tools, so increasing the block size when you import it the first time might help.
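For reference, a sketch of that two-phase workflow (the object-storage upload/download commands are placeholders, not a specific CLI):
# on a machine with registry access: build the artifact once
$ enroot import docker://nvidia/cuda:10.0-base
$ <upload> nvidia+cuda+10.0-base.sqsh          # placeholder: copy the .sqsh to object storage
# on the GPU nodes: no docker or registry access needed
$ <download> nvidia+cuda+10.0-base.sqsh        # placeholder: fetch the artifact
$ enroot create --name cuda nvidia+cuda+10.0-base.sqsh
$ enroot start cuda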
Setting ENROOT_SQUASH_OPTIONS as specified and reimporting doesn't work for either my custom image or the example Ubuntu and CUDA images. Do you know what limitation I'm running into specifically? squashfs-tools was installed by following the installation directions for enroot.
What seems odd to me is that this image was working fine, then suddenly both nodes started having this problem at the same time. I had been running this image with pyxis for a few weeks now.
Looking at the squashfs code, the problem seems to be that your ulimit -n might actually be too high and cause the overflow. Maybe it worked before you raised it?
I confirm that ulimit -n being too high is the issue; I was able to reproduce the problem on Ubuntu 20.04 with the same ulimit:
# ulimit -n
211238876
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
The overflow happens in the queue_init function; an overflow check is done on (size + 1) * sizeof(void*):
https://github.com/plougher/squashfs-tools/blob/4.4/squashfs-tools/unsquashfs.c#L180
This function is called multiple times, and in one case it uses max_files * 2 + data_buffer_size as the input size:
https://github.com/plougher/squashfs-tools/blob/4.4/squashfs-tools/unsquashfs.c#L2360
Skipping a few steps, the following inequality must be satisfied:
(((rlimit_nofile - OPEN_FILE_MARGIN) * 2 + data_buffer_size) + 1) * sizeof(void*) <= INT_MAX
where rlimit_nofile=211238876 in your case, OPEN_FILE_MARGIN=10, data_buffer_size depends on the squashfs file but is 2048 by default, sizeof(void*)=8, and INT_MAX=2**31 - 1.
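Plugging those values in (a quick shell check of the arithmetic, assuming the default data_buffer_size of 2048):
$ echo $(( ((2**31 - 1) / 8 - 1 - 2048) / 2 + 10 ))    # largest rlimit_nofile satisfying the inequality
134216713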
So for a typical block size, 134216713 is the upper bound for ulimit -n, and we probably need to file a bug against squashfs-tools to see if this can be fixed.
# ulimit -n 134216713
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
Parallel unsquashfs: Using 8 processors
3394 inodes (3914 blocks) to write
[============================================================================================================================================================================================================================================================|] 3914/3914 100%
created 2882 files
created 652 directories
created 508 symlinks
created 0 devices
created 0 fifos
# ulimit -n 134216714
# enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
[opc@inst-2a21p-distinct-cougar ~]$ ulimit -n 50000
[opc@inst-2a21p-distinct-cougar ~]$ enroot create nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
Parallel unsquashfs: Using 128 processors
3394 inodes (3914 blocks) to write
[=================================================================================================/] 3914/3914 100%
created 2882 files
created 652 directories
created 508 symlinks
created 0 devices
created 0 fifos
Setting ulimit to a lower value worked. Thanks for the help! This can be closed from my perspective.
Thanks for reporting, we will keep it open and file a bug against squashfs-tools. We might also set the ulimit in the import as a workaround for now.
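A minimal sketch of that kind of workaround, applied manually for now (lowering the soft limit only for the extraction step, in a subshell so the rest of the session keeps its limit; 65536 is just an example value below the overflow threshold):
$ ( ulimit -S -n 65536; enroot create --name cuda nvidia+cuda+10.0-base.sqsh )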
[opc@inst-2a21p-distinct-cougar ~]$ enroot create -n cuda nvidia+cuda+10.0-base.sqsh
[INFO] Extracting squashfs filesystem...
FATAL ERROR:Size too large in queue_init
I can't import a test image, or the custom image I'm trying to import. Searching for the error on Google reveals only the source code for the issue.