rdsquashfs either hangs or is very slow

goverp commented 11 months ago

I'm using squashfs-tools-ng v1.2.0 on Gentoo on an amd64 machine with lots of memory and a Zen 3 chip. I have a 70 MB file that's a squashfs version (lzo compressed) of a 190 MB directory tree with very many small files (146,000 inodes). For some testing I wanted the original uncompressed tree, so I ran "rdsquashfs -qu / foo".

It appeared to run very slowly (without the -q, the screen filled rapidly with the names of files, as expected, but there are rather a lot). After an age I killed it with Ctrl-C. The top level directories appeared to all exist - I don't know if they were fully populated. I repeated the extract, assuming I hadn't given enough time, or something, but it was still running after more than an hour. "top" showed no significant processing; "iotop" showed rdsquashfs was the heaviest I/O consumer, but only doing 100-200 KB/sec (my 5-disk RAID10 system can achieve 400 MB/sec, so it's not that holding it up).

At this point I realised I could do what I wanted by mounting the squash image and reading it as input (Doh!) - I didn't need to run rdsquashfs at all. This was goodness, as I could read and process the entire directory tree in less than a second! But that leaves something weird in rdsquashfs!

I don't know how the squashfs image was created - it's a Gentoo portage snapshot from a Gentoo mirror, for example: https://www.mirrorservice.org/sites/distfiles.gentoo.org/snapshots/squashfs/gentoo-20230713.lzo.sqfs

AgentD commented 11 months ago

Hi,

if you are unpacking the entire image, that is going to be slower than mounting it and accessing it. rdsquashfs essentially does the following:

1) The entire directory tree is scanned and reconstructed in memory.
2) It is sorted and sanity checked (i.e. no two files with the same name in a directory; if one of them was a symlink, this could be used for directory traversal, a well known issue with archiving programs).
3) The directory tree is recursively created on the output filesystem.
4) The files are sorted so that the image is accessed mostly sequentially and tail-end blocks don't have to be unpacked several times over.
5) The files are unpacked.
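As an illustration of the sanity check in the list above, here is a minimal Python sketch of detecting duplicate names within a directory. It is not the rdsquashfs implementation (which is written in C); the function name `find_duplicate_entries` and the `(directory, name)` input shape are made up for this example.

```python
from collections import defaultdict


def find_duplicate_entries(entries):
    """Return {directory: [names]} for names that occur more than once
    in the same directory.

    `entries` is an iterable of (directory, name) pairs. This input shape
    is an assumption for the sketch, not how rdsquashfs stores its tree.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for directory, name in entries:
        counts[directory][name] += 1

    duplicates = {}
    for directory, names in counts.items():
        repeated = sorted(n for n, c in names.items() if c > 1)
        if repeated:
            duplicates[directory] = repeated
    return duplicates
```

An unpacker following this approach would refuse to extract an image for which the returned dictionary is non-empty, since a repeated name could pair a symlink with a file that is then written through it.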

In contrast, if you mount the image, only step 1 happens. It also happens asynchronously, on demand as you start traversing directories. If you don't access the file contents, no file blocks have to be unpacked either, only the metadata blocks from the inode and directory table. The SquashFS kernel driver furthermore has a multi-threaded decompressor queue, and caches metadata blocks.

If you are only interested in inspecting directory listings, rdsquashfs -l <path> <image> produces a tar-style listing of a selected directory.

Alternatively, rdsquashfs -d <image> produces a listing of the entire image, intended to be compatible with the input format for gensquashfs, i.e. you'll get lines of the shape <type> <path> <mode> <uid> <gid> <extra>. For the image you linked to, producing such a listing takes about a second of pre-processing time on my 6-year-old laptop, as it recurses through the directory tree.
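If you want to post-process such a listing, a rough Python sketch of a line parser could look like the following. It relies on assumptions not stated above: fields are separated by whitespace, a path containing spaces is enclosed in double quotes, and the mode is written in octal. `ListingEntry` and `parse_listing_line` are names invented for this example.

```python
import shlex
from typing import NamedTuple


class ListingEntry(NamedTuple):
    kind: str
    path: str
    mode: int
    uid: int
    gid: int
    extra: list


def parse_listing_line(line):
    """Parse one '<type> <path> <mode> <uid> <gid> <extra>' line.

    Assumes shell-like quoting for paths with spaces and an octal mode;
    both are assumptions of this sketch.
    """
    fields = shlex.split(line)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    kind, path, mode, uid, gid, *extra = fields
    return ListingEntry(kind, path, int(mode, 8), int(uid), int(gid), extra)
```

For example, a made-up line such as `dir /usr 0755 0 0` would come back with `kind` set to `dir`, `mode` set to `0o755`, and an empty `extra` list.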

Dr-Emann commented 11 months ago

Over an hour for a 190 MB directory tree seems excessive though.

unsquashfs unpacks the same image in about 3 seconds, and I gave up after a few minutes with rdsquashfs. Something seems off.

Dr-Emann commented 11 months ago

Ah, needed to wait a little more, not seeing over an hour here, but still pretty long:

Executed in  198.97 secs    fish           external
   usr time    3.35 secs    0.00 micros    3.35 secs
   sys time   16.46 secs  780.00 micros   16.46 secs

Gottox commented 11 months ago

> It is sorted and sanity checked (i.e. no two files with the same name in a directory; if one of them was a symlink, this could be used for directory traversal, a well known issue with archiving programs)

Do you have a testcase for this issue or a malformed sqfs archive?

AgentD commented 10 months ago

@Gottox there is an intentionally broken archive in https://github.com/AgentD/squashfs-tools-ng/blob/master/bin/rdsquashfs/test/pathtraversal.sqfs, along with a script that runs rdsquashfs to unpack it and checks if the file in question was created. This test is run by make check along with all the other unit & integration tests.

unsquashfs from squashfs-tools also guards against this kind of issue. There allegedly are "extensive tests" run before releases, but I'm not aware of any publicly available test suites.

Other archivers guard against this as well (e.g. GNU tar, BusyBox tar, ...), as this kind of problem plagues pretty much every format that supports symlinks.

Gottox commented 10 months ago

Thanks @AgentD!

libsqsh does sanity checking while extracting, not beforehand. I guess that's a faster approach, at the cost of accepting some malformed archives. So, in that regard, it's just as secure as tar. sqsh-unpack uses mkstemp-extract-rename semantics to prevent writing through symlinks, which means the check doesn't need to be done in the library.
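To illustrate the mkstemp-extract-rename idea, here is a minimal Python sketch. It is not the sqsh-unpack code; `extract_file` is a hypothetical helper. The data is written to a fresh temporary file in the target directory and then renamed over the destination, so a symlink sitting at the destination path is replaced rather than followed.

```python
import os
import tempfile


def extract_file(dest_path, data):
    """Write `data` to dest_path without following a symlink at dest_path.

    Hypothetical helper for illustration. It only protects the final path
    component; symlinks in parent directories are not handled here.
    """
    directory = os.path.dirname(dest_path) or "."
    # mkstemp creates a new, uniquely named file and fails rather than
    # reusing an existing name.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # rename replaces whatever is at dest_path, including a symlink,
        # instead of writing through it.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.lexists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The trade-off is one extra rename per file, in exchange for not having to validate the whole tree before extraction starts.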

Personally, I doubt that squashfs-tools has a decent test suite.

AgentD / squashfs-tools-ng

rdsquashfs either hangs or is very slow #118