AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

tar2sqfs is slower than mksquashfs #30

Closed. bdrung closed this issue 4 years ago.

bdrung commented 4 years ago

I am comparing mksquashfs against tar2sqfs (mksquashfs reads the extracted tarball, tar2sqfs the uncompressed tarball):

time sudo mksquashfs root root1.squashfs -comp xz -b 524288 -no-exports -no-progress
time tar2sqfs -q --no-skip -c xz -b 524288 root2.squashfs < root.tar

On my Core i7-8850H laptop (12 threads), mksquashfs keeps all cores busy for the entire run and takes 47 seconds. tar2sqfs uses all cores at the beginning, but then drops to a single core and takes 122 seconds.

I don't know what tar2sqfs does in that last phase, but it would be nice if it could be done in parallel. I am using squashfs-tools-ng 0.7 on Ubuntu 19.10.

AgentD commented 4 years ago

I'm not an expert on the matter, but I would consider the currently implemented parallelization strategy sub-optimal. Also, squashfs-tools-ng uses the zlib crc32 for deduplicating blocks while mksquashfs uses a 16 bit BSD checksum, which I suspect contributes a constant factor to those 122 seconds.
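
For illustration, here is a minimal sketch of the two checksum flavours being compared. zlib's crc32() is the real zlib call; the 16 bit BSD sum is the classic rotate-and-add algorithm; the helper names, block size and main() are made up and not taken from either code base. Either hash only acts as a pre-filter: on a match, the candidate blocks still have to be compared byte for byte before anything is deduplicated.

/* Sketch only (compile with -lz): the two dedup checksums mentioned above.
 * Function names and usage are illustrative, not code from either project. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* zlib CRC32 over one data block (the check squashfs-tools-ng uses). */
static uint32_t block_crc32(const unsigned char *blk, size_t len)
{
    return (uint32_t)crc32(crc32(0L, Z_NULL, 0), blk, (uInt)len);
}

/* Classic 16 bit BSD checksum: rotate the sum right by one bit, then add
 * the next byte (the cheaper hash mksquashfs uses). */
static uint16_t block_bsd_sum(const unsigned char *blk, size_t len)
{
    uint16_t sum = 0;

    for (size_t i = 0; i < len; ++i) {
        sum = (uint16_t)((sum >> 1) | (sum << 15)); /* rotate right by one */
        sum = (uint16_t)(sum + blk[i]);
    }
    return sum;
}

int main(void)
{
    unsigned char block[4096];

    memset(block, 0xAB, sizeof(block));
    printf("crc32: %08lx  bsd16: %04x\n",
           (unsigned long)block_crc32(block, sizeof(block)),
           (unsigned)block_bsd_sum(block, sizeof(block)));
    return 0;
}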

So far I have used the Debian live DVD as a benchmark (i.e. sqfs2tar debian.sqfs | tar2sqfs test.sqfs). After getting that down from 45 minutes to 12 minutes while still producing the same image, I was happy with it and went on to focus on other things.

I will have to look into this some more and possibly have to pester someone who's better at optimizing parallel code for advice.

AgentD commented 4 years ago

I spent some time trying to clean up the code, staring at traces in hotspot and trying to figure out what's going on.

I implemented a revised strategy for parallel block compression, outlined in doc/parallelism.txt.

As described in that file, I don't have actual measurements yet, but the perf/hotspot traces for the current master now look much better, keeping the CPU maxed out most of the time during my tests. Unpacking and repacking the 2 GiB Debian image on a somewhat older 4-core Xeon test machine has also dropped further, from 12 minutes down to 7.
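
To make the ordering constraint concrete, here is a small, self-contained sketch of parallel block compression with in-order output. It is not the scheme from doc/parallelism.txt; it uses zlib's compress2() instead of xz to keep it short, and all names, sizes and thread counts are made up. Compile with -pthread -lz.

/* Sketch only: worker threads compress blocks independently while the
 * writer consumes the results strictly in block order. Not the actual
 * implementation; zlib stands in for xz. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define NUM_BLOCKS  64
#define BLOCK_SIZE  4096
#define NUM_WORKERS 4

static unsigned char input[NUM_BLOCKS][BLOCK_SIZE];

struct result {
    unsigned char data[BLOCK_SIZE * 2]; /* generous worst-case output size */
    unsigned long size;
    int done;
};

static struct result results[NUM_BLOCKS];
static size_t next_in = 0; /* next block index to hand to a worker */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t block_done = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;

    for (;;) {
        pthread_mutex_lock(&lock);
        if (next_in >= NUM_BLOCKS) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        size_t idx = next_in++;
        pthread_mutex_unlock(&lock);

        /* The CPU-heavy compression runs outside the lock. */
        uLongf outlen = sizeof(results[idx].data);
        if (compress2(results[idx].data, &outlen,
                      input[idx], BLOCK_SIZE, 9) != Z_OK)
            abort();

        pthread_mutex_lock(&lock);
        results[idx].size = outlen;
        results[idx].done = 1;
        pthread_cond_broadcast(&block_done);
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];

    for (size_t i = 0; i < NUM_BLOCKS; ++i)
        memset(input[i], (int)i, BLOCK_SIZE);

    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);

    /* Writer: wait for block i before block i+1, so the output layout
     * is deterministic regardless of which worker finishes first. */
    for (size_t i = 0; i < NUM_BLOCKS; ++i) {
        pthread_mutex_lock(&lock);
        while (!results[i].done)
            pthread_cond_wait(&block_done, &lock);
        pthread_mutex_unlock(&lock);
        printf("block %zu: %d -> %lu bytes\n",
               i, BLOCK_SIZE, results[i].size);
    }

    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}

The writer has to emit compressed blocks in submission order, so a slow block can stall it even while the workers stay busy, which is the kind of behaviour a smarter scheduling strategy tries to avoid.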

bdrung commented 4 years ago

I recommend running the benchmark on a tmpfs, since you have enough memory. Then you can compare the result with a run on a storage system to see whether it is I/O bound.