bdrung closed this issue 4 years ago
Since I'm not an expert on that matter, I would consider the currently implemented parallelization strategy sub-optimal. Also, squashfs-tools-ng uses zlib's crc32 for deduplicating blocks, while mksquashfs uses a 16-bit BSD checksum, which I suspect might contribute a constant factor to those 122 seconds.
So far I have used the Debian live DVD as a benchmark (i.e. `sqfs2tar debian.sqfs | tar2sqfs test.sqfs`). I was happy after reducing the runtime from 45 minutes to 12 minutes while still producing the same image, and went on to focus on other things.
I will have to look into this some more and possibly have to pester someone who's better at optimizing parallel code for advice.
I spent some time trying to clean up the code, staring at traces in hotspot and trying to figure out what's going on.
I implemented a revised strategy for parallel block compression, outlined in doc/parallelism.txt.
As described in that file, I don't have actual measurements yet, but the perf/hotspot traces for current master now look much better, keeping the CPU maxed out most of the time during my tests. Unpacking and repacking the 2 GiB Debian image on a (somewhat older) 4-core Xeon test machine has been reduced further, from 12 minutes down to 7.
I recommend running the benchmark on tmpfs, since you have enough memory. You can then compare the result against running it on real storage to see whether it is I/O bound.
I am comparing mksquashfs against tar2sqfs (extracted tarball vs uncompressed tarball):
On my Core i7-8850H laptop (12 threads), mksquashfs utilizes all cores the whole time and takes 47 seconds. tar2sqfs utilizes all cores in the beginning, but then drops to a single core, and takes 122 seconds.
I don't know what tar2sqfs does in that last phase, but it would be nice if it could be done in parallel. I am using squashfs-tools-ng 0.7 on Ubuntu 19.10.