AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

tar2sqfs/gensquashfs do not parallelize compression [well?] #19

Closed · mattst88 closed 4 years ago

mattst88 commented 4 years ago

When using tar2sqfs or gensquashfs, CPU usage never goes over ~105%, regardless of the number of jobs I request. scanelf -n indeed shows that they are linked against libpthread.
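One quick way to check whether worker threads are actually being spawned, independent of CPU percentages, is the NLWP (number of light-weight processes, i.e. threads) column of `ps`. This sketch uses the current shell as a stand-in target; in practice you would point `-p` at a running tar2sqfs or gensquashfs PID:

```shell
# NLWP = number of kernel threads the process currently has.
# A single-threaded process reports 1; a thread pool with N workers
# should report noticeably more than 1 while compression is running.
# $$ (the current shell) is only a placeholder target here.
ps -o nlwp= -p $$
```

If NLWP stays at 1 even with `-j 8`, the threads were never created; if it shows 9 but CPU stays near 100%, the workers exist but are starved for work.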

I use the following script to produce squashfs images containing Gentoo's ebuild repository:

#!/bin/bash

set -x -o pipefail

: ${repouri:="https://anongit.gentoo.org/git/repo/sync/gentoo.git"}
: ${outdir:="/var/db/repo/sqfs"}
: ${gitdir:="/tmp/gentoo.git"}
: ${compression_level:="4"}
: ${compression:="zstd"}
: ${use_worktree:="yes"}

if [ ! -d "$gitdir" ]; then
        git clone --bare --depth=1 "$repouri" "$gitdir"
        git -C "$gitdir" config --add remote.origin.fetch '+refs/*:refs/*'
fi

git -C "$gitdir" fetch -q --depth=1
git -C "$gitdir" reflog expire --all --expire=now
git -C "$gitdir" prune --expire=now

if [ -n "$use_worktree" ]; then
        tmp=$(mktemp -d)
        worktree=$tmp/gentoo

        git -C "$gitdir" worktree add -q "$worktree" stable

        #mksquashfs "$worktree" "$outdir"/gentoo.sqfs.tmp -no-progress -quiet -comp "$compression" -Xcompression-level "$compression_level" -e .git .gitignore
        gensquashfs -D "$worktree" -j 8 -q -c "$compression" -X level="$compression_level" "$outdir"/gentoo.sqfs.tmp
        ret=$?

        git -C "$gitdir" worktree remove --force "$worktree"
        rmdir "$tmp"
else
        git -C "$gitdir" archive --format=tar stable | tar2sqfs "$outdir"/gentoo.sqfs.tmp -q -c "$compression" -X level="$compression_level"
        ret=$?
fi

[ "$ret" -ne 0 ] && exit "$ret"

mv "$outdir"/gentoo.sqfs{.tmp,}

Replacing the gensquashfs line with the mksquashfs line reduces the time to run from minutes to less than 10 seconds. Preferably I would just use tar2sqfs and avoid checking out a git worktree (or even better: add support to git archive for producing squashfs images directly).

Is it expected that tar2sqfs and gensquashfs do not use as many cores as mksquashfs does?

AgentD commented 4 years ago

The thread pool implementation that is behind this is, IMO, pretty stupid. I made sure it works, with the intention of getting back to it later. Interestingly, in my tests it has shown "reasonable" results so far.

As a benchmark I used a squashfs image from the Debian live DVD and ran "sqfs2tar debian.sqfs | tar2sqfs -j 4 -f out.sqfs". The dead simple thread pool managed to get me a 3x speed up, so there is definitely room for improvement. A real 4x speed up is probably unrealistic since there are synchronisation points like fragment blocks.
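For context (my own arithmetic, not from the thread): Amdahl's law turns the observed 3x speed-up on 4 threads into an estimate of how much of the work is serialized, which is exactly the kind of synchronisation cost (fragment blocks, deduplication) described above:

```shell
# Amdahl's law: a speedup S on N threads implies a serial fraction
#   f = (N/S - 1) / (N - 1)
# For the reported S = 3 on N = 4 threads:
awk 'BEGIN { N = 4; S = 3; printf "serial fraction: %.3f\n", (N/S - 1)/(N - 1) }'
# prints: serial fraction: 0.111
```

So roughly 11% of the work being serialized is enough to cap a 4-thread run at 3x, which is consistent with the claim that a true 4x is unrealistic.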

Another factor is memory consumption: mksquashfs fills up the entire RAM. My implementation has a rather low maximum of in-flight blocks and stops filling the queue if that threshold is reached. A few commits back I implemented an option that can be used to increase this backlog.
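For reference, later releases of tar2sqfs and gensquashfs appear to expose this backlog as `-Q`/`--queue-backlog` (I am inferring the flag from the man pages of newer versions; verify with `--help` on whatever you have installed). A sketch of an invocation, syntax-checked only, since actually running it needs tar2sqfs installed and a tar stream on stdin:

```shell
# Assumed flags: -j worker threads, -Q queue backlog (max blocks in
# flight), -c compressor, -X compressor options. The -Q flag and its
# value of 160 are assumptions; check `tar2sqfs --help` on your version.
cmd='tar2sqfs -j 8 -Q 160 -c zstd -X level=4 out.sqfs'

# Only verify that the command line is well-formed shell:
sh -n -c "$cmd" && echo "command parses"
```

Raising the backlog trades memory (more uncompressed blocks buffered) for fewer stalls where the main thread waits on the workers.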

I wouldn't work with the current git tree at the moment though, since I'm still doing refactoring to make libsquashfs.so a thing and haven't run all of the static analysis and regression tests yet.

AgentD commented 4 years ago

Another thing: Does your git tree consist entirely of files smaller than the block size?

If so, that might be an explanation.

Fragment processing (checksumming, de-duplication and indexing; the last two need synchronisation) is done entirely in the main thread. Once a fragment block is full, it is submitted to the work queue and compressed in one of the worker threads.

I did some profiling once and determined the crc32 checksumming to rank rather high among the time wasters. This was also the reason I threw out the crc32 implementation and used the one from zlib, making zlib a hard dependency.

mksquashfs also uses a much simpler 16 bit BSD checksum to determine whether two blocks are equal.
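The two checksums can be poked at from the shell: coreutils `sum` defaults to the 16-bit BSD algorithm, and `cksum` computes a 32-bit POSIX CRC (related to, but not bit-identical with, zlib's crc32). A small sketch comparing them on identical content:

```shell
# sum  -> 16-bit BSD checksum (the kind mksquashfs uses for dedup)
# cksum -> 32-bit CRC (the width libsquashfs uses, via zlib's crc32;
#          cksum's exact variant differs, but the cost comparison holds)
tmp=$(mktemp -d)
printf 'hello squashfs\n' > "$tmp/a"
printf 'hello squashfs\n' > "$tmp/b"    # identical content

sum "$tmp/a"   | awk '{print "BSD sum a:", $1}'
sum "$tmp/b"   | awk '{print "BSD sum b:", $1}'   # matches a
cksum "$tmp/a" | awk '{print "CRC-32 a :", $1}'

rm -r "$tmp"
```

With only 16 bits, the BSD checksum collides far more often, so mksquashfs must fall back to byte comparison more frequently; the trade-off is that computing it is much cheaper per block.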

mattst88 commented 4 years ago

Another factor is memory consumption: mksquashfs fills up the entire RAM. My implementation has a rather low maximum of in-flight blocks and stops filling the queue if that threshold is reached. A few commits back I implemented an option that can be used to increase this backlog.

Thanks, that sounds useful. I would gladly trade memory usage for faster run times.

Another thing: Does your git tree consist entirely of files smaller than the block size?

Pretty close to it. Some shell magic tells me that 95% of the files are <= 4096 bytes.
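For the record, the "shell magic" might look something like this (my own sketch with a self-contained demo tree; the paths are assumptions, not the original command):

```shell
# Fraction of regular files in a tree that are <= 4096 bytes.
# The demo tree below is only for illustration; point `dir` at a real
# checkout (e.g. the Gentoo worktree) to measure it.
dir=${dir:-/tmp/demo-tree}
mkdir -p "$dir"
printf 'small\n' > "$dir/small.txt"        # well under 4096 bytes
head -c 8192 /dev/zero > "$dir/big.bin"    # over 4096 bytes

find "$dir" -type f -printf '%s\n' |
    awk '{ total++; if ($1 <= 4096) small++ }
         END { printf "%d%% of %d files are <= 4096 bytes\n",
                      100 * small / total, total }'
# prints: 50% of 2 files are <= 4096 bytes

rm -r "$dir"
```

With ~95% of files under 4 KiB, nearly every file is a fragment, which puts almost all of the checksumming work on the single main thread described above.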

mattst88 commented 4 years ago

AgentD closed this

Oh, I wasn't aware that we expected this to be solved yet. I retested with v0.7 and for my use case tar2sqfs now takes 12 seconds vs 2 seconds for mksquashfs. So that's a massive improvement over the 90 seconds I recall tar2sqfs taking.

I suspect there's still some performance to be gained, but this is a very good improvement. For my own knowledge, which commits do you think caused the significant performance improvement?

AgentD commented 4 years ago

Thanks for the feedback! It's definitely interesting to hear about the actual impact of the current implementation.

There is room for improvement and I'm definitely not that happy with the current implementation of the thread pool block processor. I considered it good enough for now, with the intention of improving upon it later (with the changes completely hidden behind the API) and doing some actual profiling.

A contributing factor to the performance difference might also be that libsquashfs uses crc32 for block deduplication, while mksquashfs uses a 16 bit BSD checksum.

As expected, moving the checksumming into the worker threads greatly improves performance for applications with lots of files smaller than the block size. This was done in commit 9bc8200, but a lot of refactoring was required to get there. Unfortunately, it is spread over a bunch of commits with other stuff done in between (caused by procrastinating and not cleaning up afterwards).