AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

Still missing squashfs-tools features #4

Closed AgentD closed 4 years ago

AgentD commented 4 years ago

The following features are the only remaining missing parts right now:

For both of them it is somewhat questionable whether they see any real life use in squashfs.

marcthe12 commented 4 years ago

The above two are useful when you have a Portage tree in squashfs. Squashfs-tools itself reports 11226 duplicate files when a Portage tree is used. I think Portage also uses hard links, but I am not too sure.

AgentD commented 4 years ago

Thanks for the feedback!

After having already built the necessary foundations, I added data de-duplication based on comparing size & CRC32 of non-sparse data blocks and doing the same separately for fragments.

Even for my simple test filesystem, the fragment de-duplication alone manages to eliminate a few hundred (!) fragments, so it looks like this could actually pay off. I have yet to test a filesystem which also has duplicate data blocks in it.

renard commented 4 years ago

Is there a way to perform deduplication with tar2sqfs?

AgentD commented 4 years ago

Hi @renard,

Deduplication of blocks and fragments was implemented in version 0.6 and is enabled by default in both gensquashfs and tar2sqfs.

For fragments, a hash table is used (see issue #40 regarding performance problems), and fragments are considered equal if their size and 32-bit hash value match.
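The fragment lookup described above can be sketched roughly as follows. This is only an illustration, not the squashfs-tools-ng code: the `FragmentTable` class and its members are hypothetical names, and `zlib.crc32` stands in for the 32-bit hash so that the example needs nothing outside the Python standard library. As in the description, two fragments are treated as equal when size and hash match, without comparing the bytes themselves.

```python
import zlib
from typing import Dict, List, Tuple


class FragmentTable:
    """Hypothetical sketch of fragment de-duplication keyed by (size, hash)."""

    def __init__(self) -> None:
        # Maps (size, 32-bit hash) to the index of the stored fragment.
        self._index: Dict[Tuple[int, int], int] = {}
        self.fragments: List[bytes] = []
        self.duplicates_omitted = 0

    def add(self, data: bytes) -> int:
        """Store a fragment and return its index, reusing an equal one if known."""
        # Assumption: CRC32 is a stand-in for whatever 32-bit hash is really used.
        key = (len(data), zlib.crc32(data) & 0xFFFFFFFF)
        existing = self._index.get(key)
        if existing is not None:
            # Same size and same hash: considered a duplicate, nothing is stored.
            self.duplicates_omitted += 1
            return existing
        self.fragments.append(data)
        index = len(self.fragments) - 1
        self._index[key] = index
        return index
```

Because only size and hash are compared, two different fragments that happen to collide would be merged; a stricter variant could compare the stored bytes before reusing an entry, at the cost of keeping or re-reading them.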

For every written block, the block writer keeps a record of hash, compressed size and on-disk location. Once a sequence of blocks is completed, it performs a naive substring search, trying to match the sequence of (compressed size, hash) pairs against the blocks written earlier. This could in theory be optimized using a suffix tree, a trie, a multi-valued hash table to find candidates, etc., but so far nobody has complained about poor performance.
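A minimal sketch of such a naive substring search is shown below. It is an assumption-laden illustration rather than the actual block writer: `find_duplicate_run` and the `BlockRecord` alias are hypothetical names, the on-disk location is left out, and `written` is assumed to hold only the records of blocks that were written before the candidate sequence.

```python
from typing import List, Optional, Tuple

# Hypothetical record type: (compressed size, 32-bit hash) of one block.
BlockRecord = Tuple[int, int]


def find_duplicate_run(written: List[BlockRecord],
                       candidate: List[BlockRecord]) -> Optional[int]:
    """Return the index in `written` where `candidate` first occurs, else None."""
    n, m = len(written), len(candidate)
    if m == 0 or m > n:
        return None
    # Naive search: try every start position and compare the whole run.
    for start in range(n - m + 1):
        if written[start:start + m] == candidate:
            return start
    return None
```

For example, with `written = [(10, 1), (20, 2), (30, 3)]` and `candidate = [(20, 2), (30, 3)]`, the function returns `1`, meaning the new sequence could refer to the blocks starting at the second record instead of being stored again. The worst case is proportional to `len(written) * len(candidate)`, which is where the suggested suffix tree or hash table would help.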

As a hash function, xxhash32 is currently used.

renard commented 4 years ago

@AgentD Here is an example of what I did:

# tar LOT_OF_OPTIONS | tar2sqfs -qf -c zstd /tmp/system.squashfs
# unsquashfs -s /tmp/system.squashfs
Found a valid SQUASHFS 4:0 superblock on /tmp/system.squashfs.
Creation or last append time Thu Jan  1 01:00:00 1970
Filesystem size 309109010 bytes (301864.27 Kbytes / 294.79 Mbytes)
Compression zstd
zstd: bad compression level in compression options structure
zstd: error reading stored compressor options from filesystem!
Block size 131072
Filesystem is not exportable via NFS
Inodes are compressed
Data is compressed
Uids/Gids (Id table) are compressed
Fragments are compressed
Always-use-fragments option is specified
Xattrs are compressed
Duplicates are not removed
Number of fragments 3393
Number of inodes 32640
Number of ids 17

And using gzip:

Found a valid SQUASHFS 4:0 superblock on /tmp/system.squashfs.
Creation or last append time Thu Jan  1 01:00:00 1970
Filesystem size 297931334 bytes (290948.57 Kbytes / 284.13 Mbytes)
Compression gzip
Block size 131072
Filesystem is not exportable via NFS
Inodes are compressed
Data is compressed
Uids/Gids (Id table) are compressed
Fragments are compressed
Always-use-fragments option is specified
Xattrs are compressed
Duplicates are not removed
Number of fragments 3393
Number of inodes 32640
Number of ids 17

The summary is:

Data bytes read: 859.1M
Data bytes written: 294.1M
Data compression ratio: 34%

Data blocks written: 7205
Out of which where fragment blocks: 3393
Duplicate blocks omitted: 139
Sparse blocks omitted: 11

Fragments actually written: 24685
Duplicated fragments omitted: 1612
Total number of inodes: 32640
Number of unique group/user IDs: 17

Maybe the unsquashfs tool is buggy or I did something wrong ;-)

unsquashfs version 4.4 (2019/08/29)
tar2sqfs (squashfs-tools-ng) 0.9.1

Both were taken from Debian packages. I haven't checked whether they are patched, though.

AgentD commented 4 years ago

> Block size 131072
> Filesystem is not exportable via NFS
> Inodes are compressed
> Data is compressed
> Uids/Gids (Id table) are compressed
> Fragments are compressed
> Always-use-fragments option is specified
> Xattrs are compressed
> Duplicates are not removed

These are statistics from the super block. There is a purely informational flag to indicate whether duplicates have been removed or not. I just double-checked with the squashfs-tools source code, and apparently the logic is flipped in squashfs-tools-ng.

> Compression zstd
> zstd: bad compression level in compression options structure
> zstd: error reading stored compressor options from filesystem!

This is the more interesting issue. There is a flag in the super block to indicate whether compressor options have been stored after the super block.

If you haven't changed the compressor settings, tar2sqfs should not store any options and also not set this flag, but apparently unsquashfs reads the flag as set, tries to load the options but fails because there are none. I will investigate.
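To see what is actually stored on disk, the two super block flags discussed here can be inspected directly. The sketch below is an assumption-based illustration: the field layout and the bit values (`0x0040` for the duplicates flag, `0x0400` for the compressor options flag) are taken from my reading of the SquashFS 4.0 on-disk format and should be checked against `squashfs_fs.h`; the function name `read_superblock_flags` is hypothetical. It only reports the raw bits and does not decide which tool interprets them correctly.

```python
import struct

# Assumed SquashFS 4.0 values; verify against the squashfs-tools headers.
SQUASHFS_MAGIC = 0x73717368          # the bytes "hsqs" read as little-endian u32
FLAG_DUPLICATES = 0x0040             # informational "duplicates" bit
FLAG_COMPRESSOR_OPTIONS = 0x0400     # compressor options follow the super block

# magic, inode count, mod time, block size, fragment count (5 x u32), then
# compressor, block log, flags, id count, version major, version minor (6 x u16)
_HEADER = struct.Struct("<5I6H")


def read_superblock_flags(header: bytes) -> dict:
    """Decode the two flag bits from the first 32 bytes of an image."""
    if len(header) < _HEADER.size:
        raise ValueError("need at least 32 bytes of super block")
    fields = _HEADER.unpack_from(header)
    magic, flags = fields[0], fields[7]
    if magic != SQUASHFS_MAGIC:
        raise ValueError("not a SquashFS image (bad magic)")
    return {
        "raw_flags": flags,
        "duplicates_bit": bool(flags & FLAG_DUPLICATES),
        "compressor_options_bit": bool(flags & FLAG_COMPRESSOR_OPTIONS),
    }
```

A possible use is `read_superblock_flags(open("/tmp/system.squashfs", "rb").read(32))`, which would show whether the compressor options bit is set in the image produced by tar2sqfs.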