AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

Xattr writer limits archive scale #68

Closed showermat closed 3 years ago

showermat commented 3 years ago

Hello,

I'm trying to use squashfs-tools-ng to create and read archives of several million small files, each having a few xattrs, some of which are highly duplicated and some of which are unique. In general, it's working impressively well, although I am hitting some scaling issues -- unsurprisingly, as I realize that SquashFS was not optimized for quite this scale.

When writing archives using my custom (currently single-threaded and unoptimized) binary, it starts off I/O-bound, processing around 100 files per second and using 25% CPU. As it approaches one million files written, it slows to 2 or 3 files per second and saturates one core, spending little time in I/O wait. A little amateur profiling suggests that the vast majority of the time is spent in sqfs_xattr_writer_end. Once the archive finishes writing, read operations are nice and snappy even with two million inodes.

Is it immediately obvious to you what's causing this bottleneck? I've invested little time in trying to understand the code, but I do see a sort and a linear search in there that seem suspect. If you have any ideas for fixing this, I'd greatly appreciate it, as this seems to be the only thing preventing the library from scaling effectively. I would be happy to try contributing to fix this with a little guidance.

AgentD commented 3 years ago

Hi,

When recording a key-value block, the xattr writer internally resolves the key and value strings into 32 bit unique IDs using two hash tables (I actually expected those to be the limiting factor, since they use a primitive separate-chaining implementation).
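
For illustration, a minimal separate-chaining string table of roughly that shape might look like the following sketch (hypothetical stand-in code assuming FNV-1a hashing and a fixed bucket count, not the library's actual implementation; error handling omitted):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 4096

typedef struct str_bucket {
	struct str_bucket *next; /* separate chaining on collision */
	uint32_t id;             /* unique ID handed out for this string */
	char str[];              /* flexible array member holding the string */
} str_bucket_t;

typedef struct {
	str_bucket_t *buckets[NUM_BUCKETS];
	uint32_t next_id;
} str_table_t;

static uint32_t fnv1a(const char *s)
{
	uint32_t h = 0x811C9DC5;

	while (*s != '\0') {
		h ^= (unsigned char)*s++;
		h *= 0x01000193;
	}
	return h;
}

/* return the unique ID for str, inserting it on first sight */
static uint32_t str_table_get(str_table_t *t, const char *str)
{
	uint32_t idx = fnv1a(str) % NUM_BUCKETS;
	str_bucket_t *b;

	for (b = t->buckets[idx]; b != NULL; b = b->next) {
		if (strcmp(b->str, str) == 0)
			return b->id;
	}

	b = malloc(sizeof(*b) + strlen(str) + 1);
	strcpy(b->str, str);
	b->id = t->next_id++;
	b->next = t->buckets[idx];
	t->buckets[idx] = b;
	return b->id;
}
```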

The 32 bit key and value IDs are combined into a 64 bit integer, so each block of key-value pairs is internally just an array of 64 bit integers. These integers are appended to a dynamically growing array that holds all recorded key+value pairs; each unique block is represented by a struct pointing into that array, and those structs are stored in a linked list.

The sqfs_xattr_writer_end function generates a 32 bit unique ID for the currently recorded block. It first sorts the block, then iterates linearly over the linked list of all previous blocks and compares them. If it finds a match, it uses the index of the existing block (and discards the recorded integers), otherwise it appends the new block to the list and generates a new ID.
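
As a rough sketch of the layout and the dedup step just described (simplified stand-in types and names, not the actual library code; error handling omitted):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* key ID in the upper 32 bits, value ID in the lower 32 bits */
typedef uint64_t kv_pair_t;

/* describes one unique block as a slice of the global pair array */
typedef struct kv_block_desc {
	struct kv_block_desc *next; /* linked list of unique blocks */
	uint32_t index;             /* the block's unique 32 bit ID */
	size_t start;               /* offset into the pair array */
	size_t count;               /* number of pairs in the block */
} kv_block_desc_t;

typedef struct {
	kv_pair_t *kv_pairs;        /* growing array of all recorded pairs */
	size_t num_pairs;
	size_t max_pairs;
	kv_block_desc_t *blocks;    /* scanned linearly on every dedup */
	uint32_t num_blocks;        /* doubles as the next free block ID */
} xattr_writer_state_t;

static int compare_pair(const void *a, const void *b)
{
	kv_pair_t lhs = *(const kv_pair_t *)a;
	kv_pair_t rhs = *(const kv_pair_t *)b;

	return (lhs > rhs) - (lhs < rhs);
}

/* finish the block recorded at kv_pairs[start .. start + count)
 * and return its unique 32 bit ID */
static uint32_t finish_block(xattr_writer_state_t *st,
                             size_t start, size_t count)
{
	kv_block_desc_t *blk, *last = NULL;

	/* sort so identical blocks compare equal regardless of the
	 * order in which their entries were recorded */
	qsort(st->kv_pairs + start, count,
	      sizeof(st->kv_pairs[0]), compare_pair);

	/* linear scan over every block recorded so far; this is the
	 * per-file cost that grows with the size of the archive */
	for (blk = st->blocks; blk != NULL; last = blk, blk = blk->next) {
		if (blk->count == count &&
		    memcmp(st->kv_pairs + blk->start, st->kv_pairs + start,
		           count * sizeof(st->kv_pairs[0])) == 0) {
			st->num_pairs = start; /* discard the duplicate pairs */
			return blk->index;
		}
	}

	/* no match: append a new descriptor and hand out a fresh ID */
	blk = calloc(1, sizeof(*blk));
	blk->start = start;
	blk->count = count;
	blk->index = st->num_blocks++;

	if (last != NULL)
		last->next = blk;
	else
		st->blocks = blk;

	return blk->index;
}
```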

If sqfs_xattr_writer_end takes most of the time, I guess the list is very long, which means you have a lot of unique xattr key/value blocks?

showermat commented 3 years ago

Thanks for that explanation. Yes, the approximate structure of my data set is that each file has four xattrs, two of which are massively duplicated and two of which are essentially unique. It is never the case that two inodes have exactly the same xattrs, so if I understand correctly, this linear scan is always fruitless and the linked list grows to about the same size as the inode count -- millions.

At least in theory, it sounds like a tree structure could be used to track the xattr blocks in a way that supports faster lookup. I'm not sure how hard an improvement like that would be in practice. A less satisfying alternative would be to allow the user to specify a flag to disable deduplication of xattr blocks.
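
For what it's worth, a total order over the sorted blocks would be enough to key a balanced tree and turn the linear scan into an O(log n) lookup. A hypothetical comparator, building on the stand-in types sketched earlier in the thread:

```c
/* Hypothetical comparator: shorter blocks sort first, ties broken by
 * lexicographic comparison of the (already sorted) pair arrays. */
static int block_compare(const xattr_writer_state_t *st,
                         const kv_block_desc_t *a,
                         const kv_block_desc_t *b)
{
	if (a->count != b->count)
		return a->count < b->count ? -1 : 1;

	return memcmp(st->kv_pairs + a->start, st->kv_pairs + b->start,
	              a->count * sizeof(st->kv_pairs[0]));
}
```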


AgentD commented 3 years ago

Hi @showermat ,

sorry for the delay. I got a bit stuck fixing another issue for the 1.0.4 release and unfortunately ran a bit short on spare time as well. I hope to do a 1.1.0 release of the current master branch soon-ish.

The current master now contains two major improvements for the xattr writer:

- Key/value pairs are now deduplicated through a red-black tree that maps each 64 bit pair to its index, instead of a linear scan over the pair array.
- Xattr blocks are now deduplicated through a hash table lookup instead of the linear scan over the linked list of blocks.

I also added a small benchmark program to check the effect of my changes. Adding 1M unique blocks with 4 entries each takes ~6.2 seconds total on my end. I aborted the same benchmark on the old code after it had run for over ten minutes.
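
For reference, a driver loop in the spirit of that benchmark, written against the stand-in structures sketched earlier rather than the real API (the actual benchmark program lives in the repository):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* grow-and-append helper for the stand-in pair array;
 * error handling omitted */
static void append_pair(xattr_writer_state_t *st, kv_pair_t pair)
{
	if (st->num_pairs == st->max_pairs) {
		st->max_pairs = st->max_pairs ? st->max_pairs * 2 : 1024;
		st->kv_pairs = realloc(st->kv_pairs,
		                       st->max_pairs * sizeof(*st->kv_pairs));
	}
	st->kv_pairs[st->num_pairs++] = pair;
}

int main(void)
{
	xattr_writer_state_t st = { 0 };
	clock_t begin = clock();

	/* 1M unique blocks with 4 entries each; the pair values are
	 * chosen so that no two blocks ever compare equal */
	for (uint64_t i = 0; i < 1000000; ++i) {
		size_t start = st.num_pairs;

		for (uint64_t j = 0; j < 4; ++j)
			append_pair(&st, ((kv_pair_t)i << 2) | j);

		finish_block(&st, start, 4);
	}

	printf("%.1f s\n", (double)(clock() - begin) / CLOCKS_PER_SEC);
	return 0;
}
```

Since every block is unique, the linear scan in the old design never finds a match and the run time grows quadratically with the block count; that is exactly the behaviour the hash table lookup gets rid of.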

Perf analysis on the new code showed that ~25% of the time is spent in hash-table lookups and ~6% in rb-tree lookups. The rest is scattered across a number of libc functions and various housekeeping functions of the hash table.

showermat commented 3 years ago

No worries! Thanks for taking the time to look into this. I actually made a small patch that skipped the list scan and just always appended to the end, which seemed to work all right. Now I'm running into a different problem: the directory entries in the tree I'm trying to write don't all fit in the available memory on the laptop I'm using for this project. Once I find a way around that, I will definitely take these changes for a spin. It looks like a great improvement.