AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

mksquashfs vs gensquashfs behavior difference with directories that contain more than 256 entries (but less than 65534) #112

Closed errge closed 3 months ago

errge commented 1 year ago

I'm not sure whether this is a bug or not, but during some sysadmin work, while I was carefully migrating things from mksquashfs to gensquashfs (and sqfsdiff'ing to check that I hadn't screwed up), I noticed that the two tools behave differently for directories with more than 256 entries.

This is easy to see in the source code too: the old-school mksquashfs creates LDIR directories starting at 257 entries, while gensquashfs only does so from (0xFFFF - 2) = 65533 entries.

Now, I'm not an expert on mksquashfs, so I don't know which behavior is correct; that's why I created this ticket to ask for information. As far as I can tell from the format specification, a non-extended directory should have at most 256 entries, but the kernel and all the tools work fine with more, so I might be misreading the spec.
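For reference, the distinction is visible in the on-disk inode layouts. Trimmed from the Linux kernel's fs/squashfs/squashfs_fs.h (leading common fields omitted, __le16/__le32 written as plain fixed-width types here), the basic directory inode stores the listing size in a 16-bit field while the extended one uses 32 bits, so the format itself does not appear to impose a 256-entry cap:

#include <stdint.h>

/* Trimmed from the Linux kernel's fs/squashfs/squashfs_fs.h; leading
 * common fields (mode, uid, mtime, ...) omitted for brevity. */
struct squashfs_dir_inode_header {
        /* ... common fields ... */
        uint16_t file_size;     /* listing size: only a 16 bit field */
        uint16_t offset;
        uint32_t parent_inode;
};

struct squashfs_ldir_inode_header {
        /* ... common fields ... */
        uint32_t file_size;     /* listing size: 32 bits */
        uint32_t start_block;
        uint32_t parent_inode;
        uint16_t i_count;       /* number of directory index entries */
        uint16_t offset;
        uint32_t xattr;
};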

So my question is: if we want the two tools to behave the same, should we change gensquashfs to limit plain directories to 256 entries, or change mksquashfs to raise its limit to the one gensquashfs uses?

If the answer is mksquashfs, the patch to that repo would look like this:

diff --git a/squashfs-tools/mksquashfs.c b/squashfs-tools/mksquashfs.c
index b3dd778..c887142 100644
--- a/squashfs-tools/mksquashfs.c
+++ b/squashfs-tools/mksquashfs.c
@@ -4293,7 +4293,7 @@ static void dir_scan6(struct dir_info *dir)
                        dir_scan6(dir_ent->dir);
        }

-       if((dir->count < 257 && byte_count < SQUASHFS_METADATA_SIZE))
+       if((dir->count <= (0xFFFF - 3) && byte_count < SQUASHFS_METADATA_SIZE))
                dir->dir_is_ldir = FALSE;
 }

Can you orient me regarding the byte_count < SQUASHFS_METADATA_SIZE part? It makes mksquashfs switch to LDIR whenever the directory listing is more than 8192 bytes long, and I didn't find a similar limit in the gensquashfs source code. If I send a pull request to the mksquashfs maintainers, should I drop this part from my patch, or raise it above 8192 bytes (SQUASHFS_METADATA_SIZE)?
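Restating the combined condition from the diff above as a hypothetical helper (the helper name is made up; the constant and the entry count come from mksquashfs.c):

#include <stdbool.h>

#define SQUASHFS_METADATA_SIZE 8192  /* uncompressed metadata block size */

/* A directory stays "basic" only if it has at most 256 entries AND its
 * encoded listing fits into a single metadata block; otherwise
 * mksquashfs switches to the extended (LDIR) encoding with an index. */
static bool stays_basic_dir(unsigned int count, unsigned int byte_count)
{
        return count < 257 && byte_count < SQUASHFS_METADATA_SIZE;
}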

Also, can you point out why the limit was raised from 8 bits to 16 bits for the entry count of a directory listing? Is this a change from squashfs3 to squashfs4 that the old mksquashfs simply hasn't caught up with yet? If so, is it documented in a changelog somewhere? As far back as I could go in the git history of gensquashfs, it was always 0xFFFF, even on 2019-05-01, when the project was created.

Sorry for spamming you for information, but I researched this for a couple of hours, and I think I've reached the point where it makes sense to steal a little time from someone who isn't completely new to this.

errge commented 1 year ago

On a related note: mksquashfs automatically links hard links together when run on a directory (without a description file); in gensquashfs this feature is not implemented.

Is this considered a valid feature request (automatic detection of hard links with -D), and should I open a ticket for it, or has it been decided that this will not be implemented in the -ng toolset?
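For context, such hard link detection during a directory scan typically boils down to remembering the (st_dev, st_ino) pairs of regular files with st_nlink > 1. A minimal sketch, with all names hypothetical and taken from neither tool:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Regular files sharing one (st_dev, st_ino) pair are the same inode
 * on disk and should become a single SquashFS inode with nlink > 1. */
struct seen_inode {
        dev_t dev;
        ino_t ino;
        char first_path[4096];
        struct seen_inode *next;
};

static struct seen_inode *seen = NULL;

/* Returns the path the inode was first seen under, or NULL if it is new. */
static const char *check_hard_link(const char *path)
{
        struct stat sb;
        struct seen_inode *it;

        if (lstat(path, &sb) != 0 || !S_ISREG(sb.st_mode) || sb.st_nlink < 2)
                return NULL;

        for (it = seen; it != NULL; it = it->next) {
                if (it->dev == sb.st_dev && it->ino == sb.st_ino)
                        return it->first_path; /* hard link to an earlier file */
        }

        it = calloc(1, sizeof(*it));
        if (it == NULL)
                return NULL;
        it->dev = sb.st_dev;
        it->ino = sb.st_ino;
        snprintf(it->first_path, sizeof(it->first_path), "%s", path);
        it->next = seen;
        seen = it;
        return NULL;
}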

AgentD commented 1 year ago

Hi,

both directory indices and hard link detection have so far been considered nice-to-have features, and other things took precedence.

The basic infrastructure for directory indices was already there (emit a new record header every time a metadata block boundary is crossed, collect position/name data for the index), but it was unused.
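Each index record is small; as defined in the Linux kernel's fs/squashfs/squashfs_fs.h, it marks where a metadata block boundary falls within the uncompressed listing (field comments added here for illustration):

struct squashfs_dir_index {
        __le32          index;          /* byte offset into the uncompressed listing */
        __le32          start_block;    /* metadata block that starts there */
        __le32          size;           /* length of the following name, minus one */
        unsigned char   name[];         /* first entry name in that block */
};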

The directory writer in libsquashfs was designed to use the smallest possible encoding for inodes and to pick extended ones only when the data cannot be encoded otherwise (e.g. directory > 64k, file offset > 4G, inode has xattrs).
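A hypothetical restatement of that rule (made-up names, not the libsquashfs API), with the thresholds matching the field widths of the basic inode variants:

#include <stdbool.h>
#include <stdint.h>

/* Pick the basic inode variant unless something only the extended
 * variant can represent is needed. */
static bool dir_needs_extended(uint32_t listing_bytes, bool has_xattr)
{
        /* the basic directory inode has a 16 bit size field */
        return listing_bytes > 0xFFFFu || has_xattr;
}

static bool file_needs_extended(uint64_t data_start, bool has_xattr)
{
        /* the basic file inode stores the data start in 32 bits */
        return data_start > 0xFFFFFFFFull || has_xattr;
}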

Commit a90dfb17 makes some minor tweaks to use the same 256-entry threshold as mksquashfs and to actually keep the index. I did some rudimentary testing with live DVD and router images I have sitting around.

With hard link detection currently still missing, the worst case right now is duplicated inodes and time wasted on compression. The de-duplication step catches files that have identical contents and merges them, so for hard linked files there should currently be only one copy of the data, with multiple inodes pointing to it.
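A toy sketch of that de-duplication lookup (all names made up, not the actual libsquashfs code): remember the size and checksum of every file already written; a match makes a new file a reuse candidate, to be confirmed by comparing the actual bytes before pointing the new inode at the existing data blocks.

#include <stddef.h>
#include <stdint.h>

struct written_file {
        uint64_t size;
        uint64_t checksum;
        uint64_t data_start;    /* where the data blocks were written */
        struct written_file *next;
};

/* Returns a previously written file with matching size and checksum,
 * or NULL; the caller still has to verify the contents byte by byte. */
static const struct written_file *
find_duplicate(const struct written_file *list, uint64_t size,
               uint64_t checksum)
{
        for (; list != NULL; list = list->next) {
                if (list->size == size && list->checksum == checksum)
                        return list; /* candidate: confirm with byte compare */
        }
        return NULL;
}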