AgentD / squashfs-tools-ng

A new set of tools and libraries for working with SquashFS images

rdsquashfs feature suggestion: hardlink duplicate files on extract #73

Open · Zaxim opened this issue 3 years ago

Zaxim commented 3 years ago

tl;dr: I have a squashfs file with millions of duplicated files in it. It would be awesome to be able to extract the image and hardlink (or reflink) the duplicated files.

My specific use case is an abuse of the intended functionality of squashfs: I have been using squashfs as a directory archival tool to consolidate dozens of Apple Time Machine backup folders [1]. Time Machine uses directory hardlinks to snapshot the entire filesystem while preserving space, but I have Time Machine backups from different drives and systems that don't share those hardlinks yet contain very similar files. mksquashfs has been the only tool that scales to the number of files and hardlinks I'm dealing with and properly deduplicates as I append directories to my single squashfs file.

I can always mount the squashfs image and browse to the specific files/folders I want to retrieve, but I was thinking it would be cool to be able to extract the image and use the deduplication table to create the duplicated files on disk as hardlinks, or as reflinks on CoW filesystems such as Btrfs. I'm not sure how hard this would be to implement in rdsquashfs (a rough sketch of the hardlink case follows below).

[1] There are pitfalls with using mksquashfs on Apple Time Machine folders. Namely, squashfs does not support all the crazy xattr stuff that macOS applies to files, so some things don't restore completely, but as a file archive, it works fine.
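For the plain hardlink case, the core operation on extraction would be something like the sketch below. This is only a minimal, hypothetical illustration, not rdsquashfs code: the helper name is made up, and it assumes the extractor already knows from the image's deduplication information that `dup_path` has the same data as an already-extracted `first_copy`.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: instead of unpacking the duplicate's data again,
 * create a hardlink pointing at the copy that was already extracted. */
static int extract_as_hardlink(const char *first_copy, const char *dup_path)
{
	if (link(first_copy, dup_path) != 0) {
		perror("link");
		return -1;
	}
	return 0;
}
```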

AgentD commented 3 years ago

Only unpacking duplicated files once and creating copy-on-write reflinks sounds like a very interesting idea.

On Linux this would be done with the FICLONE, FICLONERANGE or FIDEDUPERANGE ioctls. On macOS and the BSDs I have not found an explicit way to do this yet; I think it can be done implicitly through the fcopyfile() function on macOS.
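For reference, a minimal sketch of the Linux reflink path using the FICLONE ioctl. The helper name and error handling are illustrative and not part of the library; the call only succeeds on CoW filesystems such as Btrfs or XFS with reflink support.

```c
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: make dst share src's data extents via FICLONE.
 * On filesystems without reflink support the ioctl fails with
 * EOPNOTSUPP or EXDEV. */
static int reflink_file(const char *src, const char *dst)
{
	int dfd, ret = -1;
	int sfd = open(src, O_RDONLY);

	if (sfd < 0) {
		perror(src);
		return -1;
	}

	dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (dfd < 0) {
		perror(dst);
		close(sfd);
		return -1;
	}

	if (ioctl(dfd, FICLONE, sfd) == 0)
		ret = 0;
	else
		perror("FICLONE");

	close(dfd);
	close(sfd);
	return ret;
}
```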