folbricht / desync

Alternative casync implementation
BSD 3-Clause "New" or "Revised" License

`untar` is missing seed support #242

Open RyuzakiKK opened 12 months ago

RyuzakiKK commented 12 months ago

In our use case, we have an A/B setup where we apply block based system updates with desync extract and use the active partition as a seed. For very small updates, using a block based .caibx introduces a lot of overhead. So we were evaluating to produce file based .caidx updates in those circumstances, as an alternative to speed up the updates when just a handful of files change.

However I noticed that desync untar doesn't have any seed support at the moment.

Do you have any suggestion on how a seed for untar should be designed? If we produce a .caidx of a local directory, can we grab its corresponding chunks from the filesystem without having to tar the entire local directory first?

FTR I can see that Casync handles .caidx with extract, and if you provide a path as a seed, it starts its process by seeding one by one all the files in the provided directory.

folbricht commented 12 months ago

I'm not sure it's possible or useful to have seeds for untar. Are you thinking about seeds to restore the full archive on the target and then untar? Or the untar stage itself? The former is possible already if you split up the process into the extract and untar stages. The latter may not be possible.

Wrt the untar stage, the issue is that with seeds you can check if a chunk is present or not. But there aren't any chunks available if the target is a directory. A caidx is just a caibx of an archive (catar), and there's no concept of chunks of the target. A chunk inside an archive can span multiple files, so you couldn't really say whether a file has changed until it's unpacked. If the file inside the archive is large, there's a chance that chunking the target files individually first may give you some overlap of chunk IDs, but it's not going to work for smaller files.

If you have the disk space, can you maintain the caidx+catar file on the target with make/extract, then use untar (without -i) to restore the directory? This will let you use a seed. Or are you trying to save write I/O?

RyuzakiKK commented 12 months ago

Our primary goals are to reduce the download size and speed up the whole update process. Reducing I/O is nice to have, but not a primary concern at the moment.

If we produce two system images where just a few files changed, a .caibx will still require downloading a lot of chunks from the store because, for example, the metadata of all the files changed too.

With .caidx (and --no-time) I noticed that we were instead able to significantly reduce the number of required chunks, especially for small system updates. Whether that translates to faster updates or not is still unclear. I'll run some benchmarks to confirm.

> Wrt the untar stage, the issue is that with seeds you can check if a chunk is present or not. But there aren't any chunks available if the target is a directory. A caidx is just a caibx of an archive (catar), and there's no concept of chunks of the target. A chunk inside an archive can span multiple files, so you couldn't really say whether a file has changed until it's unpacked. If the file inside the archive is large, there's a chance that chunking the target files individually first may give you some overlap of chunk IDs, but it's not going to work for smaller files.
>
> If you have the disk space, can you maintain the caidx+catar file on the target with make/extract, then use untar (without -i) to restore the directory? This will let you use a seed. Or are you trying to save write I/O?

Is this basically what Casync is doing? But instead of writing the catar file on disk they keep it in memory? I can check Casync implementation to see exactly how they are handling the seed part there.

I'll also try to create a catar as a first step and benchmark how long that takes. If creating a catar and extracting a caidx doesn't bring any real speed advantage, that might not be worth it for us. In that case we'll probably just stick with caibx and evaluate how to further reduce image differences.

folbricht commented 12 months ago

Sounds like what you need is:

  1. Tar the files up on the source system with desync tar --no-time. You could use -i or use a two-step process with tar + make (faster)
  2. On the target, you build the catar file with `desync extract ... file.caidx` - The transfer should be quick since you can use seeds or caches
  3. On the target you extract the archive with desync untar (no -i)

RyuzakiKK commented 12 months ago

Given that on the target we just have the extracted archive files around, unless I'm missing something, the actual steps will end up being:

On the source:

  1. desync tar --no-time

On the target:

  1. desync tar --no-time, to create a usable seed from the files in the currently active partition
  2. desync extract
  3. desync untar
  4. Delete the generated seed catar and the downloaded catar update

And in order to successfully apply that we would need free space of at least 2x the full update size. That's a substantial tradeoff compared to the caibx where we don't need any reserved free space at all.

To skip step 2 we would need to keep the seed catar around at all times, but permanently using all that disk space is probably unfeasible.

The more I look at this, the more I suspect caidx will sadly not help us speed up small system updates.

folbricht commented 12 months ago

Correct, if you can't keep the catar on disk on the target it's going to require some extra time, either for processing, or for transferring.

Something that's not yet implemented which could help here is to introduce something like a "DirectorySeed" where desync scans a directory and chunks all large files into a kind of index without storing the chunks. The seed index would contain a map of chunkid => {filepath, position in the file}. Then the untar would be able to use chunks directly out of the target files. Building this seed is going to require processing the target files again, so it'll add some time, though that could be faster than transferring the chunks, depending on the situation.

I suppose this index could even be persisted to disk for the next update.
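
For illustration, a minimal Go sketch of what such a directory-seed index might look like. All type and field names here are made up; only the `ChunkID` type is assumed to come from the desync package:

```go
package seedindex

import "github.com/folbricht/desync"

// SeedLocation points at where a chunk's data can be read from an existing
// file in the scanned directory. Hypothetical type, not part of desync.
type SeedLocation struct {
	Path   string // file inside the seed directory
	Offset int64  // byte offset of the chunk data within that file
	Length int64  // chunk size in bytes
}

// DirectorySeed is the result of scanning and chunking the large files of a
// directory without storing the chunk data itself: a lookup from chunk ID to
// the place(s) that data already exists on disk. It could be serialized and
// kept around for the next update.
type DirectorySeed struct {
	Root   string
	Chunks map[desync.ChunkID][]SeedLocation
}
```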

folbricht commented 12 months ago

Perhaps this seed index could be produced as part of the untar operation, to then help find chunks during future untar.

RyuzakiKK commented 12 months ago

Yes, that could definitely help.

If we decide to actually use untar for small updates, I can probably try to implement something like that.

RyuzakiKK commented 12 months ago

> Given that on the target we just have the extracted archive files around, unless I'm missing something, the actual steps will end up being:
>
> On the source:
>
>   1. desync tar --no-time
>
> On the target:
>
>   1. desync tar --no-time, to create a usable seed from the files in the currently active partition
>   2. desync extract
>   3. desync untar
>   4. Delete the generated seed catar and the downloaded catar update
>
> And in order to successfully apply that we would need free space of at least 2x the full update size. That's a substantial tradeoff compared to the caibx where we don't need any reserved free space at all.

I guess that another alternative here would be to merge desync extract and desync untar into a single command. In this way we could unpack the catar on the fly without having to write it to disk first.

More or less the same concept as desync untar -i, which starts extracting the catar as it assembles it.
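
A rough sketch of how such a combined command could avoid the on-disk catar by piping the assembled stream straight into the unpacker. The helpers `assembleCatar` and `unpackCatar` are placeholders, not existing desync APIs:

```go
package streamuntar

import (
	"context"
	"io"
)

// Hypothetical helpers; these are NOT existing desync functions, just stand-ins
// for "assemble the catar from a caidx" and "unpack a catar stream".
func assembleCatar(ctx context.Context, caidx string, w io.Writer) error { return nil }
func unpackCatar(ctx context.Context, r io.Reader, targetDir string) error { return nil }

// extractUntar pipes the assembled archive straight into the unpacker, so the
// catar never has to be written to disk.
func extractUntar(ctx context.Context, caidx, targetDir string) error {
	pr, pw := io.Pipe()

	go func() {
		// Propagate any assembly error to the reader side and unblock it.
		pw.CloseWithError(assembleCatar(ctx, caidx, pw))
	}()

	return unpackCatar(ctx, pr, targetDir)
}
```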

lights0123 commented 1 month ago

I'm interested in implementing this for our own IoT updating system. Any tips on where to start? My ideal workflow would be:

Instead of using A/B partitions, each version is a btrfs subvolume so that updates can reflink the majority of shared files—keeping many past versions is important for our case, and this allows that without extreme disk usage.

folbricht commented 1 month ago

What does "image" in this refer to? Normally I think of an image as a single file, but then tar/untar wouldn't be needed so perhaps in this case "image" is a directory of multiple files?

When dealing with untar and seeds, we'd first need to define what a seed for a directory actually looks like. Perhaps another directory of similar files? But during untar while reading the stream, how would you know that the file being unpacked matches one already on disk? You'd have to fully unpack it first and then compare, which doesn't really help with being fast or efficient.

lights0123 commented 1 month ago

> so perhaps in this case "image" is a directory of multiple files

Yeah, I didn't word that the best. It's a Linux root filesystem that is packaged as a GPT disk image for initial factory deployment, but I'd like future updates to happen at the filesystem level.

> When dealing with untar and seeds, we'd first need to define what a seed for a directory actually looks like. Perhaps another directory of similar files? But during untar while reading the stream, how would you know that the file being unpacked matches one already on disk? You'd have to fully unpack it first and then compare, which doesn't really help with being fast or efficient.

I was hoping to use the previously deployed filesystem's actual files as the seed. I should clarify that I deploy this read-only filesystem, then make a writable btrfs snapshot of it. This leaves the originally deployed filesystem intact on-disk, while allowing efficient small changes online. To my understanding, `casync extract --sync=/mnt/btrfs/read-only-original-fs` would do what I want, just much slower than necessary.

I'm OK with depending on this original read-only directory being unmodified (thus making the created index completely valid). btrfs requires explicitly changing a subvolume to read-write, and even then it updates a UUID that I can verify myself before attempting an update. This logic is used by btrfs's built-in send & receive feature to ensure its seed data is intact without checking every file.

Alternatively, I could store the output of desync tar in the filesystem. I hope the chunk store would be able to reflink most of the actual files on disk, so it only takes up dirent space.

lights0123 commented 4 weeks ago

Though I imagine for other cases where integrity can't be trusted (like the Steam Deck in the original question if read-write was enabled), files in the already-deployed directory could be verified right before they're copied/reflinked.

folbricht commented 4 weeks ago

What's the primary goal in this use-case? Is it to reflink as many files in a new image/directory as possible to be efficient with disk space? Or is that secondary with performance being more important?

If your images were single large files (which could then be mounted), the desync extract command would already do this. Doing this with untar would be very difficult. The main issue is that reflinking happens on files, but chunks in a catar can span file boundaries.

I can think of a couple of ways this may be possible, but it's going to be either quite complex to implement, or slow.

  1. One could for example first read the old catar and build an index of which chunk is contained in which file(s) and at what offsets. Then while running untar on the new catar, find out which chunk is producing the data stream for each file, and if that chunk exists in the old image, reflink it from there instead.

  2. Or, for a simpler solution that's not as good: when writing a file during untar, just see if a file of the same name exists in the old dir, open that old file and compare the streams. Everything up to the first mismatch could be reflinked (see the sketch after this list). This method is reasonably fast but doesn't take advantage of the chunking algorithm at all. On the plus side, it wouldn't even require having the old catar available.
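
To make option 2 concrete, here is a rough Go sketch under the assumption that a block-aligned FICLONERANGE reflink is acceptable. The helper `cloneMatchingPrefix` and its signature are hypothetical, not part of desync, and both files must live on the same reflink-capable filesystem (e.g. btrfs, XFS):

```go
package reflinkseed

import (
	"bytes"
	"io"
	"os"

	"golang.org/x/sys/unix"
)

const blockSize = 4096 // FICLONERANGE wants block-aligned, block-sized ranges

// cloneMatchingPrefix compares the old file with the data about to be written
// and reflinks the matching, block-aligned prefix into dst. It returns how
// many bytes were cloned. Hypothetical helper for illustration only.
func cloneMatchingPrefix(dst *os.File, oldPath string, newData io.Reader) (int64, error) {
	old, err := os.Open(oldPath)
	if err != nil {
		return 0, nil // no old file of that name, nothing to seed from
	}
	defer old.Close()

	var matched int64
	oldBuf := make([]byte, blockSize)
	newBuf := make([]byte, blockSize)
	for {
		n1, _ := io.ReadFull(old, oldBuf)
		n2, _ := io.ReadFull(newData, newBuf)
		if n1 < blockSize || n2 < blockSize || !bytes.Equal(oldBuf, newBuf) {
			// NOTE: a real implementation must keep the already-consumed,
			// non-matching tail of newData and write it out after the clone.
			break
		}
		matched += blockSize
	}
	if matched == 0 {
		return 0, nil
	}
	// Share the identical prefix with the old file instead of copying it.
	err = unix.IoctlFileCloneRange(int(dst.Fd()), &unix.FileCloneRange{
		Src_fd:      int64(old.Fd()),
		Src_offset:  0,
		Src_length:  uint64(matched),
		Dest_offset: 0,
	})
	return matched, err
}
```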

lights0123 commented 4 weeks ago

Disk space and bandwidth are my primary concerns: if I update a 4kB config file or 1MB binary on the 10GB filesystem, I would like to only use a few MB of disk space and bandwidth to update it. This means that I would not want to download a full catar, but rather a caidx and fill most of the chunks with stuff already on-disk.

> The main issue is that reflinking happens on files, but chunks in a catar can span file boundaries.

There's also FICLONERANGE for block-aligned (always 4KiB?) sharing.

> One could for example first read the old catar and build an index of which chunk is contained in which file(s) and at what offsets. Then while running untar on the new catar, find out which chunk is producing the data stream for each file, and if that chunk exists in the old image, reflink it from there instead.

To preface this, I am unfamiliar with the layout of the catar format—this is based on my assumption that it is similar to tar.

I was imagining a format that stays on the target that says, for example, "chunk ab3681... is bytes 0x0000 to 0xFFFF of /abc" or "chunk bf5311... is the header (in tar terminology) of /a, then bytes 0x0000 to 0x1000 (the entirety) of /a, then the header of /b, then bytes 0x0000 to 0x1000 (the entirety) of /b". When untarring the caidx, if chunk bf5311 is requested, files /a and /b could just be reflinked to the destination directory.

(slightly more efficient would be to start with a read-write btrfs snapshot of the seed data and update it because btrfs snapshot is much faster than reflinking every file, but I don't need this level of efficiency and I imagine it would be hard to maintain consistency since extract/untar would have to delete files)
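
A rough Go sketch of the per-chunk mapping described above. All names here are hypothetical; only the `ChunkID` type is assumed to come from the desync package:

```go
package seedmap

import "github.com/folbricht/desync"

// ChunkRegion is one contiguous piece of a chunk as it appears in the old
// tree: either catar metadata (entry headers) or a byte range of a real file.
type ChunkRegion struct {
	Path      string // file the data belongs to, e.g. "/a"; empty for pure metadata
	Offset    int64  // byte offset within that file
	Length    int64
	WholeFile bool // region covers the entire file, so the file can simply be reflinked
	Metadata  bool // region is catar header/footer bytes rather than file contents
}

// SeedMap records, for every chunk of the old image, which regions of which
// on-disk files (plus interleaved metadata) produce exactly that chunk.
type SeedMap map[desync.ChunkID][]ChunkRegion

// CanReflinkWholeFiles reports whether a requested chunk can be satisfied by
// reflinking complete files and re-emitting only the small metadata bytes,
// i.e. the "chunk bf5311 => reflink /a and /b" case from the comment above.
func (m SeedMap) CanReflinkWholeFiles(id desync.ChunkID) bool {
	regions, ok := m[id]
	if !ok {
		return false
	}
	for _, r := range regions {
		if !r.Metadata && !r.WholeFile {
			return false
		}
	}
	return true
}
```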

folbricht commented 4 weeks ago

The catar format is somewhat similar to regular tar but it has a concept of directories which have header and footer. But if you just consider files, it should work the same in this use-case.

What you're describing is quite similar to 1) above. It could work. Unfortunately it requires the introduction of a new type of directory-index file. We'd need a command to create it, and one to verify it as well since verifying that the seed directory-index matches the seed dir is likely not feasible during the untar stage itself. Best to have the ability to do that separately.

Once you have the directory-index (we need to find a good name for this file) and want to untar a new stream, you'll need to make the stream decoder aware of what chunk is used to produce what section of the stream. If you look at untar.go today, there are some goroutines that fetch the chunks, and others that assemble the stream from them. These layers are currently separate.
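
For illustration, one way the chunk-fetching side could consult such a directory-index before going to the store. Everything here (interfaces, types, names) is illustrative only and does not reflect the existing untar.go structure or the real desync store interfaces:

```go
package seeduntar

import (
	"fmt"
	"os"

	"github.com/folbricht/desync"
)

// chunkSource is a minimal, illustrative view of "something that can return a
// chunk's data"; the real desync store interfaces look different.
type chunkSource interface {
	GetChunk(id desync.ChunkID) ([]byte, error)
}

// SeedLocation mirrors the hypothetical directory-index entry sketched earlier.
type SeedLocation struct {
	Path   string
	Offset int64
	Length int64
}

// seedAwareSource lets the chunk-fetching side of untar try the local
// directory-index first and only fall back to the remote store on a miss.
type seedAwareSource struct {
	index  map[desync.ChunkID]SeedLocation
	remote chunkSource
}

func (s *seedAwareSource) GetChunk(id desync.ChunkID) ([]byte, error) {
	loc, ok := s.index[id]
	if !ok {
		return s.remote.GetChunk(id) // chunk not present in the seed
	}
	f, err := os.Open(loc.Path)
	if err != nil {
		return s.remote.GetChunk(id) // seed file missing, fall back to the store
	}
	defer f.Close()

	buf := make([]byte, loc.Length)
	if _, err := f.ReadAt(buf, loc.Offset); err != nil {
		return nil, fmt.Errorf("reading seed chunk %x from %s: %w", id, loc.Path, err)
	}
	// A real implementation should re-hash buf and check it still matches id,
	// in case the seed file changed since the index was built.
	return buf, nil
}
```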