Avoid overwriting files that are unchanged?

polarathene commented 2 months ago

Is it not possible for desync to avoid modifying files that have no difference?

There are no blocks/chunks to update, yet the mtime is modified each time I run the untar command. (desync untar appears to be the equivalent of casync extract for a directory tree?)

UPDATE: See the follow-up comment. At a glance, I think desync could diff two index (caidx) files (before/after) to filter out files with no change in their content digest (and perhaps the other metadata attributes), since desync mtree / desync info can already derive this information from a common store dir and the caidx files?

Context

I am new to casync / desync. There are a lot of options/commands and jargon to ingest, so perhaps I've misunderstood something. I've looked over existing issues and this may be a duplicate of https://github.com/folbricht/desync/issues/242 or just overlap with it.

In my scenario, I wanted to sync changes from the archive (src) to the target (dest) based on file content (not concerned with file metadata changes at this point). The impression I had was desync could effectively detect only what needs to be updated from an index/store. The linked issue suggests this is a problem with untar and needing support for providing a seed.

Wrt the untar stage, the issue is that with seeds you can check if a chunk is present or not. But there aren't any chunks available if the target is a directory. A caidx is just a caibx of an archive (catar), and there's no concept of chunks of the target.

A chunk inside an archive can span over multiple files so you couldn't really say a file is changed or not until it's unpacked.
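To illustrate why a chunk cannot be attributed to a single file, here is a toy sketch in Python. It assumes fixed-size chunks over file contents that are simply concatenated, which is a simplification: casync/desync use content-defined chunking and the catar stream also contains metadata entries. The function name and the file names/sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def files_per_chunk(file_sizes, chunk_size):
    """Map each chunk index to the files whose bytes fall inside that chunk.

    file_sizes: list of (name, size) in archive order. Assumed layout:
    file contents concatenated back to back, no metadata in between.
    """
    names = [name for name, _ in file_sizes]
    ends = list(accumulate(size for _, size in file_sizes))  # end offset per file
    total = ends[-1] if ends else 0
    result = {}
    for idx, start in enumerate(range(0, total, chunk_size)):
        stop = min(start + chunk_size, total)
        first = bisect_right(ends, start)      # file holding the chunk's first byte
        last = bisect_right(ends, stop - 1)    # file holding the chunk's last byte
        result[idx] = names[first:last + 1]
    return result
```

With three files of 10, 10 and 100 bytes and 64-byte chunks, the first chunk covers all three files, so knowing that this chunk changed does not say which of the files changed.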

mtime or other metadata-related changes are problematic within Docker images. Similar to the linked issue, I'm interested in updating a filesystem root with only the subset of changes from the archive (typically much smaller than the existing destination target).

When all files are modified redundantly, the new Docker layer will duplicate that file content in full, which is undesirable.

Reproduction

$ docker run --rm -it fedora:41

# Get desync:
$ curl -fsSL https://github.com/folbricht/desync/releases/download/v0.9.6/desync_0.9.6_linux_amd64.tar.gz \
  | tar -xz --no-same-owner -C /usr/local/bin desync
# Prep basic content example:
$ cd /tmp && mkdir -p src && touch src/file
# Add content if it makes a difference (24 bytes):
$ echo 'this is a quick example' > src/file

# Avoid storing mtime:
# NOTE: `casync make` supports archiving filesystem directories to castr stores, but `desync make` does not? (`desync tar` instead?)
$ mkdir store-src
$ desync tar --no-time --store store-src --index src.caidx src
$ desync untar --store store-src --index src.caidx dest
# Alternatively (without index):
# desync tar --no-time src.catar src
# desync untar src.catar dest

$ ls -li dest
total 4
317278 -rw-r--r-- 1 root root 24 Sep 11 09:00 file

# Wait a minute and try again:
$ desync untar --index --store store-src src.caidx dest
$ ls -li dest
total 4
317278 -rw-r--r-- 1 root root 24 Sep 11 09:01 file

# Inspect archive:
$ desync mtree -i -s store-src src.caidx
# Alternatively (without index):
# desync mtree src.catar

#mtree v1.0
. type=dir mode=0755 uid=0 gid=0 time=0.        0
file type=file mode=0644 uid=0 gid=0 size=24 time=0.000000000 sha512256digest=97b0fc819edb24745c11422b30476acf214a8459d888fb5dda857ee9bb195a5e

It does manage to avoid replacing the inode, unlike casync, which is an improvement I think? However, I'd rather it not modify files unnecessarily.

polarathene commented 2 months ago

While this issue seems related to https://github.com/folbricht/desync/issues/242 I think what I'm asking for is a reliable way to apply an update from the store via an index diff? (since I can have the destination indexed in advance)

Redundant writes should be preventable with information desync already has available?

Example (base vs update)

# Initial content:
$ ls -l /root-fs
total 4
-rw-r--r-- 1 root root 24 Sep 11 09:00 file
$ ls -l /root-fs-b
total 4
-rw-r--r-- 1 root root 24 Sep 11 09:00 file

# Initial store + seed index:
$ desync tar --no-time -s store -i existing.caidx /root-fs

# An update (create a new index + update store):
$ touch /root-fs-b/new-file
$ desync tar --no-time -s store -i update-b.caidx /root-fs-b

# Compare existing index vs updated index:
$ desync info --format plain --store store --seed existing.caidx update-b.caidx
Blob size: 377
Size of deduplicated chunks not in seed: 377
Size of deduplicated chunks not in seed nor cache: 377
Total chunks: 1
Unique chunks: 1
Chunks in store: 1
Chunks in seed: 0
Chunks in cache: 0
Chunks not in seed nor cache: 1
Chunk size min: 16384
Chunk size avg: 65536
Chunk size max: 262144

$ desync mtree -s store -i existing.caidx
#mtree v1.0
. type=dir mode=0755 uid=0 gid=0 time=0.        0
file type=file mode=0644 uid=0 gid=0 size=24 time=0.000000000 sha512256digest=97b0fc819edb24745c11422b30476acf214a8459d888fb5dda857ee9bb195a5e

$ desync mtree -s store -i update-b.caidx
#mtree v1.0
. type=dir mode=0755 uid=0 gid=0 time=0.        0
file type=file mode=0644 uid=0 gid=0 size=24 time=0.000000000 sha512256digest=97b0fc819edb24745c11422b30476acf214a8459d888fb5dda857ee9bb195a5e
new-file type=file mode=0644 uid=0 gid=0 size=0 time=0.000000000 sha512256digest=c672b8d1ef56ed28ab87c3622c5114069bdd3ad7b8f9737498d0c01ecef0967a

References (not useful to maintainers)

These are more for myself since I discovered them while searching issues 😅

- [`--no-time`](https://github.com/folbricht/desync/issues/124#issuecomment-535277771) (_don't store `mtime`_)
- [`desync info`](https://github.com/folbricht/desync/issues/248#issuecomment-1763314930)
- [`desync mtree`](https://github.com/folbricht/desync/issues/123#issuecomment-531575757)

Diffing between index files

With these two index files, desync would at least be able to know from the content hash digests which existing files have since been modified? Enabling the ability to filter out writing them redundantly to the destination?

# A diff between the two:
$ diff <(desync mtree -s store -i existing.caidx) <(desync mtree -s store -i update-b.caidx)
3a4
> new-file type=file mode=0644 uid=0 gid=0 size=0 time=0.000000000 sha512256digest=c672b8d1ef56ed28ab87c3622c5114069bdd3ad7b8f9737498d0c01ecef0967a

# Just the unique digests from the diff output:
$ diff <(desync mtree -s store -i existing.caidx) <(desync mtree -s store -i update-b.caidx) \
  | grep -oP 'sha512256digest=\K.+' | sort -u
c672b8d1ef56ed28ab87c3622c5114069bdd3ad7b8f9737498d0c01ecef0967a

It seems that basic file metadata is encoded for each file in the mtree command output, but that is (as desired) unrelated to the digest stored for its content. I imagine any overhead from that would be minimal when diffing between the two indexes?
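As a rough sketch of what such a filter could look like outside of desync, the following Python parses two `desync mtree` outputs and reports which paths would need writing or removing. This is not part of desync; `parse_mtree` and `changed_paths` are hypothetical helper names, and comparing only `type` and `sha512256digest` is an assumption that matches the "content only" goal described above.

```python
def parse_mtree(text):
    """Parse mtree text into {path: {keyword: value}}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the "#mtree v1.0" header
        path, *fields = line.split()
        # Drop stray tokens without '=' (e.g. the padded "time=0.        0").
        entries[path] = dict(f.split("=", 1) for f in fields if "=" in f)
    return entries

def changed_paths(old_text, new_text, keys=("type", "sha512256digest")):
    """Return (to_write, to_remove) when going from old_text to new_text.

    to_write: paths that are new, or whose `keys` differ from the old entry.
    to_remove: paths present in the old listing but absent from the new one.
    """
    old, new = parse_mtree(old_text), parse_mtree(new_text)
    to_write = sorted(
        path for path, attrs in new.items()
        if path not in old
        or any(attrs.get(k) != old[path].get(k) for k in keys)
    )
    to_remove = sorted(path for path in old if path not in new)
    return to_write, to_remove
```

Fed the two mtree listings from the example above, this reports only `new-file` as needing a write. Adding `mode`, `uid` or `gid` to `keys` would also flag metadata-only changes.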

polarathene commented 2 months ago

Unrelated to this issue, but spotted these during usage:

Typo caibx, should be caidx:

https://github.com/folbricht/desync/blob/0aef76def97e1aa148d58b8a831f7c89b41ba9c4/cmd/desync/tar.go#L45-L46

Typo sha56, should be sha256:

https://github.com/folbricht/desync/blob/0aef76def97e1aa148d58b8a831f7c89b41ba9c4/mtreefs.go#L55

folbricht commented 2 months ago

Thanks for letting me know about the typos. Fixed them.

desync untar does indeed overwrite all files since it doesn't know whether they have changed. This behavior could be changed, but first let's look at the mtime issue. Based on https://github.com/folbricht/desync/blob/master/localfs.go#L84-L92, it should set the same mtime if one is available in the archive. When you made the catar, did you use --no-time, as that would mean there's no time in the archive and files would have a new mtime after untar?

As for only overwriting files that are different, it'd be possible to unpack every file from the archive into a tempfile somewhere, then compare the content to what's on disk, and replace the old file with the tempfile if it's different. This would require extra space (as much as the size of the largest file) and would likely be slower.

However, I can think of valid use-cases for this. For example if the target is on a flash drive and the goal is to reduce wear on it. Or when the target FS is very slow, one could unpack the tempfiles into memory first and not use the slow FS.
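The tempfile approach described above could be sketched roughly as follows in Python. This is only an illustration of the idea, not how desync works today; `replace_if_different` is a hypothetical name, and the incoming file content is assumed to be available as a readable binary stream.

```python
import filecmp
import os
import shutil
import tempfile

def replace_if_different(src_stream, dest_path, chunk_size=64 * 1024):
    """Unpack src_stream to a temp file; replace dest_path only if content differs.

    Returns True if dest_path was written, False if it was left untouched.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file lives next to the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(src_stream, tmp, chunk_size)
        if os.path.isfile(dest_path) and filecmp.cmp(tmp_path, dest_path, shallow=False):
            return False  # identical bytes: keep the existing file and its mtime
        # Rename over the target. Note: this gives the path a new inode, and the
        # temp file's mode (0600 from mkstemp) would still need to be fixed up.
        os.replace(tmp_path, dest_path)
        tmp_path = None
        return True
    finally:
        if tmp_path is not None:
            os.remove(tmp_path)  # clean up when unchanged or on error
```

The extra space needed is at most the size of the largest file, as noted above. Comparing against an in-memory buffer instead of a temp file would avoid touching a slow target filesystem at the cost of RAM.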

polarathene commented 2 months ago

> When you made the catar, did you use --no-time as that would mean there's no time in the archive and files would have a new mtime after untar.

Yes, the intent was that if the file itself had not changed in content, nor the other metadata like permissions and ownership (although I don't think that's relevant to mtime, only the file content?), I would expect that the mtime is not updated.

This is for a scenario where I wanted to sync the delta but with mtime ignored (as the newer version may have already generated existing content with a new mtime but otherwise no difference). I tried to point this concern out in the "context" section I provided where a Docker image layer will duplicate storage required for a file (a new layer contains the copy) just by attributes like ownership, permissions or mtime being updated.

There are package managers that let you install your own root fs target. If after this point you wanted to add some new packages however, by overlaying that root-fs with a modified variant, the mtime update makes that layer much bigger than it needs to be. I learned about casync and then desync projects and thought that they could have worked around that, but the extraction seems to force an mtime update for files without a way to opt-out.

I suppose --no-mtime for extraction is desired, but the issue is likely more about filtering what is actually extracted/written when a diff between caidx is viable?

folbricht commented 2 months ago

I'm not able to reproduce it. Here's what I tried:

Create a source-tree:

mkdir source
echo testing >source/file.txt

Query the mtime of the file inside it and make a catar:

stat -c %y source/file.txt
desync tar archive.catar source/

The mtime is shown as 2024-09-29 14:50:19.200310065 +0200.

Then unpack the archive and display the mtime of the file:

desync untar archive.catar target
stat -c %y target/file.txt

It too shows 2024-09-29 14:50:19.200310065 +0200.

Are you able to make a simple repro for what you're seeing?

polarathene commented 1 month ago

> Are you able to make a simple repro for what you're seeing?

I provided one at the very top of the report under the "Reproduction" header. It uses Docker, which should provide you with an environment with the same conditions to reproduce.

Just to reiterate, since I think you missed the context with your last reply: the mtime is not stored in the archive because I do not want to modify files that are otherwise unchanged beyond mtime, yet mtime is still updated.

When I say mtime is updated/changed, I mean that it becomes the time of extraction due to --no-time. That would be ok for the initial extraction when the file didn't exist, but when it already exists, there should be no difference in the two archives for unchanged files when --no-time is used? So why would the mtime be updated unless it's naively overwritten?

folbricht commented 1 month ago

> So why would the mtime be updated unless it's naively overwritten?

That's exactly what's happening currently, and it does not matter if there's an mtime in the archive or not. Files are always overwritten.

Having said that, if you include the mtime during tar, then untar will set the mtime of the extracted file to that value (after extracting it). The file would still have been written, but the mtime may be the same if the source-file during tar wasn't changed. So in your use-case, perhaps including the mtime will help?

polarathene commented 1 month ago

> So in your use-case, perhaps including the mtime will help?

The 2nd archive would be generated at a different mtime, so no I don't think it would help. I could perhaps pre-process mtime to 0 before archiving with desync and see if the overwritten files still result in duplicate data in the new OCI image layer. The inode isn't changed with desync like it is with casync, so perhaps it would otherwise be identical.

I can alternatively use something like rsync for this, I just liked the other perks that a solution like desync had to offer but dealing with a directory tree of files is a bit of a friction point it seems 😅

folbricht commented 1 month ago

One thing I could do is add an option to untar to set mtime to a fixed value after writing each file (could also set it to zero). Though not sure this would help you if something later relies on the mtime of updated files being different. Let me know if that'd be useful.
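For reference, the behaviour proposed here (overwrite, then pin the mtime) amounts to something like the following Python sketch. It is not desync code and no such untar option exists in this thread; `write_with_fixed_mtime` is a hypothetical name.

```python
import os

def write_with_fixed_mtime(path, data, mtime=0):
    """Overwrite path in place, then pin its atime/mtime to a fixed value.

    mtime is in seconds since the epoch; 0 mirrors the "set it to zero" idea.
    """
    with open(path, "wb") as f:  # truncates and rewrites; an existing inode is reused
        f.write(data)
    os.utime(path, (mtime, mtime))
```

The file is still rewritten on every run, so this only makes the resulting metadata stable. Whether a rewritten but otherwise identical file still ends up in a new image layer is the open question raised in this thread.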

polarathene commented 1 month ago

> One thing I could do is add an option to untar to set mtime to a fixed value after writing each file (could also set it to zero).

I think that was my expectation with --no-time but for the untar command instead. I get the feeling it may not make much of a difference though if the write is still considered as adding to a new layer in the Docker image.

No rush to resolve this any time soon, it's mostly to document awareness and see if any other users chime in with similar needs. My use-case may depend on a less naive extraction process as discussed at https://github.com/folbricht/desync/issues/242 to skip writes for files with no actual difference in their content.

In my scenario with Docker layers, it was for using the chisel tool to install minimal packages, then extending the image by using chisel again to add additional packages as you might any other package manager, but chisel AFAIK has the same issue of naive extraction to the target (it effectively builds a list of subset of package contents known as "slices" from a proper package manager, which then extracts a copy of that subset to the destination).

chisel is also written in Go (and developed by Canonical), but until they resolve that same seeding concern I thought I would look at something like desync as a workaround solution 😅 (maybe if desync can work as a library, it'd be useful for chisel?)
