9001 / copyparty

Portable file server with accelerated resumable uploads, dedup, WebDAV, FTP, TFTP, zeroconf, media indexer, thumbnails++ all in one file, no deps
MIT License

Question: What happens to duplicate files in different subfolders/with different names? Are they symlinked or duplicated? #44

Closed: Gremious closed this 1 year ago

Gremious commented 1 year ago

The docs mention e.g. "--hardlink: creates hardlinks instead of symlinks" and that upload does "symlink/discard duplicates (content-matching)", but... when are symlinks created? (If I just missed a section of the docs, please do say.)

Please consider this example:

I have both e2dsa and e2ts in my global config. I have uploaded a pictures/ folder with just all of my photos in it, to the same-named copyparty volume. Then, on the OS, I make a new folder, pictures/my-may-album, into which I copy photos from the pictures/ folder to make a little album. Then I just drag-and-drop upload that folder to copyparty's pictures/, so now it has pictures/my-may-album/ with copies of existing files.

Does copyparty deduplicate those? Are they symlinks now, and can I delete either end safely? Or does it only deduplicate if they have the same name, directory AND contents?

The filekeys for both of those stay different, and doing a file search finds both copies as "0 diff", so it's hard to tell, and the mention of symlinks in regards to de-duplication got me a bit confused.

9001 commented 1 year ago

Good question! While your exact scenario is fairly easy to answer for (see the bullet-points at the end!), I'd like to take the chance to generalize a bit, and try to answer for deduplication in general. And I want to mention the usual pitfalls of deduplication with symlinks as well, so the different consequences make more sense.

But before I continue, note that deduplication is, by default, effectively disabled when running on Windows -- since Windows does not permit creating symlinks unless you run copyparty as Administrator. So most of the below applies to running copyparty on Linux or macOS.

Copyparty will perform deduplication by symlinking duplicate files during upload -- and only during upload -- assuming it knows about at least one matching file inside the volume. This is done based on file contents; deduplication happens as long as the same data exists anywhere inside the volume, regardless of filename or folder. The restriction of only deduplicating files within the same volume can be lifted with --xlink. Symlinking is the default approach to deduplication, since copyparty mostly expects you to use the web-UI to manage uploaded files; depending on your use case, --hardlink or even --no-dedup may be a better choice (explained later).
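To make the "content-matching during upload" idea concrete, here is a minimal sketch of the general technique. It is not copyparty's actual code: the `index` dict, the `store_upload` helper and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
import shutil


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store_upload(index, src, dst):
    """Place an upload at dst, symlinking to a known copy if the contents match.

    index is a hypothetical in-memory map: content digest -> path of a file
    that is already stored in the volume. Returns "symlink" or "copy".
    """
    digest = file_digest(src)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    origin = index.get(digest)
    if origin and os.path.isfile(origin):
        # Same contents already exist: link to them, whatever the name/folder.
        os.symlink(os.path.relpath(origin, os.path.dirname(dst)), dst)
        return "symlink"
    shutil.copyfile(src, dst)
    index[digest] = dst
    return "copy"
```

Note that the match is decided by the digest alone, so a renamed copy uploaded into a different subfolder is still treated as a dupe.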

If you have duplicate files on your filesystem which originate from non-copyparty activities (local OS file explorers and such), then these will always be left as-is, and not deduplicated. When uploading a file to copyparty, and copyparty realizes it has one or more copies of that file already, it will pick "the closest" dupe to the upload destination. Meaning, if you already have two identical copies of a pic on the server's filesystem, namely a/one.jpg and a/b/two.jpg, and you use copyparty to upload three.jpg into a/b/c, then it will pick a/b/two.jpg as the origin to symlink from.

Note that the symlinks are relative, meaning that a/b/c/three.jpg would be pointing to "two.jpg in the parent directory" -- this matters in regards to managing your files using a local OS file explorer.
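The a/b/c/three.jpg example can be worked through in a few lines. This is only a sketch: "closest" is interpreted here as "shares the longest leading directory path with the destination", which is an assumption, and both helper names are hypothetical.

```python
import posixpath


def closest_dupe(candidates, dest):
    """Pick the candidate sharing the most leading directories with dest.

    Assumption: this is one plausible reading of "the closest dupe",
    not necessarily the rule copyparty applies.
    """
    dest_parts = posixpath.dirname(dest).split("/")

    def shared_depth(path):
        depth = 0
        for a, b in zip(posixpath.dirname(path).split("/"), dest_parts):
            if a != b:
                break
            depth += 1
        return depth

    return max(candidates, key=shared_depth)


def relative_target(origin, dest):
    """Return the relative path a symlink at dest would store to reach origin."""
    return posixpath.relpath(origin, posixpath.dirname(dest))


origin = closest_dupe(["a/one.jpg", "a/b/two.jpg"], "a/b/c/three.jpg")
print(origin)                                      # a/b/two.jpg
print(relative_target(origin, "a/b/c/three.jpg"))  # ../two.jpg
```

Because the stored target is `../two.jpg` rather than an absolute path, moving either file on its own changes what the link resolves to.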

If you delete a file using the copyparty UI, it will make all the appropriate changes on the filesystem to ensure that all deduplicated files are OK. That is, if you delete the "origin" (initial copy / original file), then one of the dupes will get promoted to the new origin file, and all symlinks will be rewritten so they point to this file instead.
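Roughly, "promoting" a dupe could look like the sketch below. It assumes the caller already knows which symlinks point at the origin (copyparty tracks this itself; here it is just a list argument), and `delete_origin` is a hypothetical helper, not a copyparty function.

```python
import os


def delete_origin(origin, dupes):
    """Delete origin while keeping its data reachable through the dupes.

    dupes is a list of symlink paths that currently point at origin.
    The first dupe is promoted to a regular file holding the data, and
    the remaining symlinks are rewritten (relatively) to point at it.
    Returns the new origin path, or None if there were no dupes.
    """
    if not dupes:
        os.unlink(origin)
        return None
    new_origin, rest = dupes[0], dupes[1:]
    os.unlink(new_origin)          # drop the symlink itself
    os.rename(origin, new_origin)  # the data now lives at the promoted path
    for link in rest:
        os.unlink(link)
        os.symlink(os.path.relpath(new_origin, os.path.dirname(link)), link)
    return new_origin
```

The `os.rename` step assumes origin and dupe are on the same filesystem, which holds for files inside one volume on one disk.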

If you delete a symlink (dupe) using the regular file manager in your OS, then that is fine; copyparty will notice this and forget the dupe on the next rescan. However, deleting the origin in this manner, or just moving either a symlink or an origin to a different location (as they are linked relatively), will cause issues as the dupes will now be dangling symlinks -- in the worst case, loss of data!
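If you have been moving things around with an OS file manager and want to check for damage, a small standalone script can list dangling symlinks. This is a generic sketch using only the Python standard library, not a copyparty feature; `find_dangling_symlinks` is a made-up name.

```python
import os


def find_dangling_symlinks(root):
    """Return a sorted list of symlinks under root whose target is missing."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

Running it on a volume root before and after a manual reorganisation shows which dupes lost their origin.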

For this reason, you may wish to use --hardlink, which changes the deduplication approach to using hardlinks instead of symlinks. Hardlinks carry the advantage that you can delete any of the dupes, using any software you prefer, without running the risk of data loss. However, hardlinks have the disadvantage of appearing as if they are completely normal files, which can be dangerous: if you make any modifications to a hardlinked file, then this modification will also apply to all other linked copies of that file. There is a second disadvantage to hardlinks, which is that they share the same last-modified timestamp. So if you rely on the last-modified timestamp to synchronize files between machines, then you will want to use symlinks instead.
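The hardlink trade-offs are easy to see in a throwaway directory. The sketch below uses plain standard-library calls on a filesystem that supports hardlinks; the file names and the `hardlink_demo` helper are arbitrary.

```python
import os
import tempfile


def hardlink_demo(workdir):
    """Show that hardlinked names share data, mtime, and survive deletion."""
    a = os.path.join(workdir, "a.txt")
    b = os.path.join(workdir, "b.txt")
    with open(a, "w") as f:
        f.write("v1")
    os.link(a, b)  # b is a second name for the same inode
    same_inode = os.stat(a).st_ino == os.stat(b).st_ino

    with open(b, "w") as f:  # in-place edit through one name...
        f.write("v2")
    with open(a) as f:       # ...is visible through the other
        content_a = f.read()
    same_mtime = os.stat(a).st_mtime_ns == os.stat(b).st_mtime_ns

    os.unlink(a)             # deleting one name leaves the data intact
    survives = os.path.exists(b)
    return same_inode, content_a, same_mtime, survives


with tempfile.TemporaryDirectory() as d:
    print(hardlink_demo(d))  # (True, 'v2', True, True)
```

One caveat: programs that save by writing a new file and renaming it over the old one replace only that one name, so the other links keep the old contents.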

Note: Apparently Windows 11 lets you create hardlinks as a regular user, but not symlinks... So you can choose to enable deduplication on Windows 11 by specifying --hardlink, with all of the caveats mentioned in the previous paragraph. When running without --hardlink, such that deduplication is nonfunctional on Windows, note the log messages when you upload a dupe; the final line indicates that deduplication failed and that a full copy was created instead:

[Screenshot_2023-07-10_22-12-26-or8: copyparty log output from uploading a dupe on Windows]

The final option, --no-dedup, avoids all of these issues, and still provides the speed benefit of dupe detection during upload, in the sense that copyparty will tell the client to skip uploading the file and instead make a local copy on the filesystem -- only with the disadvantage that each copy of the file will take up the full amount of disk space.

Filekeys are calculated based on filesystem location + filesize + inode-number, so these should be different for each of the dupes.
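As a rough illustration of why every dupe ends up with its own filekey, here is a sketch that derives a key from the three inputs named above. The salt, the SHA-512 hash, the base64 encoding and the 8-character length are all assumptions for the example; `sketch_filekey` is not copyparty's function.

```python
import base64
import hashlib
import os


def sketch_filekey(salt, path, length=8):
    """Derive a short key from salt + path + filesize + inode number.

    Hypothetical scheme for illustration only. os.stat() follows symlinks,
    so a symlinked dupe reports its origin's size and inode, but the path
    still differs and therefore so does the key.
    """
    st = os.stat(path)
    material = "{} {} {} {}".format(salt, path, st.st_size, st.st_ino)
    digest = hashlib.sha512(material.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]
```

Since the path is part of the input, two identical files, a symlink and its origin, or two hardlinked names all map to different keys.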


And finally, regarding your specific case:

Gremious commented 1 year ago

Thank you SO much for such a detailed write up - I really appreciate it! This clears up all of my questions perfectly c:

Would be awesome to eventually migrate this to a proper handbook or something actually :>

Yeah, the readme seems to be growing and growing ehe, might be time to start thinking abt moving it to a proper doc book...

Gremious commented 1 year ago

Oh, actually a small note - I was about to ask if I could set --no-dedup per volume, but searching the repo answered my question (copydupes volflag) (wanted to mention it here for people stumbling on this question)

If you ever do write the docs for all this, please do mention it as well c: