TagStudioDev / TagStudio

A User-Focused Photo & File Management System
https://docs.tagstud.io/
GNU General Public License v3.0

[SUGGESTION] Possible Solution for relinking renamed/moved files #36

Open lloyd094 opened 4 months ago

lloyd094 commented 4 months ago

First: I don't have a full grasp of how tags are implemented, so this may not be an option.

Can this issue be solved by creating a checksum based on the file and/or giving the file a hidden tag / database ID number, which would then travel with the file? The idea is that this number is unique, and can then be found later on if the file was previously in TagStudio but later removed.

For example, if I took the file cat.png after it was already in TagStudio and had been given ID: 1, then removed it from my computer and later added that same file with the hidden ID: 1, TagStudio would then know what that file is (and its other associated tags).

This implementation may require a local database or log file, which may be against your implementation idea, but it could help with recovery on that specific computer.

CyanVoxel commented 4 months ago

Are you referring to keeping some sort of unique ID inside of the file, via something like EXIF metadata?

Right now, TagStudio assigns each file, together with its folder path, a unique ID that it keeps track of inside an internal database. If the file is moved, that path+filename key becomes orphaned. However, TagStudio can search for and relink that file if it finds a file with the same name in your library directory. There are lots of cases where this wouldn't work, but we've got plans to implement additional methods of detecting when files are moved, modified, and renamed, including checksums and hashes.
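A minimal sketch of the name-based relinking described above, for illustration only. It is not TagStudio's actual code: `find_relink_candidates` and `relink` are hypothetical helpers, and the rule "relink only when exactly one file with that name exists" is an assumption.

```python
from pathlib import Path


def find_relink_candidates(library_dir, missing_path):
    """Return every file under library_dir that has the same name as the missing entry."""
    wanted = Path(missing_path).name
    return sorted(
        p for p in Path(library_dir).rglob("*")
        if p.is_file() and p.name == wanted
    )


def relink(library_dir, missing_path):
    """Return the new path if the match is unambiguous, otherwise None."""
    candidates = find_relink_candidates(library_dir, missing_path)
    # Several files with the same name is one of the cases where
    # name-only matching cannot decide, so nothing is relinked.
    return candidates[0] if len(candidates) == 1 else None
```

A renamed file is never found this way, which is where checksums or other attributes would have to come in.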

Anyway, if you were referring to storing a unique ID inside of files to help with this process, it's on our radar but is lower down the list of things to try, since it involves modifying users' files and will only work with files that support having that kind of metadata written to them. Either way, thank you for the suggestion!

lloyd094 commented 4 months ago

Yes! The unique ID attached to the file itself if possible, and if not (or for files that don't support it) something with checksums to give the file a better chance to be recognized. Thank you!

Trevo525 commented 4 months ago

Instead of modifying the files in any way, why not create a checksum of each file and use that as an ID? When a file gets added whose checksum already exists, it could then be checked to see if it's a duplicate. If the file doesn't exist in its original location anymore, then you can change the path, no harm no foul. If the file exists in both locations, then we can prompt the user for which file to keep (and delete the other), or maybe allow it to keep both, with one of them becoming the default for when you copy/paste.

Just brainstorming. I wouldn't want TagStudio to edit the files in any way personally.
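The decision flow in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing TagStudio feature: the `index` dict (checksum to last known path), the function names, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash the file in chunks so large files are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def classify_added_file(new_path, index):
    """Classify a newly added file against index (checksum -> last known Path).

    Returns "new", "known", "moved", or "duplicate".
    """
    new_path = Path(new_path)
    digest = sha256_of(new_path)
    known = index.get(digest)
    if known is None:
        index[digest] = new_path      # checksum never seen before
        return "new"
    if known == new_path:
        return "known"                # same file at the same path
    if not known.exists():
        index[digest] = new_path      # original is gone: update the path
        return "moved"
    return "duplicate"                # both exist: the caller would prompt the user
```

Nothing is deleted or overwritten here; the "duplicate" result only signals that a prompt would be needed.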

CyanVoxel commented 4 months ago

Something like this is what I intend to check for, along with other attributes such as date created, date modified, etc., before thinking about resorting to modifying the files. I wouldn't make the checksum the actual ID for the file, however, since checksums will change when the files are modified and that can cause some issues. But as another factor for determining which file is which, absolutely!

Icosahunter commented 4 months ago

I created an algorithm for relinking files for you at https://github.com/Icosahunter/relink_utils_for_tagstudio, and it can optionally take hashes for when you add that (it includes a simple function to create hashes and assumes you use it). I made the interface independent of TagStudio's implementation: it just takes file paths (and optionally a hash) as inputs.

shfeat commented 3 months ago

How about the file's MD5? For performance, we can compute the MD5 over only the first 64k bytes.
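A rough sketch of that partial-hash idea, assuming "64k" means 64 × 1024 bytes. The helper name `quick_key` is made up, and pairing the digest with the file size is an addition not mentioned in the comment.

```python
import hashlib
import os


def quick_key(path, limit=64 * 1024):
    """Return (file size, MD5 of the first `limit` bytes) as a cheap fingerprint."""
    with open(path, "rb") as f:
        head = f.read(limit)
    return os.path.getsize(path), hashlib.md5(head).hexdigest()
```

The trade-off is that two files with the same size and the same first 64k bytes get the same key even if they differ later on, so a match here is a hint rather than proof.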

yedpodtrzitko commented 3 months ago

"it can optionally take hashes for when you add that"

fyi there's hashlib.file_digest for calculating file hashes
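For reference, a small usage sketch of `hashlib.file_digest`, which was added in Python 3.11. The wrapper name `file_sha256`, the choice of SHA-256, and the chunked fallback for older interpreters are assumptions made for this example.

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):  # Python 3.11 and newer
            return hashlib.file_digest(f, "sha256").hexdigest()
        # Fallback for older versions: feed the hash in 1 MiB chunks.
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
        return h.hexdigest()
```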

yedpodtrzitko commented 3 months ago

one more thing to this topic - using xxhash could be more performant than sha1 etc., no matter which way sha1 goes through the file

shfeat commented 3 months ago

Wow, that seems to be a better solution.

KillyMXI commented 2 weeks ago

Preserving relevant discussion from Discord, with minor edits. It took place on August 30th.

Start of the discussion: https://discord.com/channels/1229183630228848661/1229309667528806420/1278835404715458590

Computerdores — Today at 1:55 AM

wouldn't it be possible to "just" store a hash of the file? and use that to detect moves / renames?

CyanVoxel — Today at 1:56 AM

Storing a hash of the file is one of the ideas included in the video and linked issues. It could be used to help relink a move, or detect a change, but not both at once. It's a tool we're planning on leveraging though

Killy — Today at 2:01 AM

This is similar to something Git tries to solve, and Git takes it a bit further. When you have a changeset, after detecting clear moves, if there are still deleted and added files, they can be compared for "strong similarity". But Git knows what the file was like before.

Without versioning, guessing the move + minor edit might be achieved by comparing known metadata or employing special hash functions (perceptual hash, media fingerprinting) (EDIT: https://en.wikipedia.org/wiki/Perceptual_hashing)
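To make the perceptual-hash idea concrete, here is a simplified "average hash" in pure Python. It assumes the image has already been decoded and downscaled to a small grayscale grid (a list of rows of brightness values); real tools do that step with an imaging library. The function names are illustrative only.

```python
def average_hash(pixels):
    """Build a bit fingerprint: one bit per pixel, set if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")
```

Unlike a checksum, a small edit such as a uniform brightness change leaves the fingerprint unchanged or nearly so. A small Hamming distance therefore suggests "same picture, slightly edited", and the threshold is what gives the adjustable level of similarity.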

radu.m91 — Today at 2:27 AM

Does the filesystem inode number change when the file is moved or edited? I suspect it doesn't. This means that while displaying the files, the program could check the filepath, modified time, content hash and inode number in order to detect if the file has been moved or edited, before displaying it. If it has been edited the inode number would be the same, but the content hash would differ. If it has been moved, the content hash of that inode would be the same, and the program could search for filepaths pointing to that inode in order to find the new location. (EDIT: https://en.wikipedia.org/wiki/Inode)

AiSO'); DROP TABLE users;-- — Today at 2:43 AM

If a file on Linux is moved into a different folder, renamed, or its content changes, the inode remains the same.
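A small experiment to check the inode claims, using only `os.stat`. Two caveats are worth keeping in mind: a move to a different filesystem is a copy plus delete, so the file gets a new inode there, and programs that save by writing a temporary file and replacing the original also end up with a new inode. The `demo` function and file names are made up for this sketch.

```python
import os


def inode_of(path):
    return os.stat(path).st_ino


def demo(workdir):
    """Return the inode after creation, rename, in-place append, and save-by-replace."""
    original = os.path.join(workdir, "cat.png")
    with open(original, "wb") as f:
        f.write(b"v1")
    created = inode_of(original)

    renamed = os.path.join(workdir, "kitten.png")
    os.rename(original, renamed)          # rename within the same filesystem
    after_rename = inode_of(renamed)

    with open(renamed, "ab") as f:        # edit the existing file in place
        f.write(b"more")
    after_append = inode_of(renamed)

    tmp = renamed + ".tmp"
    with open(tmp, "wb") as f:            # "safe save": write a new file...
        f.write(b"v2")
    os.replace(tmp, renamed)              # ...then swap it over the old one
    after_replace = inode_of(renamed)

    return created, after_rename, after_append, after_replace
```

The first three values should match, while the last one differs, so the inode works as a hint for moves and in-place edits but not for every kind of save.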

Killy — Today at 2:52 AM (reply to Killy)

A side effect of having hashes of all files is that exact duplicates are easy to find. The same goes for similar media in the case of perceptual hashes afaik, possibly even with an adjustable level of similarity. That's how duplicate search works in Hydrus and https://github.com/ermig1979/AntiDupl
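As an illustration of that side effect, exact duplicates fall out of simply grouping paths by content hash. This is a sketch, not how Hydrus or AntiDupl work internally: `find_exact_duplicates` is a hypothetical helper, and it reads each file fully into memory for brevity.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; chunked hashing would suit big files better.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```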

Killy — Today at 3:03 AM

https://xxhash.com/

yedpodtrzitko — Today at 5:47 AM

It would be good to concentrate the discussion and suggestions in the issue which Cyan mentioned ( https://github.com/TagStudioDev/TagStudio/issues/36 ); nobody is going to find it here within a week among the stream of messages.

K — Today at 10:09 AM (reply to Killy)

The problem with hashing is that it takes long and kills your HDD speed and CPU.

Killy — Today at 12:16 PM (reply to K)

You have to be smart about it. Avoid recomputing hashes. Use fast hashing algorithms. (That's why I linked xxhash.) It may still matter for big files though. The tools I've mentioned are very efficient on image files.

How things work on big files can be seen in another media library application called "stash", which is intended mostly for videos. It still computes a file hash for all added files, among other things that take more time anyway. Only the perceptual hash is an option that is not enabled by default. (EDIT: https://github.com/stashapp/stash/)

K — Today at 1:03 PM (reply to Killy)

xxhash

This is the first I've heard of this. Do people use or support this hash, though? I just had a deeper look at their website and I recognize some of the software there. Hmm, it looks like CLI-only software, judging by their Git release.

Killy — Today at 1:29 PM (reply to K)

It's an algorithm with implementations or library bindings in different languages. Intended for integration into your software - TagStudio for example. Its entire concern is to return a hash for a given file as fast as possible. The usage of that hash is a concern of the client application - be that move detection, duplicate detection, integrity check or whatever else...

Killy — Today at 1:38 PM

There are a lot more non-cryptographic hash functions (read: not slowed down artificially). Many (not all) are mentioned at https://en.wikipedia.org/wiki/List_of_hash_functions, but xxhash seems to be the best known and well regarded.

radu.m91 — Today at 4:21 PM (reply to Killy)

That's why I mentioned "filepath" and "modified time" before "content hash". The hash value can be "cached", and updated only if the "modified time" changes, or if the file disappears and is relocated using the inode. So, basically, each file has 4 identifiers (filepath, modified time, content hash, inode number) from which any 2 can be used to re-compute the other 2. Let's think of any edge cases where this might not hold!
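A sketch of the caching idea with the four identifiers kept in one record. The `FileRecord` class and the helper names are hypothetical, and SHA-256 is used only because it is in the standard library.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    mtime_ns: int
    inode: int
    content_hash: str


def hash_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(path):
    """Record all four identifiers for a file."""
    st = os.stat(path)
    return FileRecord(path, st.st_mtime_ns, st.st_ino, hash_file(path))


def refresh(record):
    """Return (record, rehashed). The file is re-read only if mtime or inode changed."""
    st = os.stat(record.path)
    if st.st_mtime_ns == record.mtime_ns and st.st_ino == record.inode:
        return record, False          # cached hash is assumed to still be valid
    return FileRecord(record.path, st.st_mtime_ns, st.st_ino, hash_file(record.path)), True
```

One edge case for the list: an edit that keeps the modified time unchanged (some tools restore timestamps after writing) would go unnoticed by this check until the file is hashed again.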