adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

Add option to ignore file content #160

[Open] Arseney300 opened this pull request 2 years ago

Arseney300 commented 2 years ago

Hello. When the files in a directory are very large, fdupes can run for a very long time. I think it would be great if fdupes had an option to ignore file contents and compare only by size. I have added a new "-c --ignore-content" option, a new compare function, and a small shortcut in checkmatch to avoid reading the whole file.
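
For illustration, the core of the idea is roughly this (a simplified sketch with made-up names, not the actual patch):

```c
/* Simplified sketch, not the actual patch: with --ignore-content,
   two files count as a match as soon as their sizes are equal, so
   their contents are never read. */
#include <sys/stat.h>

/* Return 1 if both paths are regular files of equal size. */
static int match_by_size(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return 0; /* unreadable files never match */

    return S_ISREG(sa.st_mode) && S_ISREG(sb.st_mode) &&
           sa.st_size == sb.st_size;
}
```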

adrianlopezroche commented 2 years ago

Comparing files by size is not at all a reliable way of detecting duplicates. Such a feature would fall outside of fdupes' intended purpose.


moonshiner commented 2 years ago

Totally agree that checksums are better.

But I've written something for myself that does this and keeps the hashes saved to make it easier to compare with other folders on other file systems.

Perhaps an option to save the checksum metadata between runs?
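
Something like this is roughly what I have in mind (a rough sketch, not fdupes code; FNV-1a here just stands in for a real checksum such as MD5, and the "hash<TAB>path" cache format is made up):

```c
/* Rough sketch, not fdupes code: persist a per-file checksum so a
   later run (possibly against another file system) can reuse it. */
#include <stdio.h>
#include <stdint.h>

/* Compute a 64-bit FNV-1a hash over the file's contents. */
static int hash_file(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    unsigned char buf[65536];
    size_t n, i;

    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL; /* FNV prime */
        }
    fclose(f);
    *out = h;
    return 0;
}

/* Append one "hash<TAB>path" line to the cache file. */
static int cache_append(FILE *cache, const char *path)
{
    uint64_t h;

    if (hash_file(path, &h) != 0)
        return -1;
    return fprintf(cache, "%016llx\t%s\n",
                   (unsigned long long)h, path) < 0 ? -1 : 0;
}
```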

mydogteddy commented 2 years ago

I tried to use fdupes to find duplicate films on a 5 TB drive full of films with many duplicates, and it was useless for that purpose (unless you were in a very cold place and wanted some warmth from your CPU), because fdupes just went on and on forever at 100% CPU.

Now, with -c, fdupes worked superbly well and took less than a minute to find all the many duplicates. As far as I could determine it made no mistakes, so for this use case fdupes -c is excellent.

Prior to this I almost paid good money for some proprietary code, God forbid; now I can spend it on beer instead.

moonshiner commented 2 years ago

Would it be possible to log all the md5 sums generated during a run as an option?

mydogteddy commented 2 years ago

I have been using fdupes with the -c option for some time now. I find it particularly useful for quickly finding duplicate names in my very large film and music collections. To the best of my knowledge, there is no other free option available that does the same thing as quickly as fdupes -c does.

With the -c option I can search 5 TB of data in just a few seconds to find duplicate names, which is perfect for finding duplicate film names and the like, where exact content is not so important.

There is a paid-for program that can do the same; however, it is quite expensive.

I really do think the -c option ought to be included; otherwise, many other people who just wish to search their film/music collections for duplicate names will have to pay for a proprietary alternative.

Including the -c option will not detract from any of the other options available in fdupes, so there is everything to gain and nothing to lose. Unless we are just being purists here for no good reason, I see no reason not to have the -c option.

lpcvoid commented 2 years ago

I agree, and I would love to see this added, as my use case is exactly what was described. I don't care much about the actual content; I care about fast comparisons, and I also don't see the harm in giving the user this option if he or she wants it.

philipphutterer commented 1 year ago

You could basically use find, sort, and awk for that:
find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) print l "\n"; c=0} s=$1; l=$0} END {if (c) print l}'

This will print a list of files with equal sizes.
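
For example, with two same-sized copies of one film and two same-sized copies of another (hypothetical paths and sizes), the output would look like:

```
734003200 ./downloads/film_b (copy).avi
734003200 ./movies/film_b.avi

1048576000 ./backup/film_a.mkv
1048576000 ./movies/film_a.mkv
```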

jbruchon commented 1 year ago

This is a fantastic way to lose a lot of data quickly. Don't do this unless you know your data quite intimately.

philipphutterer commented 1 year ago

How can you lose your data this way?

mydogteddy commented 1 year ago

I have been using this option for many months, sorting my vast film and music collection, and have lost nothing. It works really fast, is easy to use, and, as far as I am concerned, is reliable for what I use it for.

If you want to be a die-hard purist, then go ahead and try: find . -type f -printf '%s %p\n' | sort -V | awk '{if ($1 == s) {print l; c=1} else {if (c) print l "\n"; c=0} s=$1; l=$0} END {if (c) print l}'

jbruchon commented 1 year ago

How can you lose data by assuming identical size equals identical contents and then taking potentially destructive actions based on that assumption? Are you seriously asking me this question?

philipphutterer commented 1 year ago

Okay, are we even talking about the same thing? The command I posted just lists file names with equal sizes, no more, no less. No destructive actions; in fact, no actions at all. And as people mentioned above, there are use cases where you might want that list of files with equal sizes. What you do with that information is a different story.

jbruchon commented 1 year ago

This tool is used primarily to delete duplicate files; -dN is the most common use case. Now imagine someone sees "faster" in the help text, uses the new option, and it deletes all "duplicate" files that merely have the same size. Just because it's not YOUR use case, or YOU wouldn't walk into that trap, doesn't mean it's not a use case or a trap for someone else less experienced or careful.

https://en.wikipedia.org/wiki/Principle_of_least_astonishment

https://www.jjinux.com/2021/05/add-another-entry-to-unix-haters.html

Also, my response was primarily against the idea in general, not your code in particular.