IanLee1521 / utilities

Utility scripts for useful tasks.
MIT License

fuzzy find-duplicates #4

Open openbrian opened 2 years ago

openbrian commented 2 years ago

Ian,

Just saw your PyCon lightning talk from 2016 where you mention find-duplicates. I was wondering if it has a fuzzy search, meaning it handles the case where you copied a folder of files and then changed some of those files. For example, you make a "backup" folder and you have backups of backups.

I'm looking for a way to identify similar folders. Do you think find-duplicates can do anything like this?

Thanks, Brian

IanLee1521 commented 2 years ago

Hi @openbrian - Nice to hear from you.

Do you have a more concrete example of what you're thinking?

I don't think it is fuzzy in the way that I would expect, but it does work on a per-file basis, which means that it will help identify two directories that share a lot of common content. Namely, when you run it, if you have two sub-directories where there is a lot of overlap, you'll see that in the output.
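
To make the per-file idea concrete, here is a minimal sketch of how matching files can reveal overlapping directories. It is not the actual find-duplicates code: the function names `file_digest` and `shared_content_by_directory`, and the choice of SHA-256, are assumptions made purely for illustration.

```python
import hashlib
import os
from collections import Counter, defaultdict
from itertools import combinations


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shared_content_by_directory(root):
    """Count byte-identical files shared by each pair of directories under root.

    Hypothetical helper, not part of find-duplicates.
    """
    # Group every file path by the digest of its contents.
    paths_by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            paths_by_digest[file_digest(path)].append(path)

    # For each group of identical files, credit every pair of
    # distinct directories that contains a copy.
    overlap = Counter()
    for paths in paths_by_digest.values():
        dirs = sorted({os.path.dirname(p) for p in paths})
        for pair in combinations(dirs, 2):
            overlap[pair] += 1
    return overlap
```

Calling `shared_content_by_directory(root).most_common(10)` would list the directory pairs with the most identical files, which is roughly the pattern a user would otherwise pick out of the raw output by eye.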

When I originally wrote this, I was trying to clean up from having as many as 6 copies of some pictures in my photo library due to some copying and backing up that went weird, so it definitely helped me in that way.

Hope that helps answer your question, if it doesn't, please let me know.

openbrian commented 2 years ago

Ian,

It sounds like the use cases are different. You know the 6 folders that have similar content. For me, I want an app that finds similar folders. I will then inspect the folders manually, or use diff -r, or whatever.

What I'm looking for is a way to characterize a folder and use this characterization and some sort of distance function to find similar folders. It would be like the opposite of hashing, as a hash dramatically changes for each small difference.
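
One possible reading of "characterization plus distance function", offered only as a sketch: describe each folder by the set of content hashes of its files, then compare two folders with the Jaccard distance between those sets. Each per-file hash still changes completely when a file is edited, but the set as a whole shifts only slightly, so a backup with a few modified files stays close to its original. The names `folder_fingerprint` and `jaccard_distance` are hypothetical and do not come from find-duplicates.

```python
import hashlib
from pathlib import Path


def folder_fingerprint(folder):
    """Characterize a folder as the set of SHA-256 digests of its files.

    Reads each file fully into memory, which is fine for a sketch
    but not for very large files.
    """
    return {
        hashlib.sha256(path.read_bytes()).hexdigest()
        for path in Path(folder).rglob("*")
        if path.is_file()
    }


def jaccard_distance(a, b):
    """Return 0.0 for identical sets and 1.0 for sets with nothing in common."""
    if not a and not b:
        return 0.0  # two empty folders are treated as identical
    return 1.0 - len(a & b) / len(a | b)
```

Computing `jaccard_distance(folder_fingerprint(x), folder_fingerprint(y))` for every pair of candidate folders and sorting by the result would put the likely backups of backups at the top. The set of hashes ignores file names and duplicate counts, so renamed copies still match.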

I'm going to check out your code anyway.

Cheers, Brian

Brian DeRocher

IanLee1521 commented 2 years ago

Ah ok, I think what I'm describing does do that, but in a very manual, "you as the user have to stare at the output and make sense of it via pattern matching" sort of way.

If you find a way to do what you're describing though, this could provide the start of the code to do that. If you wanted to submit a pull request, I'll definitely take a look at it.