elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License
420 stars 65 forks

Add ability to recurse a directory tree but limit comparison of content to that in each subfolder only #53

Closed audiomuze closed 10 months ago

audiomuze commented 1 year ago

I have a use case whereby I have thousands of folders in a directory tree, with each folder containing one or more images. It would be great to be able to run difPy pointed at the root folder of the tree in question and have it systematically process each folder in the tree, processing each folder having regard only to the image files in that folder.

elisemercury commented 1 year ago

Hi @audiomuze, Thanks for your input! I'm not sure I fully understand your request. difPy has the option to search across two different folders by setting directory_A and directory_B. If you set these two to root directories, it will search through all their subdirectories for matching images. I hope this helps! Otherwise, if I misunderstood, please feel free to clarify and provide an example so that I can look into your request.

Thank you and all the best, Elise

audiomuze commented 1 year ago

Hi @elisemercury

Now that I re-read what I'd written, I can see it's clear as mud. To clarify, let's assume there are many hundreds of subfolders in a directory tree /home/x/collection/, where collection is the "root" folder and each subfolder contains a variety of files including the following: [image]

It'd be very useful to be able to run difPy from the collection folder, having it traverse the entire tree and wherever it encounters

[image]

in the same folder it compares them and deletes the lower-quality file.

elisemercury commented 1 year ago

Hi @audiomuze,

Thanks for clarifying! This is indeed supported by difPy. As mentioned above, difPy supports recursive search within the given directory and this setting is activated by default (see parameter recursive in the difPy Usage Documentation). Therefore, by using:

from difPy import dif

search = dif("../home/x/collection/")

difPy will search through collection and return all the duplicates it finds. The search is performed on the union of all files in collection, including all files found in its subfolders. In your case, it would find folder.jpg and xfolder.jpg as well, if these two images are indeed duplicates.

I hope this clarifies!

All the best, Elise

audiomuze commented 1 year ago

Hi @elisemercury,

I'm afraid I'm failing to communicate effectively. What you've described is what I'd understood from the outset and is a perfectly valid behaviour when wanting to find all duplicates that exist in a population regardless of location within the directory tree.

What I'm suggesting is to add the ability to recurse the directory tree as is presently the case, but to restrict the duplicate identification to one subfolder at a time, i.e. return the duplicates (if any) in each subfolder considered in isolation rather than in collection as a whole.

This could be done by for example adding a -i / -infolder switch option to difPy which then modifies its behaviour as follows:

[image]

In the example above I'd like to process the entire directory tree, but when looking for duplicates consider only the contents in each discrete subfolder, not the entire tree. So for each test the population is the contents of the subfolder being processed, not the tree.

So because difPy is invoked in ./ it processes as follows:

At no point does it look across ./ and all its subfolders looking for duplicates. If on the other hand the -i switch is omitted it defaults to its current behaviour of comparing all contents across the entire directory tree.
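To make the difference between the two behaviours concrete, here is a minimal sketch in plain Python. It does not use difPy: candidate_pairs, its per_folder argument, and the album folder names are hypothetical and only illustrate which files would be compared with each other in each mode.

```python
from collections import defaultdict
from itertools import combinations
from pathlib import PurePosixPath


def candidate_pairs(paths, per_folder=False):
    """Return the pairs of files that would be compared with each other.

    per_folder=False: every file is paired with every other file (union).
    per_folder=True:  files are only paired with files in the same folder.
    """
    if not per_folder:
        return list(combinations(sorted(paths), 2))

    # Group files by their parent folder, then pair up within each group only.
    by_folder = defaultdict(list)
    for path in paths:
        by_folder[PurePosixPath(path).parent].append(path)

    pairs = []
    for folder in sorted(by_folder):
        pairs.extend(combinations(sorted(by_folder[folder]), 2))
    return pairs


# Hypothetical tree: two folders that each hold the same two file names.
paths = [
    "collection/album_a/folder.jpg",
    "collection/album_a/xfolder.jpg",
    "collection/album_b/folder.jpg",
    "collection/album_b/xfolder.jpg",
]

union_pairs = candidate_pairs(paths)
folder_pairs = candidate_pairs(paths, per_folder=True)
```

Here union_pairs holds all 6 pairs, including the comparisons across album_a and album_b, while folder_pairs holds only the 2 pairs that sit inside the same folder.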

elisemercury commented 1 year ago

Hi again @audiomuze,

Please excuse the confusion! Now I got your point - thanks for clarifying again! :-)

Unfortunately no, difPy does not support that natively. What you could do instead: run difPy within a for loop on the folders you want to inspect and store the search.result of each run in a new collection, for instance:

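from difPy import dif

# Placeholder paths: replace with the folders you want to inspect
dir_list = ["subfolder1", "subfoldera", "subfolderb", "subfolder2", "subfolderx"]

new_result = {}
for folder in dir_list:
    search = dif(folder)  # run a separate search for each folder
    new_result[folder] = search.result  # store that folder's results under its path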

You will end up with a new collection that will hold the search results for each subfolder:

{ subfolder1 : {'1234' : ...,
                '5678' : ... },
  subfoldera : {'4321' : ...,
                '8765' : ... },
 ... }

Just tested it myself and it works like a charm :-) Hope this helps!

All the best, Elise
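For a tree with thousands of folders, the folder list in the loop above would have to be built automatically. A minimal sketch of one way to do that with os.walk follows. The helper names find_image_folders and search_each_folder, the extension set, and the search_fn callback are all assumptions made for illustration; none of them are part of difPy.

```python
import os

# Assumption: which extensions count as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tiff", ".webp"}


def find_image_folders(root):
    """Return every folder under root (root included) that directly contains an image."""
    folders = []
    for current_dir, _subdirs, filenames in os.walk(root):
        if any(os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS for name in filenames):
            folders.append(current_dir)
    return sorted(folders)


def search_each_folder(root, search_fn):
    """Call search_fn once per image folder and collect the results keyed by folder."""
    return {folder: search_fn(folder) for folder in find_image_folders(root)}
```

With difPy installed, search_fn could be something like lambda folder: dif(folder, recursive=False).result. Turning off the recursive parameter mentioned earlier in the thread is an assumption about how to keep nested subfolders out of their parent's search; with recursion left on, each call would again compare a folder together with everything below it.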

audiomuze commented 1 year ago

Thanks @elisemercury, I'll code a script, but going down this path means I lose the benefits and ease of using difPy's CLI to get the job done.

nbd-phd commented 1 year ago

Could you not just incorporate this feature into the command-line interface? Add an option such as -U true/false, where -U specifies whether the searched directories should be treated as a union or not. The Python code could then either collect the directories specified by -r into a single unified search path, or put them into a list and process them one by one in a for loop. This would be very helpful because I want to mass-search directories on my computer that I know do not contain duplicates between folders but almost certainly contain duplicates within each folder, local to that folder only.
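A rough sketch of how such a switch could be wired up with argparse is shown below. This is not difPy's actual CLI: the -U/--union option follows the suggestion above, and build_parser, plan_searches, and the positional directories argument are hypothetical names used only for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of a union / per-folder switch.")
    parser.add_argument("directories", nargs="+", help="directories to search")
    parser.add_argument(
        "-U", "--union",
        choices=["true", "false"],
        default="true",
        help="true: search all directories as one population; false: one at a time",
    )
    return parser


def plan_searches(directories, union):
    """Return a list of search groups; each group would be searched in isolation."""
    if union:
        return [list(directories)]
    return [[directory] for directory in directories]


# Example invocation with hypothetical directory names.
args = build_parser().parse_args(["-U", "false", "albums/a", "albums/b"])
groups = plan_searches(args.directories, union=(args.union == "true"))
```

With -U false, groups contains one single-directory group per folder, so each folder would get its own search. With the default of true, all directories end up in one group, which matches the current union behaviour.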

audiomuze commented 1 year ago

Not having this option basically renders dif useless to any users wanting to trawl an entire hard drive or directory tree but limit matching to what is in each folder. I'm not sure I understand the resistance to enhancing dif to cater to both needs. Anyone with, for example, a large music collection and various iterations of cover art would find this useful, but is unlikely to want the entire directory tree processed as a single population, because some albums come in different releases and would legitimately have the same cover art.

elisemercury commented 1 year ago

Hi @audiomuze,

Thanks for your feedback. I will be looking into this in the near future and see whether this is a feasible feature. Let me come back to you soon.

Thanks and best, Elise

audiomuze commented 11 months ago

Hi @elisemercury, have you had a chance to assess the feasibility of implementing this capability?

elisemercury commented 10 months ago

Hi @audiomuze,

Please excuse the delay in coming back to you on this request! This feature will be added in the next difPy version which will be released shortly.

Thank you and best, Elise

elisemercury commented 10 months ago

Hi @audiomuze,

I'm happy to let you know that this feature has been implemented in difPy v4. Check out the in_folder parameter. Thanks for your suggestion!

All the best, Elise