BestImageViewer / geeqie

claiming to be the best image viewer / photo collection browser
http://www.geeqie.org/
GNU General Public License v2.0

Command-line utility for duplicate image finding and processing #1520

Open porridge opened 1 month ago

porridge commented 1 month ago

Setup (please complete the following information):

What is your question

Would you be open to shipping an additional binary that reuses the code in similar.cc for comparing images, but is meant for automated or semi-automated workflows to find and eliminate duplicate images?

Additional context

I find the duplicate image logic in geeqie to be of excellent quality. 🤩

However, the GUI is not convenient or flexible enough for my use case. I was looking for a tool such as findimagedupes, which makes it possible to plug in code that decides which files in a set to delete; however, I found its comparison logic to be less than stellar.

After dusting off my C++ knowledge (and commenting out a few lines in similar.cc), I managed to come up with an alternative frontend to your code that, although a bit naive/inefficient, works exactly as I imagined and has started saving me literally hours of time per year.

This code currently weighs in at 155 lines of C++, plus 50 or so lines of example Python code that automates the decision of whether to delete files or ask for confirmation.

I would like to share it with others, and I can either:

  1. start my own project and copy similar.* over with minimal modifications, or
  2. ask you to add it to geeqie 😊

I'm asking this preliminary question first, rather than sending a PR straight away, because I would need to spend some more time on proper option processing before sending a well-formed PR. I'm happy to take a stab at this, but would rather avoid doing so if the answer is "no" anyway 😄

caclark commented 1 month ago

From my point of view, adding extra command line options is not a problem. If people are going to use such a feature, it would be better for them to do it via Geeqie rather than via another project.

Although there will be a configuration option to disable the feature if it is implemented, I am not sure about the implications of making Python a dependency. Perhaps @gusnan has an opinion.

The only Python dependency in the project so far is for running tests, so it is not part of the distribution.

porridge commented 1 month ago

Sorry, I should have pointed out that the Python part is completely optional. Basically how the main program works is that it runs some command once for each group of duplicates (passing them as arguments). By default the command is just echo but you can specify anything you like.
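The "run a command once per group of duplicates" behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual C++ frontend: the function name `run_for_each_group` and the shape of the `groups` input are assumptions, and the real tool obtains its groups from the image comparison code.

```python
import subprocess
from typing import Iterable, List, Sequence


def run_for_each_group(
    groups: Iterable[Sequence[str]],
    command: Sequence[str] = ("echo",),
) -> List[int]:
    """Run `command` once per duplicate group, with the group's files as arguments.

    Returns the exit code of each invocation, in order.
    `groups` is a hypothetical input: one sequence of file names per set of duplicates.
    """
    exit_codes = []
    for group in groups:
        if len(group) < 2:
            # A single file has no duplicates, so there is nothing to decide.
            continue
        # Pass an argument list (no shell), so spaces in file names are safe.
        result = subprocess.run([*command, *group], check=False)
        exit_codes.append(result.returncode)
    return exit_codes


# Example call with made-up file names (requires an `echo` executable on PATH):
# run_for_each_group([["a.jpg", "a-copy.jpg"], ["b.png", "b-small.png"]])
```

With the default `echo` command this would simply print each group on its own line; any other program that accepts file names as arguments can be substituted.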

I just happened to use Python to create a script that accepts the filenames as parameters and, if their names match certain patterns (in my case this hints at their source), deletes the smaller files without asking for confirmation.
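A decision script of the kind described here might look roughly like the sketch below. It is not the script from this thread: the file name pattern, the rule that every smaller file must match before anything is deleted automatically, and the names `files_to_delete` and `main` are all assumptions made for illustration.

```python
import os
import re
from typing import List, Optional, Sequence

# Hypothetical pattern: names that look like a messaging-app export.
AUTO_DELETE_PATTERN = re.compile(r"^IMG-\d{8}-WA\d+\.jpe?g$", re.IGNORECASE)


def files_to_delete(
    paths: Sequence[str],
    pattern: "re.Pattern[str]" = AUTO_DELETE_PATTERN,
) -> Optional[List[str]]:
    """Return the files that may be deleted without asking, or None if a human should decide.

    Assumed policy: keep the largest file; delete the others automatically
    only when every one of them matches `pattern`.
    """
    if len(paths) < 2:
        return []
    largest = max(paths, key=os.path.getsize)
    smaller = [p for p in paths if p != largest]
    if all(pattern.match(os.path.basename(p)) for p in smaller):
        return smaller
    return None


def main(argv: Sequence[str]) -> int:
    """Handle one duplicate group given as file names; return a process exit code."""
    doomed = files_to_delete(argv)
    if doomed is None:
        # Leave the files alone and report the group for manual review.
        print("needs confirmation:", *argv)
        return 1
    for path in doomed:
        os.remove(path)
    return 0
```

Used as the per-group command, such a script would be invoked as `main(sys.argv[1:])`. Returning a non-zero exit code for undecided groups is one way to let the caller collect them for a visual check.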

porridge commented 1 month ago

FTR, here is the standalone version. I'll try to come up with a PR against geeqie soon. I understand you'd prefer to make this a mode of geeqie itself, rather than a separate executable? 🤔

caclark commented 1 month ago

> I understand you'd prefer to make this a mode of geeqie itself, rather than a separate executable?

It is your choice if you want a separate project/executable. I think it would be useful if it were available within Geeqie.

main.cc and remote.cc are probably the only files affected.

Please note CODING.md, but ignore the bits that everyone else ignores. When you make a pull request to Geeqie on GitHub, static tests are automatically run. You can run the tests locally first by running ./scripts/test-all.sh

The Help file/User Manual will need an update. I can do that.

[I have noticed that very occasionally the similarity algorithm gets it completely wrong. I am not sure I would use an automatic delete without a visual check. But everyone has a choice....]

porridge commented 1 month ago

> > I understand you'd prefer to make this a mode of geeqie itself, rather than a separate executable?

> It is your choice if you want a separate project/executable. I think it would be useful if it were available within Geeqie.

I agree about integrating it in the Geeqie project. I was asking whether it should be an option within /usr/bin/geeqie or a separate executable (e.g. /usr/bin/geeqie-duplicate-finder or such).

caclark commented 1 month ago

Ah, sorry. I was assuming an additional option:

    geeqie --remote --duplicates-program=<prog name> --duplicates-threshold=<n> --duplicates-files=<file list>
    geeqie -r -p <prog name> -t <n> -m <file list>

Maybe the -m option is not necessary - just take whatever file list is appended on the command line.
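To illustrate the "take whatever file list is appended" idea, here is a small Python `argparse` sketch of the proposed option shape. It is not Geeqie's command-line code (which is C++): the program name `duplicates-sketch`, the integer type of the threshold, and the `echo` default are assumptions, and only the option names come from the proposal above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="duplicates-sketch")
    parser.add_argument(
        "-p", "--duplicates-program", default="echo", metavar="PROG",
        help="command to run once per group of duplicates",
    )
    parser.add_argument(
        # Assumption: the threshold is an integer; None means "use the built-in default".
        "-t", "--duplicates-threshold", type=int, default=None, metavar="N",
        help="similarity threshold",
    )
    # Everything left over on the command line is treated as the file list,
    # so no separate -m/--duplicates-files option is needed.
    parser.add_argument("files", nargs="*", metavar="FILE")
    return parser


# Example: parse a made-up command line.
# args = build_parser().parse_args(["-p", "mytool", "-t", "90", "a.jpg", "b.jpg"])
# args.files is then the remaining file list.
```

The design point is that positional arguments are collected after all options have been consumed, so the file list does not need an option of its own.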

Geeqie now has bash command line completion which makes things easier with long-options.

[I am working on making Geeqie a GtkApplication, which will make command line processing easier. The --remote option will disappear.]

Keeping everything together makes distribution easier.

virtadpt commented 1 month ago

@caclark I was really hoping it would be implemented this way.

porridge commented 1 month ago

> Maybe the -m option is not necessary - just take whatever file list is appended on the command line.

I am having trouble figuring out how to get access to this "remainder" file list from within the callback I added for --duplicates-program. 🤔 Can you maybe provide some hints?

caclark commented 1 month ago

The command line processing is a mess...

The attached diff might help you to get something to work in some way or other, but it is not really suitable to put in the repo.

Run ./build/src/geeqie from one terminal window, and from a second terminal window run:

    ./build/src/geeqie -r <some file list>
    ./build/src/geeqie -r -m

I will work on making the command line processing more logical.

remote-duplicates-1.diff.gz