elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate and similar images
https://difpy.readthedocs.io
MIT License

Enhancement: ignore files where file extension is not of a known image type #60

Closed: audiomuze closed this issue 1 year ago

audiomuze commented 1 year ago

At present difPy evaluates every file in the folders it's pointed to, ascertaining whether or not each one is an image. When processing folders that contain many files, most of which are not images, the process takes a lot longer than it needs to, as every file is assessed.

It would be very useful to be able to call dif with a parameter that tells it to only compare files whose extension is a known image type, thus ignoring all files with extensions that are not associated with images.

elisemercury commented 1 year ago

Hi @audiomuze,

Thanks a lot for the suggestion! Indeed I think this could speed up the process a bit and make difPy more efficient. This will be considered for the next update!

Again thank you and best, Elise

elisemercury commented 1 year ago

@UplandsDynamic, thanks for looking into this issue!

I looked into it as well as part of the update for v3.0.9, but came to the preliminary conclusion that the current implementation might actually be the most accurate one. Let me explain why:

Pillow has no way (at least I didn't find any, maybe you will?) of programmatically listing which filetypes it supports. Therefore, we would need to manually define which filetypes difPy would attempt to decode and which it would not. That list would have to be updated by hand, and should Pillow add or remove a filetype without difPy being updated, files could be skipped that Pillow could actually have decoded successfully.
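On the question of querying Pillow: recent Pillow releases do appear to expose the registered extensions through `PIL.Image.registered_extensions()`, which might avoid a hand-maintained list. Below is a minimal sketch under that assumption. The names `supported_extensions`, `is_candidate` and `FALLBACK_EXTENSIONS` are hypothetical and not part of difPy, and the fallback set is only there so the sketch still runs without Pillow installed.

```python
from pathlib import Path

# Hypothetical fallback, used only if Pillow cannot be imported (an assumption,
# not a list taken from difPy or Pillow).
FALLBACK_EXTENSIONS = frozenset(
    {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tiff", ".webp"}
)


def supported_extensions():
    """Return the lowercase extensions Pillow reports it can open."""
    try:
        from PIL import Image
    except ImportError:
        return FALLBACK_EXTENSIONS
    # registered_extensions() maps e.g. ".png" -> "PNG" and loads all plugins.
    # Image.OPEN holds the formats that have a reader, so save-only formats
    # are dropped here.
    return frozenset(
        ext.lower()
        for ext, fmt in Image.registered_extensions().items()
        if fmt in Image.OPEN
    )


def is_candidate(path, extensions):
    """True if the file's extension is in the given set (case-insensitive)."""
    return Path(path).suffix.lower() in extensions
```

With something like this, the set of accepted extensions would follow whatever Pillow version is installed, rather than a list that has to be kept in sync by hand.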

I'm happy to hear if you have another idea or approach that could be implemented here. I think it can be very useful, we just need to find what is the most efficient way to do so. 🙂

Thanks and best, Elise

UplandsDynamic commented 1 year ago

@elisemercury Ah I see your thinking, yes that could be a problem.

I've almost finished implementing that since my last message (it didn't take long), and was about to test. I hardcoded a tuple of common extensions, which would necessarily need to be updated as and when Pillow changes - unless, as you say, Pillow could be programmatically queried for that information (which would be very useful!).

In my implementation, the user may select the option to filter out 'invalid' extensions, using an optional input arg, as per the other features. If that boolean is set to True, the files that do not have a valid extension are removed from play during the directory scanning process. I defined a filter method in the _help class that returns a list of 'valid' files and a list of skipped files; the skipped files are then added to the output stats (either just a count, or a list of skipped files, if verbose logging is on).
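The approach described above could look roughly like the sketch below. This is not the code from the pull request: `IMAGE_EXTENSIONS`, `filter_by_extension`, `skipped_stats` and their parameters are hypothetical names chosen for illustration, and the extension tuple is an assumed example rather than difPy's actual list.

```python
from pathlib import Path

# Hypothetical hardcoded tuple of common image extensions (an assumption).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tiff", ".webp")


def filter_by_extension(paths, enabled=True, extensions=IMAGE_EXTENSIONS):
    """Split paths into (valid, skipped) lists based on file extension.

    When enabled is False, nothing is filtered and every path is kept.
    """
    if not enabled:
        return list(paths), []
    valid, skipped = [], []
    for path in paths:
        # Compare case-insensitively so "a.JPG" matches ".jpg".
        if Path(path).suffix.lower() in extensions:
            valid.append(path)
        else:
            skipped.append(path)
    return valid, skipped


def skipped_stats(skipped, verbose=False):
    """Build the stats entry: always a count, plus the file list if verbose."""
    stats = {"skipped_count": len(skipped)}
    if verbose:
        stats["skipped_files"] = list(skipped)
    return stats
```

For example, `filter_by_extension(["a.JPG", "b.txt"])` would keep `a.JPG` and skip `b.txt`, while passing `enabled=False` keeps the current behaviour of handing every file to the decoder.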

I'll go ahead and finish testing and submit a pull request with what I've already done, then have a think about the best way to address the point you raised. As I say, it didn't take long to code, so if you decide not to merge, then no worries! You may well have a better approach to implementation in any case, but my effort will be there if you want it.

Cheers, Dan

audiomuze commented 1 year ago

@UplandsDynamic I think your approach is 100% appropriate. Having the ability to select whether or not to process * gives one the best of both worlds - analyze everything or analyze only known file extensions. I know which switch I'd be choosing.

elisemercury commented 1 year ago

Hi @audiomuze,

Thanks for your feedback - this is very helpful. The feature has been implemented in v3.0.10. Thanks again @UplandsDynamic for your contributions.

Best, Elise