elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License

Key error during union search if any invalid files #96

Open sna-scourtney opened 1 month ago

sna-scourtney commented 1 month ago

Thanks for creating this wonderful tool. I'm using it to deduplicate photo albums going back to my days with 35mm film. Simple file matching tools won't find photos that were accidentally scanned more than once.

I've found a probable bug, and I'm reporting the bug and a workaround.

In the file dif.py, the function _build_image_dictionaries() has this code at about line 182:

file_nums = [(i, valid_files[i]) for i in range(len(valid_files))]

Just after that, there is logic that checks each file: invalid files are recorded, and valid files are added to the dictionaries. Invalid files are never added to the dictionaries, yet each one still consumes a file number.

As a result, there can be gaps in the file numbers. The build process works fine, but the union search phase raises a KeyError. When I first encountered this, I thought it must be a duplicate key, but it's actually a missing key.
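
A minimal sketch of how such numbering gaps lead to a missing key. The function and its parameters (build_tensor_dictionary, is_valid) are hypothetical stand-ins for illustration, not difPy's actual code; only the file_nums line mirrors the snippet quoted above.

```python
# Hypothetical stand-in for the numbering described above; not difPy's real code.
def build_tensor_dictionary(files, is_valid):
    """Number every file up front, but store entries only for valid files."""
    file_nums = [(i, files[i]) for i in range(len(files))]
    tensor_dictionary = {}
    invalid_files = []
    for num, name in file_nums:
        if is_valid(name):
            tensor_dictionary[num] = f"tensor({name})"
        else:
            # The number is consumed, but no dictionary entry is stored.
            invalid_files.append(name)
    return tensor_dictionary, invalid_files


files = ["a.jpg", "broken.jpg", "c.jpg"]
tensors, invalid = build_tensor_dictionary(files, lambda name: name != "broken.jpg")

# Any ID in the full range that has no entry is a gap.
missing_ids = [i for i in range(len(files)) if i not in tensors]
print("invalid files:", invalid)
print("missing IDs:", missing_ids)

# A later stage that assumes IDs are contiguous fails on the gap.
try:
    tensors[1]
except KeyError as err:
    print("KeyError:", err)
```

The lookup at the end plays the role of the comparison step in the search phase: it assumes every number in the range has an entry, which is no longer true once one file is skipped.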

I added some scaffold code to dump out the filename dictionary and the list of invalid files to an extra scratch log, and I found the numbering gaps.

It's not clear to me whether the correct fix is to stop incrementing the file count for invalid files, or to put dummy items into the dictionaries in their place (while keeping their filenames in the filename dictionary for logging). My workaround has been to clean up or delete the faulty image files; re-running the same operation then succeeds.
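
A rough sketch of the first option, assuming IDs are assigned only after a file has loaded successfully so the keys stay contiguous. All names here (build_dense_dictionaries, load, fake_load) are hypothetical and are not taken from difPy; this is one possible approach, not a proposed patch.

```python
# Hypothetical sketch: assign an ID only once a file has loaded successfully.
def build_dense_dictionaries(files, load):
    filename_dictionary = {}
    tensor_dictionary = {}
    invalid_files = {}
    next_id = 0
    for name in files:
        try:
            tensor = load(name)
        except Exception as exc:
            # Record the failure for logging, but do not consume an ID.
            invalid_files[name] = str(exc)
            continue
        filename_dictionary[next_id] = name
        tensor_dictionary[next_id] = tensor
        next_id += 1
    return filename_dictionary, tensor_dictionary, invalid_files


def fake_load(name):
    """Stand-in loader that rejects any file whose name starts with 'broken'."""
    if name.startswith("broken"):
        raise ValueError("cannot decode image")
    return f"tensor({name})"


filenames, tensors, invalid = build_dense_dictionaries(
    ["a.jpg", "broken.jpg", "c.jpg"], fake_load
)
print(filenames)
print(invalid)
```

With this numbering the invalid file still shows up in the log data, but the IDs used for comparison have no gaps. The dummy-item alternative would keep the original numbering and would need the search phase to skip placeholders instead.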

sideshot commented 1 month ago

I had a KeyError because of a few bad files. I moved them out of the folder, and it is working now. They were all under 1 KB.

lakenen commented 2 weeks ago

I think this is the same issue?

difPy preparing files: [100%]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 416, in _find_matches_batch
    tensor_B_list = np.asarray([self.__difpy_obj._tensor_dictionary[x[1]] for x in ids])
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 416, in <listcomp>
    tensor_B_list = np.asarray([self.__difpy_obj._tensor_dictionary[x[1]] for x in ids])
                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: 419
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 921, in <module>
    se = search(dif, similarity=args.similarity, rotate=args.rotate, lazy=args.lazy, processes=args.processes, chunksize=args.chunksize)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 252, in __init__
    self.result, self.lower_quality, self.stats = self._main()
                                                  ^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 266, in _main
    result = self._search_union()
             ^^^^^^^^^^^^^^^^^^^^
  File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 303, in _search_union
    for output in pool.imap_unordered(self._find_matches_batch, self._yield_comparison_group(), self.__chunksize):
  File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 451, in <genexpr>
    return (item for chunk in result for item in chunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
KeyError: 419

I'm running this on a very large set of photos (~1TB), so there certainly could be some "bad" files, if that is the cause. I'm not sure how to go about finding which file(s) caused this. Any suggestions?
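
One possible starting point, based only on the report above that the bad files were all under 1 KB: list unusually small files so they can be inspected or moved aside before re-running. This is a sketch using the standard library only; the function name find_small_files and the 1024-byte default are assumptions, and a corrupt file larger than the threshold would not be caught.

```python
import os


def find_small_files(root, max_bytes=1024):
    """Return sorted (path, size) pairs for files under root smaller than max_bytes."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Unreadable entries (e.g. broken symlinks) are skipped here.
                continue
            if size < max_bytes:
                suspects.append((path, size))
    return sorted(suspects)
```

Calling find_small_files on the photo folder and printing the result gives a candidate list to check by hand. The threshold is a heuristic drawn from a single report, not a rule about what difPy treats as invalid.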