elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate and similar images
https://difpy.readthedocs.io
MIT License

MemoryError #80

Status: Closed (by radry, 1 month ago)

radry commented 1 year ago

Apparently there is no memory limit built in, and difPy will eat as much memory as it can get from Windows. "Preparing Files" completes fine, but the difference search eats a lot of memory. My Windows is set up to automatically manage the pagefile and will happily enlarge it until the drive it's located on is full. When that happens, the following error appears:

Traceback (most recent call last):
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 642, in <module>
    se = search(dif, similarity=args.similarity)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 242, in __init__
    self.result = self._main()
                  ^^^^^^^^^^^^
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 284, in _main
    items.append((id_a, id_b, self.__difpy_obj._tensor_dictionary[id_a], self.__difpy_obj._tensor_dictionary[id_b]))
MemoryError

The program continues to run, but then stops after a few minutes without any further error. No log file is produced.

How to reproduce: run difPy from the command line (similarity s=90) on a directory with ~60,000 images, on a machine with 16 GB RAM and limited hard drive space for the pagefile. Let Windows manage the pagefile size. Wait for "Preparing Files" to complete (this takes an hour or so).
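Equivalently via the Python API, the repro is roughly the following (a sketch against the difPy 4.0.x API; the directory path is a placeholder):

import difPy

# "Preparing Files" phase: decodes the images and builds the tensors in memory
dif = difPy.build("C:/path/to/images")

# Difference search with an MSE threshold of 90 -- this is the phase
# where memory usage balloons and the MemoryError is raised
se = difPy.search(dif, similarity=90)
print(se.result)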

I don't know what would happen if the pagefile had a fixed size. I assume the same error would appear.

System: Windows 10, 16 GB RAM, difPy 4.0.1

breengles commented 1 year ago

Same here on Ubuntu, even with 64 GB of RAM.

KalyaSc commented 1 year ago

Same here on Windows 10, 32 GB RAM, difPy 4.0.1:

Process SpawnPoolWorker-21:
Traceback (most recent call last):
  File "C:\Python311\Lib\multiprocessing\pool.py", line 131, in worker
    put((job, i, result))
  File "C:\Python311\Lib\multiprocessing\queues.py", line 371, in put
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python311\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Python311\Lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python311\Lib\multiprocessing\pool.py", line 134, in worker
    util.debug("Possible encoding error while sending result: %s" % (
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError

It's still running at 94-98% memory usage. Is it going to finish, or should I just stop stressing my RAM? @breengles @radry

breengles commented 1 year ago

@KalyaSc I was not able to get it running on a large image collection, so I have switched to other tools for now. Hopefully a solution will be found soon.

elisemercury commented 11 months ago

Hi @radry

Thanks a lot for flagging this issue. This is indeed not intended behaviour, and a fix will be implemented in the upcoming difPy release. Stay tuned!

Thanks again and best, Elise

elisemercury commented 9 months ago

Hi all,

difPy v4.1.0 now comes with improved handling of larger datasets, see the guide.

Additionally, the new version lets you adjust the number of processes used when multiprocessing, so to reduce memory overhead you can now manually lower this value. Previously, difPy always used os.cpu_count(). For more details, I recommend checking the updated documentation of this feature. Keep in mind, though, that lowering the number of simultaneous processes also reduces performance, so computation times will be longer.
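In code, lowering the process count looks roughly like this (a minimal sketch; the parameter name processes follows the v4.1 documentation, and the path and the value 4 are placeholders to adapt to your setup):

import difPy

dif = difPy.build("C:/path/to/images")

# Use 4 worker processes instead of the default os.cpu_count();
# fewer simultaneous workers means less memory overhead, but longer runtimes
se = difPy.search(dif, similarity=90, processes=4)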

Let me know if this helps or if you're still encountering issues.

Best, Elise