arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0

RecursionError: maximum recursion depth exceeded while calling a Python object #1188

Open rolltidehero opened 8 months ago

rolltidehero commented 8 months ago

Describe the bug: When I run a scan, after about 10,000 files it fails with the following error:

Application Name: dupeGuru
Version: 4.3.1
Python: 3.8.13
Operating System: Windows-10-10.0.22635-SP0

Traceback (most recent call last):
  File "hscommon\gui\progress_window.py", line 107, in pulse
  File "core\app.py", line 332, in _job_error
  File "hscommon\jobprogress\performer.py", line 46, in _async_run
  File "core\app.py", line 816, in do
  File "core\directories.py", line 187, in get_files
  File "core\directories.py", line 108, in _get_files
  File "core\directories.py", line 108, in _get_files
  File "core\directories.py", line 108, in _get_files
  [Previous line repeated 980 more times]
  File "core\directories.py", line 118, in _get_files
  File "core\fs.py", line 404, in get_file
  File "core\fs.py", line 200, in __init__
  File "pathlib.py", line 1042, in __new__
  File "pathlib.py", line 683, in _from_parts
  File "pathlib.py", line 676, in _parse_args
  File "pathlib.py", line 69, in parse_parts
RecursionError: maximum recursion depth exceeded while calling a Python object

Additional context: I run a standard scan, looking only for filename matches.
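
The repeated `_get_files` frames in the traceback suggest the directory walk recurses deeply enough to hit Python's default recursion limit of 1000 frames. The sketch below is not dupeGuru's actual code, just a minimal illustration of the general fix: traversing the tree with an explicit stack instead of recursive calls (when running from source, sys.setrecursionlimit() could also be raised as a stopgap, though that only postpones the error).

# Illustrative sketch only -- not dupeGuru's code. It shows the general
# technique of replacing a recursive directory walk (one call frame per
# nesting level, like the repeated _get_files frames above) with an
# explicit stack, so traversal depth is no longer bound by the
# interpreter's recursion limit.
import os
from pathlib import Path

def iter_files(root):
    """Yield every regular file under `root` without recursion."""
    stack = [Path(root)]
    while stack:
        directory = stack.pop()
        try:
            entries = list(os.scandir(directory))
        except OSError:
            continue  # unreadable directory: skip it, as a scanner typically would
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                stack.append(Path(entry.path))
            elif entry.is_file(follow_symlinks=False):
                yield Path(entry.path)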

AndroYD84 commented 3 months ago

In my experience, dupeGuru throws this error when a single file has 1000 or more duplicates (I don't know the exact number). The way I worked around it is to use https://github.com/qarmin/czkawka (I'm using czkawka_gui_gtk_46) to run the duplicates scan and save the results to a .json file, then use the Python script below to load that .json and convert it into a .dupeguru file.

import json
import xml.etree.ElementTree as ET

def convert_json_to_xml(json_file, xml_file):
    # Read JSON data from the input file
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Create the root element of the XML document
    results = ET.Element("results")

    # Iterate over the data and create XML structure
    for size_group in data.values():
        for group in size_group:
            group_element = ET.SubElement(results, "group")
            for file in group:
                file_element = ET.SubElement(group_element, "file")
                file_element.set("path", file["path"])
                file_element.set("words", "")
                file_element.set("is_ref", "n")
                file_element.set("marked", "n")

    # Create an ElementTree object and write it to the XML file
    tree = ET.ElementTree(results)
    tree.write(xml_file, encoding='utf-8', xml_declaration=True)

# Convert JSON to XML
convert_json_to_xml('czkawka_duplicates.json', 'dupeguru_duplicates.dupeguru')
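
For reference, the script above assumes czkawka's duplicate export is a JSON object keyed by file size, where each value is a list of groups and each group is a list of entries carrying at least a "path" field; that shape is inferred from the conversion loop itself, not from czkawka's documentation. Roughly:

# Assumed shape of czkawka_duplicates.json (illustrative sample inferred from
# the conversion loop above -- not real czkawka output):
sample = {
    "1048576": [
        [
            {"path": "C:/photos/a/IMG_0001.jpg"},
            {"path": "C:/photos/b/IMG_0001.jpg"},
        ],
    ],
}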

Then use this second Python script to remove, from the converted .dupeguru file, every group of duplicates containing more than 999 files.

import xml.etree.ElementTree as ET

def remove_large_groups(input_file, output_file, max_items=999):
    # Parse the input XML file
    tree = ET.parse(input_file)
    root = tree.getroot()

    # Iterate over the groups and remove those with more than max_items files
    for group in root.findall('group'):
        if len(group.findall('file')) > max_items:
            root.remove(group)

    # Write the modified XML to the output file
    tree.write(output_file, encoding='utf-8', xml_declaration=True)

if __name__ == "__main__":
    input_file = "dupeguru_duplicates.dupeguru"
    output_file = "dupeguru_duplicates_cleaned.dupeguru"
    remove_large_groups(input_file, output_file)

Now you should be able to import the cleaned result into dupeGuru just fine.
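
If you want to double-check before importing, a short sanity check (reusing ElementTree and the output file name from the script above) reports the largest group left in the cleaned file:

import xml.etree.ElementTree as ET

# Sanity check: confirm no oversized groups remain. Assumes the output file
# name used in the filtering script above.
tree = ET.parse("dupeguru_duplicates_cleaned.dupeguru")
groups = tree.getroot().findall("group")
largest = max((len(g.findall("file")) for g in groups), default=0)
print(f"{len(groups)} groups remain; the largest has {largest} files")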