rolltidehero opened 8 months ago
In my experience, dupeGuru throws this error when a single file has 1,000 or more duplicates (I don't know the exact number). The way I worked around it is to use https://github.com/qarmin/czkawka (I'm using czkawka_gui_gtk_46) to run the duplicates scan and save the results to a .json file, then use this Python script to load that .json and convert it into a .dupeguru file.
```python
import json
import xml.etree.ElementTree as ET

def convert_json_to_xml(json_file, xml_file):
    # Read JSON data from the input file
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Create the root element of the XML document
    results = ET.Element("results")

    # Iterate over the data and create the XML structure
    for size_group in data.values():
        for group in size_group:
            group_element = ET.SubElement(results, "group")
            for file in group:
                file_element = ET.SubElement(group_element, "file")
                file_element.set("path", file["path"])
                file_element.set("words", "")
                file_element.set("is_ref", "n")
                file_element.set("marked", "n")

    # Create an ElementTree object and write it to the XML file
    tree = ET.ElementTree(results)
    tree.write(xml_file, encoding='utf-8', xml_declaration=True)

# Convert JSON to XML
convert_json_to_xml('czkawka_duplicates.json', 'dupeguru_duplicates.dupeguru')
```
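For reference, the converter assumes the czkawka export maps each file size to a list of duplicate groups, where every group is a list of entries carrying at least a "path" key; that shape is inferred from the script itself, not from czkawka's documented schema. Something like this sketch (with made-up paths) exercises it:

```python
import json

# Hypothetical sample mirroring the assumed shape: top-level values are
# lists of groups, each group a list of file entries with a "path" key.
sample = {
    "1048576": [
        [{"path": "/data/a/photo.jpg"}, {"path": "/data/b/photo.jpg"}],
    ],
}

with open('czkawka_duplicates.json', 'w', encoding='utf-8') as f:
    json.dump(sample, f)

# Reuse the converter defined above (same file or same session)
convert_json_to_xml('czkawka_duplicates.json', 'dupeguru_duplicates.dupeguru')
```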
Then use this other Python script to strip, from the converted .dupeguru file, every group of duplicates with more than 999 items.
```python
import xml.etree.ElementTree as ET

def remove_large_groups(input_file, output_file, max_items=999):
    # Parse the input XML file
    tree = ET.parse(input_file)
    root = tree.getroot()

    # Iterate over the groups and remove those with more than max_items files
    for group in root.findall('group'):
        if len(group.findall('file')) > max_items:
            root.remove(group)

    # Write the modified XML to the output file
    tree.write(output_file, encoding='utf-8', xml_declaration=True)

if __name__ == "__main__":
    input_file = "dupeguru_duplicates.dupeguru"
    output_file = "dupeguru_duplicates_cleaned.dupeguru"
    remove_large_groups(input_file, output_file)
```
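As a quick sanity check, something like this can confirm that no group in the cleaned file still exceeds the 999-item cap:

```python
import xml.etree.ElementTree as ET

# Count the files in each remaining group of the cleaned output;
# every group should now hold at most 999 entries.
root = ET.parse("dupeguru_duplicates_cleaned.dupeguru").getroot()
sizes = [len(group.findall("file")) for group in root.findall("group")]
print(f"{len(sizes)} groups kept; largest has {max(sizes) if sizes else 0} files")
assert all(n <= 999 for n in sizes), "a group still exceeds the limit"
```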
Now you should be able to import the cleaned result into dupeGuru just fine.
**Describe the bug**
When I run a scan, after about 10K files it produces the following error:
**Desktop (please complete the following information):**
**Additional context**
I run a standard scan and just look for filename matches.