Open Difegue opened 4 years ago
I wrote a simple Perl plugin based on your suggestion:

    package LANraragi::Plugin::Scripts::DuplicateFinder;

    use strict;
    use warnings;
    no warnings 'uninitialized';

    use LANraragi::Utils::Logging qw(get_plugin_logger);
    use LANraragi::Model::Config;

    sub plugin_info {
        return (
            name        => "Duplicate Finder",
            type        => "script",
            namespace   => "duplfind",
            author      => "dixonym",
            version     => "1.0",
            description => "Find potential duplicate archives by comparing thumbnail hashes using Hamming distance.",
            icon        => "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAIAAAAC64paAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAABZSURBVDhPzY5JCgAhDATzSl+e/2irOUjQSFzQog5hhqIl3uBEHPxIXK7oFXwVE+Hj5IYX4lYVtN6MUW4tGw5jNdjdt5bLkwX1q2rFU0/EIJ9OUEm8xquYOQFEhr9vvu2U8gAAAABJRU5ErkJggg==",
            oneshot_arg => "hamming distance threshold (defaults to 5)"
        );
    }

    # Hamming distance function: counts the bit positions at which two hashes differ
    sub hammingdistance {
        # Named $hash_a/$hash_b because $a and $b are reserved for sort()
        my ( $hash_a, $hash_b ) = @_;
        my $distance = 0;

        # Assuming the thumbhashes are hex strings, convert them to binary
        my $binary_a = unpack( "B*", pack( "H*", $hash_a ) );
        my $binary_b = unpack( "B*", pack( "H*", $hash_b ) );

        for ( my $i = 0; $i < length($binary_a); $i++ ) {
            if ( substr( $binary_a, $i, 1 ) ne substr( $binary_b, $i, 1 ) ) {
                $distance++;
            }
        }
        return $distance;
    }

    sub run_script {
        shift;
        my $lrr_info = shift;
        my $logger   = get_plugin_logger();

        my $threshold = $lrr_info->{oneshot_param};

        # Check if the threshold is not set or is an empty string, use default value of 5
        $threshold = 5 if ( !defined($threshold) || $threshold eq '' );

        # Convert the threshold to an integer
        $threshold = int($threshold);
        $logger->info( "Set Hamming distance threshold to " . $threshold );

        my $redis = LANraragi::Model::Config->get_redis;

        # Get all archive IDs (40-character long keys only)
        my @keys = $redis->keys('????????????????????????????????????????');

        # Store thumbhashes
        my %thumbhashes;

        # Collect thumbhashes for all archives
        foreach my $id (@keys) {
            my %hash      = $redis->hgetall($id);
            my $thumbhash = $hash{'thumbhash'};

            # Only consider entries that have a thumbhash
            if ($thumbhash) {
                $thumbhashes{$id} = $thumbhash;
            }
        }

        # Array to store pairs of duplicates
        my @duplicates;

        # Compare each archive thumbhash with others
        foreach my $id1 ( keys %thumbhashes ) {
            foreach my $id2 ( keys %thumbhashes ) {

                # Skip self-comparison and the mirrored pair (id2, id1),
                # so each pair is compared and reported only once
                next if $id1 ge $id2;

                # Calculate Hamming distance
                my $distance = hammingdistance( $thumbhashes{$id1}, $thumbhashes{$id2} );

                # Compare distance to the threshold for considering two hashes as duplicates
                if ( $distance <= $threshold ) {

                    # Log the potential duplicate
                    $logger->info("Found potential duplicate: $id1 and $id2 with distance $distance");

                    # Add the pair to the duplicates list
                    push @duplicates, [ $id1, $id2 ];
                }
            }
        }

        # Return list of pairs of potential duplicates
        return \@duplicates;
    }

    1;

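For what it's worth, the `hammingdistance` function above expands both hashes into strings of `0`/`1` characters and compares them one character at a time. A possibly faster variant is sketched below. It XORs the raw bytes and counts the set bits with `unpack`'s checksum form. The name `hamming_distance_xor` is made up for this sketch, and it assumes both hashes are hex strings of equal length and that `use feature 'bitwise'` is not in effect (so `^` on two strings is a bitwise string XOR).

```perl
use strict;
use warnings;

# Hamming distance between two hex-encoded hashes, computed on raw bytes.
# Hypothetical helper, not part of LANraragi.
sub hamming_distance_xor {
    my ( $hex_a, $hex_b ) = @_;

    # pack("H*") decodes the hex string into raw bytes; ^ on two strings
    # XORs them bytewise, leaving a 1 bit wherever the inputs differ.
    my $diff = pack( "H*", $hex_a ) ^ pack( "H*", $hex_b );

    # "%32b*" is unpack's checksum form: it sums the individual bits,
    # which is the number of set bits in $diff.
    return unpack( "%32b*", $diff );
}

print hamming_distance_xor( "ff00", "0f00" ), "\n";    # prints 4
```

This gives the same count as the loop for equal-length hashes, but does the work in two built-in calls instead of one `substr` comparison per bit.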
It seems to work. I manually checked some of the detected galleries and they are actual duplicates.
But there seems to be a problem with the Minion job. After ~4h the Minion worker "went away". I'm not quite sure why that happens.
---
args:
- duplfind
- 0
- ''
attempts: '1'
children: []
created: 2024-10-14T12:36:16.7228Z
delayed: 2024-10-14T12:36:16.7228Z
finished: 2024-10-14T16:06:47.18225Z
id: '47156'
notes: {}
parents: []
priority: '0'
queue: default
result: Worker went away
retried: ~
retries: '0'
started: 2024-10-14T12:36:16.7325Z
state: failed
task: run_plugin
worker: '157'
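As a side note, the exact run time can be read off the `started` and `finished` fields of the job dump above. A small sketch using the core `Time::Piece` module follows. `parse_utc` is a made-up helper, and fractional seconds are simply dropped, so the result is approximate to the second.

```perl
use strict;
use warnings;
use Time::Piece;

# Parse an ISO 8601 UTC timestamp, ignoring the fractional seconds.
# Hypothetical helper for this sketch.
sub parse_utc {
    my ($stamp) = @_;
    $stamp =~ s/\.\d+Z$//;
    return Time::Piece->strptime( $stamp, "%Y-%m-%dT%H:%M:%S" );
}

my $started  = parse_utc("2024-10-14T12:36:16.7325Z");
my $finished = parse_utc("2024-10-14T16:06:47.18225Z");

# Subtracting two Time::Piece objects yields a Time::Seconds object
my $seconds = ( $finished - $started )->seconds;

printf "%dh %02dm %02ds\n",
    int( $seconds / 3600 ), int( $seconds % 3600 / 60 ), $seconds % 60;
# prints "3h 30m 31s"
```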
My library stats:
27019 Archives on record
4176 Different tags existing
732 GB in content folder
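At this library size the pairwise comparison is the expensive part: if every one of the 27019 archives has a thumbhash, there are 27019 × 27018 / 2 = 364,999,671 unordered pairs to check. A rough sketch of one way to keep the per-pair cost low is below. It visits each unordered pair exactly once and decodes every hash only once. `find_close_pairs` and the sample IDs and hashes are made up for illustration, and the hashes are assumed to be equal-length hex strings.

```perl
use strict;
use warnings;

# Return [id1, id2, distance] for every unordered pair within $threshold.
# %$hashes maps an archive ID to its hex-encoded thumbnail hash.
# Hypothetical helper, not part of LANraragi.
sub find_close_pairs {
    my ( $hashes, $threshold ) = @_;

    my @ids = sort keys %$hashes;

    # Decode each hex hash to raw bytes once, not once per comparison
    my @packed = map { pack( "H*", $hashes->{$_} ) } @ids;

    my @pairs;
    for my $i ( 0 .. $#ids - 1 ) {

        # Starting at $i + 1 skips self-comparison and mirrored pairs
        for my $j ( $i + 1 .. $#ids ) {

            # XOR the bytes, then count the set bits
            my $distance = unpack( "%32b*", $packed[$i] ^ $packed[$j] );
            push @pairs, [ $ids[$i], $ids[$j], $distance ]
                if $distance <= $threshold;
        }
    }
    return \@pairs;
}

# Made-up IDs and hashes, for illustration only
my %sample = ( aaa => "ff00", bbb => "ff01", ccc => "00ff" );
my $pairs  = find_close_pairs( \%sample, 5 );
printf "%s %s %d\n", @$_ for @$pairs;    # prints "aaa bbb 1"
```

The loop is still quadratic in the number of archives, so this only shrinks the constant factor; it does not by itself explain or fix the worker going away.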
All in all it seems quite a hassle to check for duplicates this way. IMO a better approach would be to integrate duplicate checking directly into the UI, similar to stashapp.
This one could be pretty fun, I think.
The script should go through the entire archive list and return a list of potential duplicate pairs at the end.
I see two potential ways to detect dupes:
1. Compare existing thumbnail hashes computed by LRR:
   The hashes already exist in the database since they're used for reverse image searches. This would be the easiest and fastest way to go. Here's some example code I got from who knows where:
2. Re-extract thumbnails and compare them in detail using a package like https://github.com/runarbu/PerlImageHash.
   This would be super expensive computationally speaking, but if the first way doesn't yield decent results I don't see any other solution.