Open Get-Ryan opened 1 year ago
I have the same wish; it would let me make better use of difPy in my projects.
@elisemercury, just noticed this issue and had a quick look to see how it might be implemented.
I've not coded or tested this yet. I just did a very quick code review and noted down the idea, so it may not work (and may well contain idiotic mistakes!). But if you think it's a valid approach and want to add this feature, let me know and I'll code it, test it, and open a pull request.
1. Add an input param to take a list of directories where files should only be checked against files located outwith directories in this list, and assign its value to a new dif class parameter, e.g., `self.dupe_free_dirs`.
2. Change the `_search.exclude_from_search` variable to a dif class parameter and pass that in as an arg to both the `_search.matches` and `_compute.id_by_location` methods.
3. Amend line 339 to `if (number_B > number_A) and id_B not in self.exclude_from_search`.
4. Use the existing directory for-loop in `_compute.id_by_location` to check the file locations against directories stored in `self.dupe_free_dirs`. If found, add the file ID (once created) to `self.exclude_from_search`.
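The steps above could be sketched roughly as follows. This is a standalone illustration, not difPy code: `dupe_free_dirs` follows the name suggested above, while `in_dupe_free_dir` and `pairs_to_compare` are hypothetical helpers. The sketch assumes a pair should be skipped only when both files sit inside a dupe-free directory, and that all paths are given in the same form (all relative or all absolute).

```python
from itertools import combinations
from pathlib import Path


def in_dupe_free_dir(path, dupe_free_dirs):
    """Return True if `path` lies inside any directory in `dupe_free_dirs`."""
    path = Path(path)
    return any(Path(d) in path.parents for d in dupe_free_dirs)


def pairs_to_compare(id_by_location, dupe_free_dirs):
    """Yield (id_A, id_B) pairs, skipping pairs that are both in dupe-free dirs."""
    protected = {
        file_id
        for file_id, location in id_by_location.items()
        if in_dupe_free_dir(location, dupe_free_dirs)
    }
    for id_a, id_b in combinations(id_by_location, 2):
        if id_a in protected and id_b in protected:
            continue  # both files are assumed to be unique already
        yield id_a, id_b


# Hypothetical example data: two files in a clean folder, two in a new dump.
id_by_location = {
    0: "project/a.jpg",
    1: "project/b.jpg",
    2: "dump/c.jpg",
    3: "dump/d.jpg",
}
pairs = list(pairs_to_compare(id_by_location, ["project"]))
```

Here only the pair `(0, 1)` is dropped, since both files live in the dupe-free `project` folder; every pair that involves a file from `dump` is still compared.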
I had implemented a very rough way to do pairwise comparison between folders in difPy V3, but I don't have the knowledge to do it in V4. It only works for two folders, but it was sometimes useful when you need to compare a small number of files (500) against a much larger set (20,000) and don't want to pay the quadratic cost of comparing every file against every other. The break point (`bp`) between folders is hard-coded here and is the number of images in the smaller folder.
```python
# Modified version of difPy v3's _matches; it relies on difPy's internal
# _help and _compute helpers and on pathlib.Path being imported.
def _matches(imgs_matrices, id_by_location, similarity, show_output, show_progress, fast_search):
    progress_count = 0
    duplicate_count, similar_count = 0, 0
    total_count = len(imgs_matrices)
    exclude_from_search = []
    result = {}
    bp = 89  # EDIT: hard-coded number of images in the smaller folder

    for number_A, (id_A, matrix_A) in enumerate(imgs_matrices.items()):
        if number_A > bp:  # EDIT: stop once past the smaller folder
            break
        if show_progress:
            _help._show_progress(progress_count, total_count, task='comparing images')
        if id_A in exclude_from_search:
            progress_count += 1
        else:
            for number_B, (id_B, matrix_B) in enumerate(imgs_matrices.items()):
                if number_B > number_A and number_B > bp - 2:  # EDIT
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            matrix_B = _help._rotate_img(matrix_B)
                        try:
                            mse = _compute._mse(matrix_A, matrix_B)
                        except Exception:
                            # A failed comparison is treated as a non-match.
                            mse = float('inf')
                        if mse <= similarity:
                            check = False
                            for key in result.keys():
                                if id_A in result[key]['matches']:
                                    result[key]['matches'][id_B] = {'location': str(Path(id_by_location[id_B])),
                                                                    'mse': mse}
                                    check = True
                            if not check:
                                if id_A not in result.keys():
                                    result[id_A] = {'location': str(Path(id_by_location[id_A])),
                                                    'matches': {id_B: {'location': str(Path(id_by_location[id_B])),
                                                                       'mse': mse}}}
                                else:
                                    result[id_A]['matches'][id_B] = {'location': str(Path(id_by_location[id_B])),
                                                                     'mse': mse}
                            if show_output:
                                _help._show_img_figs(matrix_A, matrix_B, mse)
                                _help._show_file_info(str(Path(id_by_location[id_A])), str(Path(id_by_location[id_B])))
                            if fast_search:
                                exclude_from_search.append(id_B)
                            rotations = 4
                        else:
                            rotations += 1
            progress_count += 1

    if similarity > 0:
        for id in result:
            for matchid in result[id]['matches']:
                if result[id]['matches'][matchid]['mse'] > 0:
                    similar_count += 1
                else:
                    duplicate_count += 1
    else:
        for id in result:
            duplicate_count += len(result[id]['matches'])
    return result, exclude_from_search, total_count, duplicate_count, similar_count
```
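The same idea can be expressed without a hard-coded break point by keeping the two folders in separate collections. The sketch below is standalone and not part of difPy: `mse` and `match_across` are hypothetical names, small nested lists stand in for real image matrices, and the rotation check is left out for brevity.

```python
def mse(matrix_a, matrix_b):
    """Mean squared error between two equally sized 2-D lists of numbers."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(matrix_a, matrix_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def match_across(small, large, similarity=0):
    """Compare every image in `small` against every image in `large` only."""
    result = {}
    for id_a, matrix_a in small.items():
        for id_b, matrix_b in large.items():
            score = mse(matrix_a, matrix_b)
            if score <= similarity:
                result.setdefault(id_a, {})[id_b] = score
    return result


# Toy stand-ins for image matrices (assumed to share the same shape).
small = {"s1": [[0, 0], [0, 0]], "s2": [[9, 9], [9, 9]]}
large = {"l1": [[0, 0], [0, 0]], "l2": [[5, 5], [5, 5]], "l3": [[9, 9], [9, 9]]}

matches = match_across(small, large)
```

This performs `len(small) * len(large)` comparisons and never compares two images from the same folder, which is the behaviour the break point above approximates.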
My apologies if this is already possible, but I didn't find any method within the documentation at https://difpy.readthedocs.io/en/latest/usage.html.
Currently, the multiple folder method within difPy searches for duplicates amongst all the folders listed. However, once you have de-duplicated the images within a folder, if you then look to compare an additional folder against those that you have already de-duplicated, the de-duplicated folder's contents again get compared against themselves.
It would be nice to have an optional parameter for designating a source folder that multiple comparison folders are checked against. Basically, difPy would assume the images in the source folder are unique, so they would only need to be checked against the images in the provided comparison folders, not against each other.
I believe this would help in larger image projects like mine. In my scenario, I've downloaded photos from my partner's and my phones and tablets multiple times over the years. As I started de-duplicating with difPy, I would move the de-duplicated files into a central project folder and then scan the next photo dump folder against the project folder. Since the project folder already contains only unique images, I don't need difPy to check those images against each other again, but I'm not seeing any existing way to do that. I think this would be a noticeable performance improvement, especially as image sets get larger and larger.
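To give a rough sense of the saving, the sketch below counts pairwise comparisons with and without a trusted source folder. It is plain arithmetic, not a difPy measurement: the function names are hypothetical, the folder sizes are illustrative, and early exits such as skipping already-matched files are ignored.

```python
from math import comb


def all_pairs(n_source, n_new):
    """Comparisons when every file is checked against every other file."""
    return comb(n_source + n_new, 2)


def source_aware(n_source, n_new):
    """Comparisons when source files are never checked against each other."""
    return n_source * n_new + comb(n_new, 2)


# Illustrative sizes only: a 20,000-image project folder and a 500-image dump.
current = all_pairs(20_000, 500)
proposed = source_aware(20_000, 500)
print(current, proposed)
```

With these sizes the count drops from 210,114,750 to 10,124,750 comparisons, roughly a twentyfold reduction, and the gap widens as the source folder grows.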