elisemercury / Duplicate-Image-Finder

difPy - Python package for finding duplicate or similar images within folders
https://difpy.readthedocs.io
MIT License
420 stars 65 forks

Enhancement - Optional parameter set for source folder / comparison folder mode #72

Open Get-Ryan opened 1 year ago

Get-Ryan commented 1 year ago

My apologies if this is already possible, but I didn't find any method within the documentation at https://difpy.readthedocs.io/en/latest/usage.html.

Currently, the multiple folder method within difPy searches for duplicates amongst all the folders listed. However, once you have de-duplicated the images within a folder, if you then look to compare an additional folder against those that you have already de-duplicated, the de-duplicated folder's contents again get compared against themselves.

It would be nice to have an optional set of parameters that designates one source folder for multiple comparison folders to be checked against. Basically, difPy would assume the images in the source folder are already unique, so they would only need to be checked for duplicates in the provided comparison folders, not against each other.

I believe this would help in larger image projects like mine. In my scenario, I've downloaded photos from my partner's and my own phones and tablets multiple times throughout the years. As I started de-duplicating with difPy, I would move the de-duplicated files into a central project folder and then scan the next photo dump folder against the project folder. Since the project folder already contains only unique images, I don't need difPy to check those images against each other again, but I'm not seeing any existing method to do that. I think this would be a noticeable performance improvement, especially as image sets get larger and larger.
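To put a rough number on the potential saving, here is a back-of-the-envelope sketch. It assumes every pair of images is compared exactly once and ignores any optimizations difPy actually applies. The function names and folder sizes are made up for illustration.

```python
from math import comb


def comparisons_all_vs_all(n_source: int, n_new: int) -> int:
    """Pairs compared when every image is checked against every other image."""
    return comb(n_source + n_new, 2)


def comparisons_with_trusted_source(n_source: int, n_new: int) -> int:
    """Pairs compared when source-vs-source pairs are skipped."""
    # new-vs-source pairs plus new-vs-new pairs
    return n_source * n_new + comb(n_new, 2)


# Hypothetical sizes: a 20,000-image project folder and a 500-image photo dump.
full = comparisons_all_vs_all(20_000, 500)
reduced = comparisons_with_trusted_source(20_000, 500)
print(full, reduced)  # 210114750 10124750
```

Under these assumptions, the skipped work is exactly the source-vs-source pairs, which dominate as soon as the project folder is much larger than each new dump.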

dedefelrodrigues commented 1 year ago

I have the same wish as well to better use difpy in my projects.

UplandsDynamic commented 1 year ago

@elisemercury, just noticed this issue and had a quick look to see how it might be implemented.

I've not coded/tested this yet - just did a very quick code review and noted down the idea, so it may not work (and may well have idiotic mistakes!). But if you think it's a valid approach - and want to add this feature - let me know and I'll code it, test it, and open a pull request.

  1. Add an input param to take a list of directories where files should only be checked against files located outwith directories in this list, and assign its value to a new dif class parameter, e.g., self.dupe_free_dirs

  2. Change the _search.exclude_from_search variable to a dif class parameter and pass that in as an arg to both the _search.matches and _compute.id_by_location methods.

  3. Amend line 339 to if (number_B > number_A) and id_B not in self.exclude_from_search.

  4. Use the existing directory for-loop in _compute.id_by_location, to check the file locations against directories stored in self.dupe_free_dirs. If found, add the file ID (once created) to self.exclude_from_search.
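The steps above could be sketched roughly as follows. This is a standalone toy, not difPy code: `trusted_ids` and `pairs_to_compare` are hypothetical helpers, and only the names `dupe_free_dirs` and `id_by_location` come from the thread. It skips a pair only when both files sit in a dupe-free directory, which is one interpretation of the intent rather than a literal transcription of step 3.

```python
from pathlib import PurePosixPath


def trusted_ids(id_by_location, dupe_free_dirs):
    """Collect the IDs of files located under any dupe-free directory (step 4)."""
    trusted = [PurePosixPath(d) for d in dupe_free_dirs]
    return {
        file_id
        for file_id, location in id_by_location.items()
        if any(d in PurePosixPath(location).parents for d in trusted)
    }


def pairs_to_compare(id_by_location, dupe_free_dirs):
    """List the (id_A, id_B) pairs that still need an image comparison."""
    skip = trusted_ids(id_by_location, dupe_free_dirs)
    ids = list(id_by_location)
    pairs = []
    for number_A, id_A in enumerate(ids):
        for id_B in ids[number_A + 1:]:
            # Both files are assumed unique already, so skip the pair.
            if id_A in skip and id_B in skip:
                continue
            pairs.append((id_A, id_B))
    return pairs


# Hypothetical file layout for illustration.
id_by_location = {
    0: "/photos/project/a.jpg",
    1: "/photos/project/b.jpg",
    2: "/photos/dump/c.jpg",
    3: "/photos/dump/d.jpg",
}
pairs = pairs_to_compare(id_by_location, ["/photos/project"])
print(pairs)  # [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Only the pair `(0, 1)` is dropped here, because both of its files live in the dupe-free folder. Files in the dump folder are still compared against the project folder and against each other.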

jdoe1917 commented 8 months ago

I implemented a very rough way to do pairwise comparison between folders in difPy V3, but I don't have the knowledge to do it in V4. It only works for two folders, but it was sometimes useful when you need to compare a small number of files (500) against a much larger set (20,000) and don't want to pay the quadratic cost of comparing every image against every other image. The break point (bp) between folders is hard-coded here and is the number of images in the smaller folder.

```python
from pathlib import Path

# _help and _compute are difPy V3 internals defined in the same module as this function.


def _matches(imgs_matrices, id_by_location, similarity, show_output, show_progress, fast_search):
    """Function that searches the images on duplicates/similarity matches."""
    progress_count = 0
    duplicate_count, similar_count = 0, 0
    total_count = len(imgs_matrices)
    exclude_from_search = []
    result = {}

    bp = 89  # EDIT: hard-coded break point between the two folders
    for number_A, (id_A, matrix_A) in enumerate(imgs_matrices.items()):
        if number_A > bp:  # EDIT: stop the outer loop once it passes the break point
            break
        if show_progress:
            _help._show_progress(progress_count, total_count, task='comparing images')
        if id_A in exclude_from_search:
            progress_count += 1
        else:
            for number_B, (id_B, matrix_B) in enumerate(imgs_matrices.items()):
                # EDIT: only compare against images at index bp - 1 or later
                if number_B > number_A and number_B > bp - 2:
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            matrix_B = _help._rotate_img(matrix_B)
                        try:
                            mse = _compute._mse(matrix_A, matrix_B)
                        except Exception:
                            # Comparison failed: treat this rotation as "no match"
                            rotations += 1
                            continue
                        if mse <= similarity:
                            match = {'location': str(Path(id_by_location[id_B])), 'mse': mse}
                            check = False
                            # Attach id_B to any existing group that already contains id_A
                            for key in result:
                                if id_A in result[key]['matches']:
                                    result[key]['matches'][id_B] = dict(match)
                                    check = True
                            if not check:
                                if id_A not in result:
                                    result[id_A] = {'location': str(Path(id_by_location[id_A])),
                                                    'matches': {id_B: match}}
                                else:
                                    result[id_A]['matches'][id_B] = match
                            if show_output:
                                _help._show_img_figs(matrix_A, matrix_B, mse)
                                _help._show_file_info(str(Path(id_by_location[id_A])),
                                                      str(Path(id_by_location[id_B])))
                            if fast_search:
                                exclude_from_search.append(id_B)
                            rotations = 4  # match found, stop rotating
                        else:
                            rotations += 1
            progress_count += 1

    if similarity > 0:
        # With a similarity threshold, an MSE of exactly 0 is a duplicate, anything else is similar
        for group in result.values():
            for match in group['matches'].values():
                if match['mse'] > 0:
                    similar_count += 1
                else:
                    duplicate_count += 1
    else:
        for group in result.values():
            duplicate_count += len(group['matches'])
    return result, exclude_from_search, total_count, duplicate_count, similar_count
```
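One way to avoid the hard-coded break point might be to keep the two folders in separate dictionaries and only loop over cross-folder pairs. The sketch below is a simplified, self-contained illustration and not difPy code: `mse` and `cross_folder_matches` are hypothetical names, images are represented as flat lists of pixel values, and rotations are not handled.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_folder_matches(small, large, similarity=0.0):
    """Compare every image in `small` against every image in `large` only."""
    matches = {}
    for id_a, pixels_a in small.items():
        for id_b, pixels_b in large.items():
            score = mse(pixels_a, pixels_b)
            if score <= similarity:
                matches.setdefault(id_a, {})[id_b] = score
    return matches


# Toy "images" for illustration: four pixels each.
small = {"s1": [0, 0, 0, 0], "s2": [9, 9, 9, 9]}
large = {"l1": [0, 0, 0, 0], "l2": [5, 5, 5, 5], "l3": [9, 9, 9, 8]}

exact = cross_folder_matches(small, large)
similar = cross_folder_matches(small, large, similarity=0.5)
print(exact)    # {'s1': {'l1': 0.0}}
print(similar)  # {'s1': {'l1': 0.0}, 's2': {'l3': 0.25}}
```

The number of comparisons is `len(small) * len(large)`, so no image is ever compared against another image from its own folder.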