erdogant / undouble

The undouble Python package detects (near-)identical images.
BSD 3-Clause "New" or "Revised" License

Error with `Undouble.group` function and numpy array handling #9

Open bcaradima opened 1 year ago

bcaradima commented 1 year ago

Hi, I'm interested in matching some compressed photos (extracted from Excel files) to their original high-quality copies, and I came across your package. I gave it a try, setting up a new environment with Miniconda/Python 3 and installing undouble via pip. I followed the workflow outlined in your documentation by importing the images and computing the image hashes:

import os
from undouble import Undouble

dir_in = "inputs"
dir_out = "outputs"

# directory containing all photos (including the compressed ones)
dir_photos = os.path.join(dir_in, "Photos")

# initialize with default settings
model = Undouble(method = "ahash")

# import the Excel photos
model.import_data(targetdir = dir_photos)

# compute image hashes
model.compute_hash()

# hashes look blocky and simplistic; are they good enough for matching?
model.plot_hash(idx = 10)

# group images
model.group(threshold = 0)

However, model.group returns an error:

Traceback (most recent call last):

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\bcaradima\projects\1-imager.py:59
    model.group(threshold = 0)

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\undouble\undouble.py:257 in group
    self.results['select_pathnames'] = np.array(pathnames)[idx].tolist()

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.

It appears to be related to numpy array manipulation; I'm using Python 3.8 and numpy 1.24.3. I've verified that numpy is up to date, but I'm wondering whether undouble requires a specific numpy version to work? Thank you for your help.
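For reference, here is a minimal reproduction of the numpy behaviour with a ragged nested list (sub-lists of different lengths), which is my guess at what group is hitting:

import numpy as np

pathnames = [["a.jpg", "b.jpg"], ["c.jpg"]]  # ragged: sub-lists differ in length
np.array(pathnames)                 # ValueError on numpy >= 1.24
np.array(pathnames, dtype=object)   # works: 1-D object array holding the lists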

BTW, does it make sense to store the compressed images alongside the rest of the photos, or would you suggest an alternative approach?

Thanks again.

erdogant commented 1 year ago

Thank you for the notification! You are right: something changed in numpy that causes this error. I fixed it! Update to the latest version with:

pip install -U undouble
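
After updating, you can double-check the installed version (assuming undouble exposes __version__, as clustimage does):

import undouble
print(undouble.__version__)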

erdogant commented 1 year ago

Regarding the hashes looking blocky and simplistic: you can easily change the hash size and re-create the plots to get a feeling and intuition for how "blocky" your use case needs to be.

model.compute_hash(hash_size=32)
# Plot the hash
model.plot_hash(filenames=model.results['filenames'][model.results['select_idx'][0]])
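
To compare sizes side by side, you could recompute and plot the hash of the same image at increasing sizes (a quick sketch reusing the calls above):

# Larger hash_size preserves more detail; smaller is more tolerant to changes.
for size in (8, 16, 32):
    model.compute_hash(hash_size=size)
    model.plot_hash(idx=10)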

bcaradima commented 1 year ago

Thanks, I managed a manual fix before updating by editing the group function in undouble.py based on this SO thread. I can also confirm the issue is fixed with this update.

Reviewing the documentation, I can't see a straightforward way to extract the full paths to the files in a group. Is there a built-in function to do this or does it require digging into the model object? Thanks!

erdogant commented 1 year ago

Great to hear!

I added an example to the docs regarding your question. This was your question, right?
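In short, something along these lines (the select_pathnames key is taken from the traceback above; verify the exact key against your installed version):

# After model.group(), each entry in select_pathnames should contain the
# full paths of one group of (near-)identical images.
for i, paths in enumerate(model.results['select_pathnames']):
    print(f"Group {i}: {len(paths)} images")
    for p in paths:
        print(p)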

bcaradima commented 1 year ago

Yes it was. Thank you for your help!

bcaradima commented 1 year ago

Hello again! I've tried applying the same script to a much larger matching task: in this case, finding matches for ~140 compressed photos within a wider set of ~12,400 high-quality photos. When I run the script I get:

py_run_file("1-undouble.py")

[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [outputs\Fisheries]
[undouble] >INFO> [12385] files are collected recursively from path: [outputs\Fisheries]
[undouble] >INFO> [12385] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]: 100%|██████████| 12385/12385 [52:00<00:00,  3.97it/s]
Error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (12385,) + inhomogeneous part.

Strangely, this is the same error as before! I'm not sure if this is where the issue occurs, but I checked the source code for the group function and it seems to include the fix recommended in this SO thread (i.e., constructing the numpy arrays with dtype=object), so I'm not really sure what the issue is. I'm running undouble-1.2.10 and numpy-1.24.3.

Any ideas/suggestions would be welcome!

PS: it looks like this error may arise from the clustimage package

erdogant commented 1 year ago

Which version of clustimage are you using? If it is not 1.5.17, try forcing an update to the latest one.

import clustimage
clustimage.__version__

Force update clustimage with:

pip install -U clustimage

bcaradima commented 11 months ago

Thanks Erdogant, I've forced an update to clustimage version 1.5.18 with

pip install -U --trusted-host pypi.org --trusted-host files.pythonhosted.org clustimage

(the extra flags were needed due to issues with trusted SSL certificates). The issue above has been resolved.

I'm now applying the same workflow to a much larger set of images (~231k images totalling 187 GB), but I'm running into a different error:

[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [D:/photos/Wildlife]
[undouble] >INFO> [231496] files are collected recursively from path: [D:/photos/Wildlife]
[undouble] >INFO> [231496] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
1-undouble: Computing image hashes...
100%|██████████| 231486/231486 [02:33<00:00, 1510.89it/s]
[undouble] >INFO> Compute adjacency matrix [231486x231486] with absolute differences based on the image-hash of [ahash].
Error: AttributeError: 'bool' object has no attribute 'sum'

I'm not sure if this error is due to undouble or clustimage (if you'd like I can repost this as a new issue on the clustimage repository).

erdogant commented 11 months ago

Nice fix with the trusted host!

Those are a lot of files though. And an interesting error: Error: AttributeError: 'bool' object has no attribute 'sum'. Do you have an idea why this error would occur only with a large set of images? Could there be corrupt images? Or maybe thumbnail images? Can you post the input parameters that you are using?

bcaradima commented 11 months ago

I agree that it is quite a large collection of files, but undouble did run without issue on the initial set of ~12.5k photos. The full script that I run is:

import os
from undouble import Undouble

dir_in = "inputs"
dir_out = "outputs"

# LOCATE DATA

# get filenames and paths to input data
# dir_photos = os.path.join(dir_in, "example", "Photos")

# IMAGE MATCHING

# initialize a 'model'; use the hash_size argument to increase hash uniqueness
model = Undouble(method = "ahash", hash_size = 8)

# import the photos
model.import_data(targetdir = dir_photos)

print("1-undouble: Computing image hashes...")

# compute image hashes
model.compute_hash()

print("1-undouble: Computing image hashes... DONE")

model.group(threshold = 0)

where dir_photos is a path to the photos stored on an external drive. The full output from the script is:

[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [D:/photos/Wildlife]
[undouble] >INFO> [231496] files are collected recursively from path: [D:/photos/Wildlife]
[undouble] >INFO> [231496] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]:   0%|          | 893/231496 [05:29<9:08:21,  7.01it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\ANU-WLD-05_20210819-File Corrupt-add to RNQA-18042.jpg]
[clustimage]:   0%|          | 896/231496 [05:29<7:29:11,  8.56it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\ANU-WLD-08_20210823_file-corrupt.jpg]
[clustimage]:   0%|          | 902/231496 [05:30<8:13:32,  7.79it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\GAL-WLD-02_20210818.jpg]
[clustimage]:   0%|          | 913/231496 [05:32<9:55:45,  6.45it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\POR-WLD-02_20210820.jpg]
[clustimage]:   2%|▏         | 3885/231496 [23:38<31:56:59,  1.98it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Grizzly Bear Hair Snag\Trip 6\Site photos\2022-08-28\P8281268.JPG]
[clustimage]:  18%|█▊        | 40528/231496 [2:02:45<20:54:59,  2.54it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\PORC-CAM14\2023-06-13\Site Photos\field_89662866564893e9577e61.jpg]
[clustimage]:  68%|██████▊   | 157860/231496 [5:28:32<2:05:29,  9.78it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0021.JPG]
[clustimage]:  68%|██████▊   | 157866/231496 [5:28:32<1:48:52, 11.27it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0027.JPG]
[clustimage]:  68%|██████▊   | 157868/231496 [5:28:33<1:35:17, 12.88it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0030.JPG]
[clustimage]: 100%|█████████▉| 231174/231496 [7:28:03<01:20,  3.98it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\X Incidental Wildlife Observations\2022\2022-Feb-23\Mess creek wolf.jpg]
[clustimage]: 100%|██████████| 231496/231496 [7:30:10<00:00,  8.57it/s]
[undouble] >INFO> [10] Corrupt image(s) removed.
1-undouble: Computing image hashes...
100%|██████████| 231486/231486 [02:33<00:00, 1510.89it/s]
[undouble] >INFO> Compute adjacency matrix [231486x231486] with absolute differences based on the image-hash of [ahash].
Error: AttributeError: 'bool' object has no attribute 'sum'

So it does include some corrupt images. Due to the size of the photo directory, I'm not certain but I imagine there are thumbnail images and duplicates scattered throughout the folder.

I also just realized that the paths might be constructed incorrectly, because I build the path in R and pass it to Python using the reticulate package. Since the script takes a few hours to run on so many photos, do you think it's worth trying to correct the paths and re-run?

Thanks!

erdogant commented 11 months ago

Thanks for the additional information. I would for sure make a small function that first checks all paths and images. That is an easy and fast check.

I am thinking of restructuring the code to do the same: first check all pathnames and whether all images are accessible and not corrupted. Waiting for hours and then finding out the very last image is corrupt is a pity.
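Something along these lines, for example; a minimal sketch using PIL (verify() is cheap but catches only some forms of corruption, and any non-image file in the tree will be reported too):

import os
from PIL import Image

def find_unreadable_images(targetdir):
    # Walk targetdir and collect paths that PIL cannot open or verify.
    bad = []
    for root, _, files in os.walk(targetdir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with Image.open(path) as im:
                    im.verify()  # integrity check; does not fully decode
            except Exception:
                bad.append(path)
    return bad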

erdogant commented 11 months ago

For a quick check, you may want to resize the images to a very small frame (25x25 or so). I guess that would result in a very fast run. If that works, proceed with your current parameters.

bcaradima commented 11 months ago

I checked the documentation for resizing, but it doesn't provide a usage example. Is this a one-line function call that is done in memory? I'm just hoping that resizing all ~231k images at once can be done with relative ease...

erdogant commented 11 months ago

Maybe I should rename this input parameter. It is now named dim. Creating an example in the docs would indeed be beneficial (it's on my todo list).

# Import library
from undouble import Undouble
model = Undouble(dim=(25, 25))

bcaradima commented 11 months ago

Strangely, I am getting the old error about inhomogeneous arrays (presumably from clustimage) again:

Error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (231648,) + inhomogeneous part.

I checked the main script for clustimage (below) and it seems to be missing the fix recommended in this SO thread.

"C:\Users\bcaradima\AppData\Local\r-miniconda\envs\r-reticulate\Lib\site-packages\clustimage\clustimage.py"

Would it be possible to apply the fix and push the changes?

erdogant commented 11 months ago

I am confused now. Can you point out more specifically where it (again) breaks? And which lines should have been changed but are not?

bcaradima commented 11 months ago

I'm also a bit confused 😄, but I'll try to clarify what's going on.

Earlier, I mentioned encountering this same error and found this SO thread, which mentions a simple fix. Rather than waiting for package updates, I manually edited the Python scripts for undouble and clustimage directly, changing instances where numpy arrays were constructed from:

numpy.array([1.2, "abc"])

to something like this:

numpy.array([1.2, "abc"], dtype=object)

This seemed to fix the error, but now I'm running undouble-1.2.10 and clustimage-1.5.18 without any manual edits to the scripts, and the same error is coming up again. I think one of these packages might be the issue; specifically, the error arises with:

model.group(threshold = 0)

So the issue is either in undouble's group function, or group is calling something in clustimage that causes the error to arise?

bcaradima commented 11 months ago

I did some more poking around to figure out where the error is coming from. I ran the script line-by-line and found the error above occurs with:

model.import_data(targetdir = dir_photos)

Reviewing the source code of undouble, it appears that import_data calls a function in clustimage with the same name. I would assume this is where the array error is coming from. But I'm still puzzled as to why it hasn't been an issue in the past.

erdogant commented 11 months ago

I always try to re-use functionality from my other libraries because it lowers the maintenance burden and makes bug fixing easier. The import_data function is one of them. At the moment I am abroad and can't do any debugging.

But if you did, for example, fix the bug for the group issue, please feel free to push it! That would be really helpful.

erdogant commented 11 months ago

I did find the bug with the inhomogeneous shape and fixed it! It happened during the computation of one of the hashes: the output array format had changed, which caused the issue. Try updating to the latest version of clustimage and then run it again!

pip install -U clustimage

fernandoferreira-me commented 5 months ago

Hello, I'm facing a similar problem:

[undouble] >INFO> [171595] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171595/171595 [38:40<00:00, 73.96it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171595/171595 [01:28<00:00, 1942.33it/s]
[undouble] >INFO> Compute adjacency matrix [171595x171595] with absolute differences based on the image-hash of [phash].
Traceback (most recent call last):
  File "/home/fferreira/Workspace/images/src/image_processing.py", line 45, in <module>
    Main()
  File "/home/fferreira/Workspace/images/src/image_processing.py", line 41, in Main
    find_duplicated()
  File "/home/fferreira/Workspace/images/src/image_processing.py", line 12, in find_duplicated
    model.compute_hash()
  File "/home/fferreira/.pyenv/versions/image_analysis/lib/python3.10/site-packages/undouble/undouble.py", line 218, in compute_hash
    self.results['adjmat'] = (self.results['img_hash_bin'][:, None, :] != self.results['img_hash_bin']).sum(2)
AttributeError: 'bool' object has no attribute 'sum'

Versions

clustimage == 1.5.2
imagehash == 4.3.1
numpy == 1.23.5
undouble == 1.2.11

I've checked whether the images were corrupted, but that does not seem to be the case.

Here is my code:

from undouble import Undouble
from PIL import Image
import glob
import os
from tqdm.auto import tqdm

IMAGE_FOLDER = '../data/'

def find_duplicated():
    model = Undouble()
    model.import_data(IMAGE_FOLDER)
    model.compute_hash()
    model.group(threshold=0)
    model.move()
    return model

def check_corrupted_image(image: str) -> bool:
    try:
        im = Image.open(image)
        im.verify()
        im.close()
    except Exception as err:
        print(f'{image} -> {err}')
        return False
    return True

def remove_corrupted_images():
    print('Check for corrupted files')
    corrupted = []
    for path in tqdm(glob.glob(f"{IMAGE_FOLDER}**", recursive=True)):
        if path.endswith('.png'):
            if not check_corrupted_image(path):
                corrupted.append(path)
    print(corrupted)
    for path in corrupted:
        os.remove(path)
        print(f'{path} removed')

def Main():
    remove_corrupted_images()
    find_duplicated()

if __name__ == '__main__':
    Main()

fernandoferreira-me commented 5 months ago

OK, I've done some digging, and I guess the problem is with numpy. It seems the != operator changes its behaviour from an element-wise comparison to a whole-object comparison for large arrays. That is,

self.results['img_hash_bin'][:, None, :] != self.results['img_hash_bin']

returns a plain boolean when self.results['img_hash_bin'] has more than ~20,000 elements. At least, that is my experimental result.

I've tried changing the != to an XOR operator. Then the program breaks, complaining that it doesn't have 1.1 TB of memory.

Does it make sense?
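
One way to sidestep both problems might be to compute the distances in row chunks, so the full (n, n, bits) boolean intermediate is never materialised. A sketch, assuming img_hash_bin is an (n, bits) 0/1 array; note that the n x n result itself still needs tens of GB at this scale, and chunk should be tuned to the available memory:

import numpy as np

def hamming_adjmat(hash_bits, chunk=100):
    # Chunked Hamming-distance matrix: same result as the broadcasted
    # comparison in compute_hash, but only (chunk, n, bits) lives in memory.
    n = hash_bits.shape[0]
    adjmat = np.empty((n, n), dtype=np.uint16)
    for s in range(0, n, chunk):
        block = hash_bits[s:s + chunk]                      # (chunk, bits)
        adjmat[s:s + chunk] = (block[:, None, :] != hash_bits).sum(2)
    return adjmat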

erdogant commented 5 months ago

Upgrading to 1.1 TB of memory is always the solution ;)

But can you maybe try this line instead? I cannot reproduce the error, but this may work:

self.result = np.not_equal(self.results['img_hash_bin'][:, None, :], self.results['img_hash_bin']).sum(2)