bcaradima opened this issue 1 year ago
Regarding hashes looking blocky and simplistic: you can easily change the hash size and create the plots to get more feeling and intuition for how "blocky" your use case needs to be.
model.compute_hash(hash_size=32)
# Plot the hash
model.plot_hash(filenames=model.results['filenames'][model.results['select_idx'][0]])
Thanks, I managed a manual fix before updating by editing the group function in undouble.py based on this SO thread. I can also confirm the issue is fixed with this update.
Reviewing the documentation, I can't see a straightforward way to extract the full paths to the files in a group. Is there a built-in function to do this or does it require digging into the model object? Thanks!
Great to hear!
I added an example to the docs regarding your question; the gist of it is in the sketch below. This was your question, right?
Yes it was. Thank you for your help!
Hello again! I've tried using the same script, applying undouble to a much larger matching task; in this case, finding matches for ~140 compressed photos in a wider set of ~12,400 high-quality photos. When I run the script I get:
py_run_file("1-undouble.py")
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [outputs\Fisheries]
[undouble] >INFO> [12385] files are collected recursively from path: [outputs\Fisheries]
[undouble] >INFO> [12385] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]: 100%|██████████| 12385/12385 [52:00<00:00, 3.97it/s]
Error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (12385,) + inhomogeneous part.
Strangely, this is the same error as before! I'm not sure if this is where the issue occurs, but I checked the source code for the group function and it seems to include the fix mentioned in this SO thread (i.e., setting numpy arrays to dtype=object), so I'm not really sure what the issue is. I'm running undouble-1.2.10 and numpy-1.24.3.
Any ideas/suggestions would be welcome!
PS: it looks like this error may arise from the clustimage package.
Which version of clustimage are you using? If it is not 1.5.17, try to force an update to the latest one.
import clustimage
clustimage.__version__
# Force update clustimage
pip install -U clustimage
Thanks Erdogant, I've forced an update to clustimage version 1.5.18 with
pip install -U --trusted-host pypi.org --trusted-host files.pythonhosted.org clustimage
due to issues with trusted SSL certificates. The issue above has been resolved.
I'm now applying the same workflow on a much larger set of images (~231k images totalling 187GB), but I'm running into a different error:
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [D:/photos/Wildlife]
[undouble] >INFO> [231496] files are collected recursively from path: [D:/photos/Wildlife]
[undouble] >INFO> [231496] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
1-undouble: Computing image hashes...
100%|██████████| 231486/231486 [02:33<00:00, 1510.89it/s]
[undouble] >INFO> Compute adjacency matrix [231486x231486] with absolute differences based on the image-hash of [ahash].
Error: AttributeError: 'bool' object has no attribute 'sum'
I'm not sure if this error is due to undouble or clustimage (if you'd like I can repost this as a new issue on the clustimage repository).
Nice fix with the trusted host!
Those are a lot of files though. And an interesting error: AttributeError: 'bool' object has no attribute 'sum'
Do you have an idea why this error would occur only with a large set of images? Can there be corrupt images? Or maybe thumbnail images or so? Can you post the input parameters that you are using?
I agree that it is quite a large collection of files, but undouble did run without issue on an initial set of ~12.5k photos. The full script that I run is:
import os
import pandas as pd
from undouble import Undouble
dir_in = "inputs"
dir_out = "outputs"
# LOCATE DATA
#' get filenames and paths to input data
# dir_photos = os.path.join(dir_in, "example", "Photos")
#' dir_photos is defined externally (passed from R via reticulate); see note below
# IMAGE MATCHING
#' initialize a 'model'
#' use hash_size argument to increase hash uniqueness
model = Undouble(method = "ahash", hash_size = 8)
#' import the photos
model.import_data(targetdir = dir_photos)
print("1-undouble: Computing image hashes...")
#' compute image hash
model.compute_hash()
print("1-undouble: Computing image hashes... DONE")
model.group(threshold = 0)
where dir_photos is a path to the photos stored on an external drive. The full output from the script is:
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> filepath is set to [C:\Users\BCARAD~1\AppData\Local\Temp\clustimage]
[undouble] >INFO> Extracting images from: [D:/photos/Wildlife]
[undouble] >INFO> [231496] files are collected recursively from path: [D:/photos/Wildlife]
[undouble] >INFO> [231496] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]: 0%| | 893/231496 [05:29<9:08:21, 7.01it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\ANU-WLD-05_20210819-File Corrupt-add to RNQA-18042.jpg]
[clustimage]: 0%| | 896/231496 [05:29<7:29:11, 8.56it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\ANU-WLD-08_20210823_file-corrupt.jpg]
[clustimage]: 0%| | 902/231496 [05:30<8:13:32, 7.79it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\GAL-WLD-02_20210818.jpg]
[clustimage]: 0%| | 913/231496 [05:32<9:55:45, 6.45it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Anuk_Butte-Site Visit_Ecosystems-habitat plots_2021-Aug\Plot cards\Datasheet Photo Archive\POR-WLD-02_20210820.jpg]
[clustimage]: 2%|▏ | 3885/231496 [23:38<31:56:59, 1.98it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Grizzly Bear Hair Snag\Trip 6\Site photos\2022-08-28\P8281268.JPG]
[clustimage]: 18%|█▊ | 40528/231496 [2:02:45<20:54:59, 2.54it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\PORC-CAM14\2023-06-13\Site Photos\field_89662866564893e9577e61.jpg]
[clustimage]: 68%|██████▊ | 157860/231496 [5:28:32<2:05:29, 9.78it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0021.JPG]
[clustimage]: 68%|██████▊ | 157866/231496 [5:28:32<1:48:52, 11.27it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0027.JPG]
[clustimage]: 68%|██████▊ | 157868/231496 [5:28:33<1:35:17, 12.88it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\Wildlife Camera and ARU\WMC-CAM18\2022-05-30\Camera Photos\IMG_0030.JPG]
[clustimage]: 100%|█████████▉| 231174/231496 [7:28:03<01:20, 3.98it/s][undouble] >WARNING> Could not read: [D:/photos/Wildlife\X Incidental Wildlife Observations\2022\2022-Feb-23\Mess creek wolf.jpg]
[clustimage]: 100%|██████████| 231496/231496 [7:30:10<00:00, 8.57it/s]
[undouble] >INFO> [10] Corrupt image(s) removed.
1-undouble: Computing image hashes...
100%|██████████| 231486/231486 [02:33<00:00, 1510.89it/s]
[undouble] >INFO> Compute adjacency matrix [231486x231486] with absolute differences based on the image-hash of [ahash].
Error: AttributeError: 'bool' object has no attribute 'sum'
So it does include some corrupt images. Due to the size of the photo directory, I'm not certain, but I imagine there are thumbnail images and duplicates scattered throughout the folder.
I also just realized that the paths might be constructed incorrectly, because I build the path in R and pass it to Python using the reticulate package. Since the script takes a few hours to run on so many photos, do you think it's worth trying to correct the paths and re-running?
Thanks!
Thanks for the additional information. I would for sure make a small function that first checks all paths and images; that is an easy and fast check, along the lines of the sketch below.
I am thinking of restructuring the code to do the same: first check all pathnames and whether all images are accessible and not corrupted. Waiting for hours and then finding out the very last image is corrupt is a pity.
For a checkup, you may want to resize the images to a very small frame (25x25) or so. I guess that would result in a very fast run. If that works, proceed with your current parameters.
I checked the documentation for resizing, but it doesn't provide an example of usage. Is this a one-line function call that is done in memory? I'm just hoping that resizing all ~231k images at once can be done with relative ease...
Maybe I should rename this input parameter. It is now named dim. Creating an example in the docs would indeed be beneficial (it's on my todo list).
# Import library
from undouble import Undouble
model = Undouble(dim=(25, 25))
Strangely, I am getting the old error (presumably from clustimage) about subsetting arrays again:
Error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (231648,) + inhomogeneous part.
I checked the main script for clustimage (below) and it seems to be missing the fix recommended in this SO thread.
"C:\Users\bcaradima\AppData\Local\r-miniconda\envs\r-reticulate\Lib\site-packages\clustimage\clustimage.py"
Would it be possible to apply the fix and push the changes?
I am confused now. Can you point out more specifically where it (again) breaks? And which lines should have been changed but were not?
I'm also a bit confused 😄 , but I'll try to clarify what's going on.
Earlier, I mentioned encountering this same error and found this SO thread, which mentioned a simple fix. Rather than waiting and updating the packages, I manually edited the Python scripts for undouble and clustimage directly, and changed instances where numpy arrays were constructed from:
numpy.array([1.2, "abc"])
to something like this:
numpy.array([1.2, "abc"], dtype=object)
This seemed to fix the error, but now I'm running undouble-1.2.10 and clustimage-1.5.18 without any manual edits to the scripts, and the same error is coming up again. I think one of these packages might be the issue; specifically, the error arises with:
model.group(threshold = 0)
So either the issue is in undouble's group function, or maybe group is calling something in clustimage which causes the error to arise?
I did some more poking around to figure out where the error is coming from. I ran the script line-by-line and found the error above occurs with:
model.import_data(targetdir = dir_photos)
Reviewing the source code of undouble, it appears that import_data calls a function in clustimage with the same name. I would assume this is where the subsetting error is coming from. But I'm still puzzled why it hasn't been an issue in the past.
I always try to re-use functionality from my other libraries because it helps in lowering the maintenance burden and fixing bugs. The import_data function is one of them. At the moment I am abroad and can't do debugging.
But if you did, for example, fix the bug for the group issue, please feel free to push it! That would be really helpful.
I found the bug behind the inhomogeneous shape and fixed it! It happened during the computation of one of the hashes: the output array format had changed, which caused the issue. Try updating to the latest version of clustimage and then run again!
pip install -U clustimage
Hello, I'm facing a similar problem:
[undouble] >INFO> [171595] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
[clustimage]: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171595/171595 [38:40<00:00, 73.96it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171595/171595 [01:28<00:00, 1942.33it/s]
[undouble] >INFO> Compute adjacency matrix [171595x171595] with absolute differences based on the image-hash of [phash].
Traceback (most recent call last):
File "/home/fferreira/Workspace/images/src/image_processing.py", line 45, in <module>
Main()
File "/home/fferreira/Workspace/images/src/image_processing.py", line 41, in Main
find_duplicated()
File "/home/fferreira/Workspace/images/src/image_processing.py", line 12, in find_duplicated
model.compute_hash()
File "/home/fferreira/.pyenv/versions/image_analysis/lib/python3.10/site-packages/undouble/undouble.py", line 218, in compute_hash
self.results['adjmat'] = (self.results['img_hash_bin'][:, None, :] != self.results['img_hash_bin']).sum(2)
AttributeError: 'bool' object has no attribute 'sum'
Versions
clustimage == 1.5.2
imagehash == 4.3.1
numpy == 1.23.5
undouble == 1.2.11
I've checked whether the images were corrupted, but it does not seem to be the case.
Here is my code:
from undouble import Undouble
from PIL import Image
import glob
import os
from tqdm.auto import tqdm
IMAGE_FOLDER = '../data/'
def find_duplicated():
    model = Undouble()
    model.import_data(IMAGE_FOLDER)
    model.compute_hash()
    model.group(threshold=0)
    model.move()
    return model

def check_corrupted_image(image: str) -> bool:
    try:
        im = Image.open(image)
        im.verify()
        im.close()
    except Exception as err:
        print(f' {image} -> {err} ')
        return False
    return True

def remove_corrupted_images():
    print('Check for corrupted files')
    corrupted = []
    for path in tqdm(glob.glob(f"{IMAGE_FOLDER}**", recursive=True)):
        if path.endswith('.png'):  # match by extension, not substring
            if not check_corrupted_image(path):
                corrupted.append(path)
    print(corrupted)
    for path in corrupted:
        os.remove(path)
        print(f'{path} removed')

def Main():
    remove_corrupted_images()
    find_duplicated()

if __name__ == '__main__':
    Main()
OK, I've done some digging and I guess the problem is with numpy. It seems the != operator changes its behaviour from an element-wise operator to a full object comparison for large arrays. This way:
self.results['img_hash_bin'][:, None, :] != self.results['img_hash_bin']
will return a single boolean when self.results['img_hash_bin'] has more than ~20,000 elements. At least, those are my experimental results.
I've also tried replacing != with a XOR operator; then the program breaks, complaining that it does not have 1.1 TB of memory.
Does it make sense?
Upgrading to 1.1 TB of memory is always a solution ;)
But can you maybe try this line instead? I cannot reproduce the error, but this may work.
self.results['adjmat'] = np.not_equal(self.results['img_hash_bin'][:, None, :], self.results['img_hash_bin']).sum(2)
Hi, I'm interested in matching some compressed photos (extracted from Excel files) to their original high-quality copies, and I came across your package. I've given it a try, setting up a new environment with Miniconda/Python 3 and installing undouble via pip. I followed the workflow outlined in your documentation by importing the images and computing the image hashes. However, model.group returns an error. It appears to be related to numpy array manipulation; I'm using Python 3.8 and numpy version 1.24.3. I've verified that numpy is up-to-date, but I'm wondering if there's a specific version that undouble needs to work? Thank you for your help.
BTW, does it make sense to store the compressed images with the rest of the photos, or would you suggest an alternative approach?
Thanks again.