Closed robertskmiles closed 3 years ago
Relatedly, if a file has an image file extension but is actually empty, you get this traceback:
Traceback (most recent call last):
  File "/home/rob/.local/bin/memery", line 8, in <module>
    sys.exit(__main__())
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 23, in __main__
    app()
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 16, in search_folder
    ranked = queryFlow(path, query=query)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 54, in queryFlow
    dbpath, treepath = indexFlow(root)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 30, in indexFlow
    new_embeddings = image_encoder(crafted_files, device)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/encoder.py", line 17, in image_encoder
    for images, labels in tqdm(img_loader):
  File "/home/rob/.local/lib/python3.9/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1065, in _next_data
    return self._process_data(data)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
PIL.UnidentifiedImageError: Caught UnidentifiedImageError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 48, in __getitem__
    sample = self.loader(path)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 26, in pil_loader
    img = Image.open(f)
  File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 2967, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BufferedReader name='dual/8c6598d6ef98f5eaeba38dbaf5d381b0.jpg'>
This is good in that it gives the file path, but bad in that it crashes the whole process instead of just skipping the empty file.
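The failure mode is easy to reproduce in isolation. A minimal sketch (the temporary filename is made up for the example), showing that Pillow raises `UnidentifiedImageError` for an empty file with an image extension:

```python
import tempfile
from pathlib import Path

from PIL import Image, UnidentifiedImageError

# Create an empty file with a .jpg extension, as in the traceback above.
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "empty.jpg"
    empty.touch()
    try:
        Image.open(empty)
    except UnidentifiedImageError as exc:
        # Pillow cannot identify the format, so it raises instead of
        # returning a placeholder image.
        print(f"skipped: {exc}")
```

Inside a `DataLoader` worker this exception propagates up and kills the whole indexing run, which is why skipping at load time matters.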
Thank you! You're right, it needs better error handling, especially in `indexFlow`, which can take a long time to run. Did you get it to work on your dataset after removing the offending files?
As for not editing the Python files directly: that's because this project is developed with nbdev. Your patch will work fine, it just needs to be added in the Jupyter notebook that is the source of truth for the code, rather than in the generated .py files. I guess I should put something in the README about how to help develop on the project...
I haven't been able to figure out a clean way to implement skipping over broken files, so it's a slow process of running it until it fails, deleting the problem file, and trying again. The latest failure is actually the decompression bomb error from issue #10, with an image that is very large but isn't actually corrupted or invalid, so I don't want to just delete it.
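For the decompression-bomb case specifically, one option other than deleting a legitimate image is to raise Pillow's pixel limit. This is only a sketch and assumes the images come from a trusted local source, since the limit exists to guard against malicious files:

```python
from PIL import Image

# Pillow refuses to open images above a pixel limit to guard against
# decompression bombs. For trusted local files, the limit can be raised,
# or disabled entirely by setting it to None.
Image.MAX_IMAGE_PIXELS = 1_000_000_000  # raise the cap to one billion pixels
```

Disabling the limit globally is a blunt instrument; raising it to a known-safe ceiling is usually the better compromise.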
Yes, these are all connected, because Pillow prefers to error out rather than silently drop images. I think it shouldn't matter to the index if we drop images, since this happens before the index is written. So it's just a matter of ignoring PIL errors and moving on to the next image.
Maybe there's a way to check the image file for corruption in the loader? I'll work on this some today.
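One way to do that check in the loader. The thread later settles on a `verify_image` helper; a minimal version might look like this sketch, using Pillow's `Image.verify()`, which does a cheap integrity check without decoding the full image:

```python
from PIL import Image, UnidentifiedImageError

def verify_image(path):
    """Return True if Pillow can identify and verify the file, else False."""
    try:
        with Image.open(path) as img:
            img.verify()  # checks file integrity without fully decoding pixels
        return True
    except (UnidentifiedImageError, OSError, SyntaxError):
        # Unrecognized format, truncated file, or corrupt image data.
        return False
```

A loader can then filter its file list with this predicate before anything reaches the `DataLoader`, so broken files never enter a batch.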
Okay @robertskmiles, I think I have fixed this in 0.0.7.
There doesn't seem to be an elegant way to verify an image inside a list comprehension, so I just made a try/except loop like yours above, except I put it in the loader instead of the crafter, so that corrupted files never even make it in the door. See 5ed4f4856c46786a68ad4fa9fd07a4f2954bae2c near the bottom.
Let me know if this runs against your datasets, and if it does we can close the issue! Thanks.
Strike that, it's better to put it in `pil_loader` in the crafter module. It was taking far too long to check the folders for image files the other way. Now it only checks image files right before crafting them, so only once each, rather than once every time it boots up.
Pushing this to PyPI soon as 0.0.8, but then I'm going to start developing mostly on GitHub and releasing less often, I think. I'm not really sure how to maintain a package, but I think having a develop zone and a release zone makes sense...
Well, no, that made it slow also. The compromise was to put it in the `archive_loader` as a boolean check on the `new_files` coming in. It's not as fast as it once was for encoding, but it's much faster than checking all the images before crafting them. And `pil_loader` can't be allowed to return None, because it's used in the `DataLoader`, which expects batches of already-cleaned data.
Anyway, I'll close this once I've tested on a couple of other machines, but I think it's as good as it can get right now.
Regarding list comprehensions, might it be possible to use a pattern like `[x for x in list_of_xs if x]`? Then `list_of_xs` can have `None` values and they'll just be skipped over. Or `[x for x in list_of_xs if is_valid_image(x)]`?
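A quick sketch of that filtering pattern, with one caveat: a bare `if x` also drops other falsy values such as `0` or empty strings, so `if x is not None` is the safer guard when only `None` should be skipped:

```python
list_of_xs = ["a.jpg", None, "b.jpg", None, "c.jpg"]

# Bare truthiness check: drops None (and any other falsy value).
kept = [x for x in list_of_xs if x]
print(kept)  # ['a.jpg', 'b.jpg', 'c.jpg']

# Explicit check: drops only None, keeps falsy-but-valid entries.
kept_strict = [x for x in list_of_xs if x is not None]
print(kept_strict)  # ['a.jpg', 'b.jpg', 'c.jpg']
```

For file paths the two are equivalent in practice, since a non-empty path string is always truthy.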
Yes :D That's essentially what I did, with this line in the loader:
`new_files = [(str(path), slug) for path, slug in filepaths if slug not in archive_slugs and verify_image(path)]`
It is kind of annoying that it prints the errors every time now. And if there's an errored file, it re-builds the treemap because it thinks it needs to re-index some new images, even though zero images will actually be passed. Hmm...
This has been more or less solved, so I'm closing; we can open new issues for related problems.
I ran memery on a large directory of images, and after some time it failed.
The stack trace is very extensive, but doesn't tell me the one piece of information I actually want: the name of the file that broke memery.
As a test I modified `pil_loader` in `crafter.py` to look like this, and that worked to tell me the problematic file, which I could delete. But the `crafter.py` file says not to edit it, so this isn't a usable patch. Also, it should probably skip over the bad file and keep going rather than crashing out, especially since this process takes a long time and isn't able to pick up where it left off if you have to restart it.
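The snippet itself didn't survive in this thread, but based on the surrounding discussion the patch presumably resembled the following sketch: a `pil_loader` that reports the offending path before re-raising. This is a hypothetical reconstruction, not the actual code from `crafter.py`:

```python
from PIL import Image, UnidentifiedImageError

def pil_loader(path):
    """Open an image file, printing the path if Pillow can't identify it."""
    with open(path, "rb") as f:
        try:
            img = Image.open(f)
            return img.convert("RGB")
        except UnidentifiedImageError:
            # Surface the offending path before propagating the error,
            # so the user knows which file to inspect or delete.
            print(f"Cannot identify image file: {path}")
            raise
```

Printing and re-raising keeps the original crash behavior while adding the missing diagnostic; the later `verify_image` approach goes one step further and skips the file entirely.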