deepfates / memery

Search over large image datasets with natural language and computer vision!
https://deepfates.com/memery
MIT License
518 stars, 27 forks

Unhelpful error message on bad image file #13

Closed robertskmiles closed 3 years ago

robertskmiles commented 3 years ago

I ran memery on a large directory of images, and after some time it failed:

rob@tortuga ~/Dropbox/Camera Uploads$ memery . 'music' --n 5
Loaded 0 encodings
Encoding 25632 new images
...
 15%|██████████▎                                                        | 31/201 [06:00<32:56, 11.63s/it]
Traceback (most recent call last):
  File "/home/rob/.local/bin/memery", line 8, in <module>
    sys.exit(__main__())
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 23, in __main__
    app()
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 16, in search_folder
    ranked = queryFlow(path, query=query)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 54, in queryFlow
    dbpath, treepath = indexFlow(root)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 30, in indexFlow
    new_embeddings = image_encoder(crafted_files, device)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/encoder.py", line 17, in image_encoder
    for images, labels in tqdm(img_loader):
  File "/home/rob/.local/lib/python3.9/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1065, in _next_data
    return self._process_data(data)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 43, in __getitem__
    sample = self.loader(path)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 27, in pil_loader
    return img.convert('RGB')
  File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 904, in convert
    self.load()
  File "/usr/lib/python3.9/site-packages/PIL/ImageFile.py", line 249, in load
    raise OSError(
OSError: image file is truncated (11 bytes not processed)

This stack trace is very extensive, but doesn't tell me the one piece of information I actually want, which is the name of the file which has broken memery.

As a test I modified pil_loader in crafter.py to look like this:

from PIL import Image

def pil_loader(path: str) -> Image.Image:
    # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
    with open(path, 'rb') as f:
        img = Image.open(f)
        try:
            return img.convert('RGB')
        except OSError:
            # report which file failed, then re-raise with the original traceback
            print("Failed to convert file '%s'" % path)
            raise

and that worked to tell me the problematic file, which I could delete. But the crafter.py file says not to edit it, so this isn't a usable patch. Also, it should probably skip over the bad file and keep going rather than crashing out, especially since this process takes a long time and can't pick up where it left off if you have to restart it.
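The skip-and-continue behaviour suggested above could look roughly like the sketch below. It is not memery's code: `load_all` and its `loader` parameter are hypothetical names, and the sketch assumes bad files surface as `OSError` (the truncated-file error in the traceback is one, and Pillow's `UnidentifiedImageError` is a subclass of it).

```python
from typing import Any, Callable, Iterable, List, Tuple


def load_all(
    paths: Iterable[str],
    loader: Callable[[str], Any],
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, OSError]]]:
    """Load every path, collecting failures instead of stopping at the first one."""
    loaded = []
    failed = []
    for path in paths:
        try:
            loaded.append((path, loader(path)))
        except OSError as err:
            # Remember which file broke and why, then move on to the next one.
            failed.append((path, err))
    return loaded, failed
```

Calling `load_all(paths, pil_loader)` would return the decoded images plus a list of `(path, error)` pairs, which could be printed once at the end of the run instead of aborting it.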

robertskmiles commented 3 years ago

Relatedly, if a file has an image file extension but is actually empty, you get this traceback:

Traceback (most recent call last):
  File "/home/rob/.local/bin/memery", line 8, in <module>
    sys.exit(__main__())
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 23, in __main__
    app()
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 16, in search_folder
    ranked = queryFlow(path, query=query)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 54, in queryFlow
    dbpath, treepath = indexFlow(root)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 30, in indexFlow
    new_embeddings = image_encoder(crafted_files, device)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/encoder.py", line 17, in image_encoder
    for images, labels in tqdm(img_loader):
  File "/home/rob/.local/lib/python3.9/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1065, in _next_data
    return self._process_data(data)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/rob/.local/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
PIL.UnidentifiedImageError: Caught UnidentifiedImageError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 48, in __getitem__
    sample = self.loader(path)
  File "/home/rob/.local/lib/python3.9/site-packages/memery/crafter.py", line 26, in pil_loader
    img = Image.open(f)
  File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 2967, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BufferedReader name='dual/8c6598d6ef98f5eaeba38dbaf5d381b0.jpg'>

Which is good in that it gives the file path, but bad in that it also crashes out the whole process instead of just skipping the empty file.

deepfates commented 3 years ago

Thank you! You're right, it needs better error handling, especially in indexFlow, which can take a long time to run. Did you get it to work on your dataset after removing the offending files?

As for not editing the Python files directly, that's because this is developed with nbdev. So that patch will work fine; it just needs to be added in the Jupyter notebook that is the source of truth for the code, rather than in the finished .py files. I guess I should put some stuff in the README about how to help develop the project...

robertskmiles commented 3 years ago

I haven't been able to figure out a clean way to implement skipping over broken files, so it's a slow process of running it until it fails, deleting the problem file, and trying again. The latest issue is actually the Decompression Bomb error from issue #10, with an image which is very large but isn't actually corrupted or invalid, so I don't want to just delete it.
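For a file that is large but valid, Pillow's decompression-bomb check can be relaxed rather than deleting the image. `Image.MAX_IMAGE_PIXELS` is a real module-level setting in Pillow. The `pixel_limit` helper and the 500-million-pixel figure below are hypothetical, and whether memery should raise the limit at all is a judgment call this sketch does not settle.

```python
from contextlib import contextmanager

from PIL import Image


@contextmanager
def pixel_limit(max_pixels):
    """Temporarily change Pillow's decompression-bomb threshold."""
    previous = Image.MAX_IMAGE_PIXELS
    Image.MAX_IMAGE_PIXELS = max_pixels
    try:
        yield
    finally:
        # Restore the old threshold even if loading raised an error.
        Image.MAX_IMAGE_PIXELS = previous


# Hypothetical usage: allow images of up to 500 million pixels in this block.
with pixel_limit(500_000_000):
    pass  # open and convert the large image here
```

Setting the value to `None` disables the check entirely, which is only sensible for files you trust. The setting applies to the whole process, so the limit holds only while the block is active.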

deepfates commented 3 years ago

Yes, these are all connected due to Pillow preferring to error out rather than silently drop images. I think it shouldn't matter to the index if we drop images, since this happens before the index is written. So it's just a matter of ignoring PIL errors and passing on to the next image.

Maybe there's a way to check the image file for corruption in the loader? I'll work on this some today.
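One way such a corruption check might be written is sketched below. It borrows the name `verify_image` from later in this thread, but the body is an assumption, not the code that shipped in memery. It fully decodes each file with Pillow and reports failures by path.

```python
from PIL import Image


def verify_image(path: str) -> bool:
    """Return True if Pillow can fully decode the file at `path`, else False."""
    try:
        with Image.open(path) as img:
            # load() decodes all pixel data, so truncated files fail here
            # rather than later inside the DataLoader.
            img.load()
        return True
    except Exception as err:
        # Pillow raises several exception types for bad files (OSError and its
        # UnidentifiedImageError subclass, among others), so catch broadly.
        print(f"Skipping bad file {path}: {err}")
        return False
```

Decoding every file is thorough but slow on large folders. The broad `except` would also skip oversized images that trip the decompression-bomb check, even though those files are valid.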

deepfates commented 3 years ago

Okay @robertskmiles, I think I have fixed this in 0.0.7.

There's not an elegant way to verify an image inside of a list comprehension, it seems. So I just made a try/except loop like yours above -- except I put it in the loader instead of the crafter, so that corrupted files never even make it in the door. See 5ed4f4856c46786a68ad4fa9fd07a4f2954bae2c near the bottom

Let me know if this runs against your datasets and if it does we can close the issue! thanks

deepfates commented 3 years ago

Strike that, better to put it in pil_loader in the crafter module. It was taking far too long to check the folders for image files the other way. Now it only checks image files right before crafting them, so only one time each, rather than once each time it boots up.

Pushing this to PyPI soon as 0.0.8, but then I'm going to start developing mostly on GitHub and releasing less often, I think. I'm not really sure how to maintain a package, but I think having a develop zone and a release zone makes sense...

deepfates commented 3 years ago

Well, no, that made it slow also. The compromise was to put it in the archive_loader as a boolean check on the new_files coming in. It's not as fast as it once was for encoding, but much faster than checking all the images before crafting them. And the pil_loader can't be allowed to return None because it's used in the DataLoader, which wants batches of already cleaned data.

Anyway, I'll close this once I've tested on a couple of other machines, but I think it's as good as it can get right now.

robertskmiles commented 3 years ago

Regarding list comprehensions, might it be possible to use a pattern like [x for x in list_of_xs if x], so that list_of_xs can have None values and they'll just be skipped over? Or [x for x in list_of_xs if is_valid_image(x)]?

deepfates commented 3 years ago

Yes :D that's essentially what I did, with this line in loader: new_files = [(str(path), slug) for path, slug in filepaths if slug not in archive_slugs and verify_image(path)]
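The quoted line can be made runnable on its own as in the sketch below. The function wrapper `select_new_files`, the type hints, and passing `verify_image` in as a parameter are assumptions made for illustration. Only the comprehension itself comes from the thread.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Set, Tuple


def select_new_files(
    filepaths: Iterable[Tuple[Path, str]],
    archive_slugs: Set[str],
    verify_image: Callable[[Path], bool],
) -> List[Tuple[str, str]]:
    """Keep files that are not yet archived and that pass verification."""
    return [
        (str(path), slug)
        for path, slug in filepaths
        # `and` short-circuits: verify_image only runs for unarchived slugs.
        if slug not in archive_slugs and verify_image(path)
    ]
```

Because the slug test comes first, files already in the archive are never re-verified. A file that fails verification never gets an archive entry, though, so it is checked again on every run.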

deepfates commented 3 years ago

It is kind of annoying that it prints the errors every time now. And if there's an errored file it re-builds the treemap because it thinks it needs to re-index some new images, even though 0 images will actually be passed. Hmm...

deepfates commented 3 years ago

This has been ~solved, so I'm closing. We can open new issues for similar problems.