aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

[Bug] - Handle non-PDF input documents #18

Closed d-v-dlee closed 2 years ago

d-v-dlee commented 2 years ago

Trying to run this solution (branch lmv2) on JPG inputs causes an error. Two files need to be updated. I'm submitting an issue instead of a PR since this is based on the lmv2 branch rather than main.

Required changes:

1. preprocess/inference.py

Update the SINGLE_IMAGE_CONTENT_TYPES dictionary on line 520 to include "image/jpg": "JPG".

2. src/code/inference.py

Update the thumbnails logic to fix the error logged as "Thumbnails expected either array of PNG bytestrings or 4D images array." After the logging message, add the following code:

if thumbnails.ndim == 3:
    logger.info('Resizing thumbnail of dimension 3 to dimension 4.')
    thumbnails = np.expand_dims(thumbnails, axis=0)
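
For illustration, here is a minimal standalone sketch of what that snippet does to the array shape. The 224x224 RGB thumbnail size is an assumption made up for the example, not a value taken from the repository.

```python
import numpy as np

# Hypothetical single thumbnail: (height, width, channels). The size is assumed.
thumbnails = np.zeros((224, 224, 3), dtype=np.uint8)

if thumbnails.ndim == 3:
    # Add a leading batch axis so a single image looks like a batch of one.
    thumbnails = np.expand_dims(thumbnails, axis=0)

print(thumbnails.shape)  # (1, 224, 224, 3)
```

An array that is already 4D passes through unchanged, so the check should be safe for multi-page inputs as well.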

The "not images" logic also needs to be updated on lines 428 and 445.

On line 428, the change is from if processor and not images: to if processor and images is None:. Otherwise NumPy raises an error saying that the truth value of an array is ambiguous.

Similarly, on line 445, it must be changed from **({"images": images} if images and processor else {}), to **({"images": images} if images is not None and processor else {}),
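
The ambiguity error described above can be reproduced outside the pipeline. The sketch below is only an illustration under assumptions: the array shape and the processor stand-in are hypothetical, and the helper names are invented for the example rather than taken from src/code/inference.py.

```python
import numpy as np

# Hypothetical stand-ins: a batch of one thumbnail and a dummy processor object.
images = np.zeros((1, 224, 224, 3), dtype=np.uint8)
processor = object()


def old_check(processor, images):
    """Original style: truthiness test on a NumPy array."""
    return bool(processor and not images)


def new_check(processor, images):
    """Proposed style: explicit comparison against None."""
    return bool(processor and images is None)


try:
    old_check(processor, images)
    error_message = None
except ValueError as err:
    # NumPy refuses to reduce a multi-element array to a single True/False.
    error_message = str(err)

print(error_message)
print(new_check(processor, images))  # False: images were provided

# The same explicit test keeps the keyword-argument unpacking working.
kwargs = {**({"images": images} if images is not None and processor else {})}
print(list(kwargs))  # ['images']
```

Testing against None also avoids silently treating an all-zero or empty array as "no images", which a call such as images.any() would do.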

athewsey commented 2 years ago

Thanks David!

I think the issue with wrong ndims should have been happening only when the thumbnailer endpoint returns an image variable instead of images, so have pushed the fix in https://github.com/aws-samples/amazon-textract-transformer-pipeline/commit/de7ac69f820468af0305369b86355b898e60bafc rather than editing both sides of the if page_num is None condition.

From a quick test seems like this should fix the pipeline (up until A2I review of course, which only supports PDFs for now) - but let me know if there's a case I missed!