aws-samples / amazon-textract-code-samples

Amazon Textract Code Samples
MIT No Attribution
406 stars 263 forks

Textract Analyze document for Tables issue #56

Open krishna-vicky96 opened 2 months ago

krishna-vicky96 commented 2 months ago

Hi,

I found an issue while extracting tables from a document using Analyze Document. Textract identified the table properly, with the correct bounding box (Bbox). However, when I use that same info to extract the text, some of it is missing. Here are the samples.

The image below is the crop I got using the Bbox info from the Textract OCR output. temp_crop
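For readers reproducing the crop step, here is a minimal sketch of turning a Textract bounding box into pixel coordinates. Textract reports `Left`, `Top`, `Width`, and `Height` as ratios of the page size, so they must be scaled by the image dimensions. The function name `bbox_to_pixels` and the `padding` parameter are hypothetical, not part of any AWS SDK; the padding is only an assumption worth testing, since text that touches the crop edge may be harder to recognize.

```python
import math


def bbox_to_pixels(bbox, image_width, image_height, padding=0):
    """Convert a normalized Textract-style BoundingBox to pixel coordinates.

    bbox holds Left/Top/Width/Height as ratios of the page size (0 to 1).
    padding (hypothetical helper argument) widens the box by that many
    pixels on every side. The result is clamped to the image bounds and
    returned as (left, top, right, bottom).
    """
    # Round outward (floor for the start, ceil for the end) so the crop
    # never cuts into the reported box.
    left = math.floor(bbox["Left"] * image_width) - padding
    top = math.floor(bbox["Top"] * image_height) - padding
    right = math.ceil((bbox["Left"] + bbox["Width"]) * image_width) + padding
    bottom = math.ceil((bbox["Top"] + bbox["Height"]) * image_height) + padding

    # Clamp so the padded box stays inside the image.
    left = max(left, 0)
    top = max(top, 0)
    right = min(right, image_width)
    bottom = min(bottom, image_height)
    return left, top, right, bottom
```

With an image loaded as an array, the crop would then be `image[top:bottom, left:right]`.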

Analyze Document output (after some postprocessing, such as adding Markdown formatting): "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"

If you compare the image with the output, the first cell of each row is missing text: "Internal Carbo" in the first row, and "Incorporation" and "Remuneration" in the second row.

To work around this, I tried placing the crop on a canvas with the size of the page the table came from, which produced the image below. It still gives me the same output.
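The canvas step can be sketched as follows, assuming it means pasting the crop onto a blank white image of the original page size at the table's original position. The function `paste_on_canvas` and its parameters are hypothetical names used for illustration only.

```python
import numpy as np


def paste_on_canvas(crop, page_height, page_width, top, left, fill=255):
    """Paste a cropped image onto a blank canvas of the page size.

    crop is a grayscale (H, W) or color (H, W, C) array. top/left give the
    pixel position of the crop's upper-left corner on the canvas, and fill
    is the background value (255 is white for 8-bit images).
    """
    height, width = crop.shape[:2]
    if top < 0 or left < 0 or top + height > page_height or left + width > page_width:
        raise ValueError("crop does not fit on the canvas at this position")

    # Keep the channel dimension (if any) so color crops also work.
    canvas_shape = (page_height, page_width) + crop.shape[2:]
    canvas = np.full(canvas_shape, fill, dtype=crop.dtype)
    canvas[top:top + height, left:left + width] = crop
    return canvas
```

This keeps the table at the same scale and position it had on the page, which is why it is a natural thing to try. As reported above, it did not change the result in this case.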

temp_crop (1)

Next, I applied thresholding and got this image as the output. temp_crop (2)

Interestingly, this recovered the missing text: "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"

Here is how I created the thresholded image: _, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

However, this will be a problem with colored images: the workaround I proposed won't work there, as it makes things worse.
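One thing that might be worth trying for colored images is converting to grayscale before thresholding, so the threshold acts on a single brightness channel rather than on each color channel separately. The sketch below is untested against Textract and is only an assumption about what could help. It uses plain NumPy with hypothetical helper names (`to_gray`, `binary_threshold`), a BGR channel order as used by OpenCV, and the same rule as `cv2.THRESH_BINARY` (pixels above the threshold become the maximum value, the rest become 0).

```python
import numpy as np


def to_gray(image_bgr):
    """Convert an 8-bit BGR image (H, W, 3) to grayscale (H, W).

    Uses the common luminance weights 0.114 (B), 0.587 (G), 0.299 (R).
    """
    weights = np.array([0.114, 0.587, 0.299])
    gray = image_bgr.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


def binary_threshold(gray, thresh=127, maxval=255):
    """Binary threshold: pixels strictly above thresh become maxval, else 0."""
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)


# Tiny BGR example: one white, one black, and one pure-red pixel.
image = np.array([[[255, 255, 255], [0, 0, 0], [0, 0, 255]]], dtype=np.uint8)
binary_image = binary_threshold(to_gray(image))
```

In OpenCV the same two steps would be `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)` followed by `cv2.threshold`. A fixed threshold of 127 will not suit every image, so whether this helps depends on the colors and contrast of the document.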

This is the issue I found, along with a hack that works around it. If anyone has a better approach, please feel free to post it. I ask the AWS team to take a look and fix this issue.