Hi,
I found an issue while extracting tables from a document using AnalyzeDocument. Textract OCR identified the table correctly, with an accurate bounding box. However, when I crop using that same bounding box and run text extraction on the crop, some information is missing. Here are samples showing the problem.
Below is the cropped image I produced using the BoundingBox info from the Textract OCR output.![temp_crop](https://github.com/aws-samples/amazon-textract-code-samples/assets/31265978/ef35d9dd-d148-46e5-a403-a5600348806d)
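For reference, this is roughly how such a crop can be produced from Textract's normalized BoundingBox (a minimal sketch using Pillow; the function name and file path are illustrative, not from the original code):

```python
from PIL import Image

def crop_from_bbox(page_image_path, bbox):
    """Crop a region given Textract's normalized BoundingBox dict
    (keys: Left, Top, Width, Height, each a ratio of page size)."""
    page = Image.open(page_image_path)
    w, h = page.size
    # Textract coordinates are ratios of page width/height, so scale to pixels
    left = int(bbox["Left"] * w)
    top = int(bbox["Top"] * h)
    right = int((bbox["Left"] + bbox["Width"]) * w)
    bottom = int((bbox["Top"] + bbox["Height"]) * h)
    return page.crop((left, top, right, bottom))
```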
AnalyzeDocument output (after some post-processing, such as converting the table to Markdown): "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"
If you compare the image and the output closely, "Internal carbo" is missing from the first row, and "Incorporation" and "remuneration" are missing from the first cell of the second row.
To work around this, I tried pasting the crop onto a canvas with the size of the page the table was fetched from, producing the image below. It still gives the same output.
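The canvas step described above can be sketched as follows (assuming Pillow; the function name and offset handling are illustrative):

```python
from PIL import Image

def paste_on_canvas(crop, page_size, offset):
    """Paste the cropped table onto a white canvas of the original page size,
    at the position the crop originally occupied on the page."""
    canvas = Image.new("RGB", page_size, "white")
    canvas.paste(crop, offset)
    return canvas
```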
Next, I went ahead and added thresholding, which produced this image:![temp_crop (2)](https://github.com/aws-samples/amazon-textract-code-samples/assets/31265978/5f8b8683-0cde-44da-a394-37b0f455026b)
Interestingly, this produced the correct output - "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"
Here is how I created the sample threshold (the input file name is illustrative):

```python
import cv2

# Load the crop as grayscale; a fixed threshold of 127 maps pixels to pure black/white
image = cv2.imread("temp_crop.png", cv2.IMREAD_GRAYSCALE)
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
```
However, this will be a problem for colored images; the fixed binary threshold I proposed won't work there and can actually make things worse.
That is the issue I found, along with the hack I used to work around it. If anyone has a better approach, please feel free to post it. I would also ask the AWS team to take a look and fix this issue.