Open matwerber1 opened 5 years ago
Found the issue. The function extractTextBody() in textract_util.py assumes that a scanned page will have child blocks (e.g. LINE, WORD) listed in its Relationships property. However, if a scanned document has a blank page, that page's block has no Relationships key at all.
Here's an example of a PAGE block from a blank scanned page that triggers the error:
{
    'BlockType': 'PAGE',
    'Geometry': {
        'BoundingBox': {
            'Width': 1.0,
            'Height': 1.0,
            'Left': 0.0,
            'Top': 0.0
        },
        'Polygon': [
            {
                'X': 1.0,
                'Y': 0.622184693813324
            },
            {
                'X': 0.3762473165988922,
                'Y': 1.0
            },
            {
                'X': 0.0,
                'Y': 0.3771035075187683
            },
            {
                'X': 0.6218673586845398,
                'Y': 0.0
            }
        ]
    },
    'Id': '25bfd28a-ddea-40db-9d90-9c3237510dc5',
    'Page': 10
}
In my example, I am testing with documents provided by financial institutions that occasionally have blank pages at the end of a document or between sections. There's no easy way to avoid these blank pages, so my suggestion is that extractTextBody() be modified to check whether the Relationships key is present before proceeding.
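For illustration, here is a rough sketch of the kind of guard I have in mind. It is not the actual code from textract_util.py: the helper names get_child_ids() and extract_page_lines() and the blocks_by_id lookup are hypothetical, and I'm assuming blocks are plain dicts shaped like the PAGE block above, with children listed under a 'Relationships' entry of type 'CHILD'.

```python
def get_child_ids(block):
    """Return the IDs of a block's children, or [] if it has none.

    Hypothetical helper: a blank page has no 'Relationships' key,
    so we fall back to an empty list instead of raising KeyError.
    """
    child_ids = []
    for relationship in block.get("Relationships") or []:
        if relationship.get("Type") == "CHILD":
            child_ids.extend(relationship.get("Ids", []))
    return child_ids


def extract_page_lines(page_block, blocks_by_id):
    """Collect the text of LINE blocks that are children of a PAGE block.

    blocks_by_id is an assumed dict mapping block Id -> block.
    """
    lines = []
    for child_id in get_child_ids(page_block):
        child = blocks_by_id.get(child_id)
        if child is not None and child.get("BlockType") == "LINE":
            lines.append(child.get("Text", ""))
    return lines


# A blank page like the one above: no 'Relationships' key.
blank_page = {
    "BlockType": "PAGE",
    "Id": "25bfd28a-ddea-40db-9d90-9c3237510dc5",
    "Page": 10,
}

# A made-up page with one line of text, for comparison.
line_block = {"BlockType": "LINE", "Id": "line-1", "Text": "Hello"}
text_page = {
    "BlockType": "PAGE",
    "Id": "page-1",
    "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}],
}
blocks_by_id = {"line-1": line_block}
```

With this approach a blank page just yields an empty list of lines, so the rest of the document can still be processed.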
Hi,
I'm running the demo project with a PDF and receiving the error below. I'm working on investigating, but wanted to open the issue for tracking.