Closed: glimmbo closed this issue 2 months ago
Hi and thanks for exploring & raising this,
Now that you mention it, I'm not sure I've actually seen cases yet where we're just reading the multiple output objects directly from S3... Projects I've seen so far have been fetching the results directly from Textract, paginating the `GetDocumentAnalysisCommand` by `NextToken` once the job is `SUCCEEDED`, to produce the array of JSONs.
Under the hood, the error you're seeing is because TRP is trying to parse a block that references `8cc77710-e530-4208-836e-45043dc93411` before it has actually parsed that block. It looks like it's probably a form `KEY_VALUE_SET` block trying to reference a `WORD`.
▶️ Does this block ID appear in only one of your JSON files? Or multiple?
▶️ If multiple, could you check the order in which you're picking up the files with `readdirSync(outputDir).map(...)`? I guess there might be a file referencing this block ID in a `Relationships` entry before the file that actually defines a block with `"Id": "8cc77710-e530-4208-836e-45043dc93411"`?
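One way to answer both questions at once is to scan the parsed JSON files in the order they are loaded and note where the ID is defined and where it is referenced. This is only a rough sketch: `locateBlockId` and the trimmed-down interfaces are hypothetical names for illustration, and it assumes each file parses to an object with a `Blocks` array whose entries carry `Id` and optional `Relationships[].Ids`, as in the Textract response JSON.

```typescript
// Minimal shapes for the parts of a Textract response this check needs
// (assumed here for illustration, not imported from any library).
interface Relationship { Type: string; Ids: string[] }
interface Block { Id: string; BlockType: string; Relationships?: Relationship[] }
interface ResponseChunk { Blocks?: Block[] }

interface BlockIdUsage { definedIn: number[]; referencedIn: number[] }

// Hypothetical helper: report which chunk indexes define `blockId`
// and which ones reference it from a Relationships entry.
function locateBlockId(chunks: ResponseChunk[], blockId: string): BlockIdUsage {
  const usage: BlockIdUsage = { definedIn: [], referencedIn: [] };
  chunks.forEach((chunk, index) => {
    const blocks = chunk.Blocks ?? [];
    if (blocks.some((b) => b.Id === blockId)) {
      usage.definedIn.push(index);
    }
    const referenced = blocks.some((b) =>
      (b.Relationships ?? []).some((rel) => rel.Ids.indexOf(blockId) !== -1)
    );
    if (referenced) {
      usage.referencedIn.push(index);
    }
  });
  return usage;
}
```

If the smallest index in `referencedIn` is lower than the index in `definedIn`, the chunks are being loaded in an order where a reference comes before its definition, which would match the out-of-order theory.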
If you're seeing references to this ID split across multiple files, and the files are being picked up out of order, I suspect the problem is TRP.js not correctly parsing docs when the response chunks are provided out of order.
In that case, I agree we should find a solution in the library, but possible temporary solutions could include:
- Fetching the results directly from Textract, repeating the `GetDocumentAnalysis` command until `NextToken` is not set.
- […] `GetDocumentAnalysis`, like this (in Python). Alternatively, you could just save the raw response JSONs with filenames that sort alphabetically?
- […] `Page` number now? Maybe you could sort the chunks in ascending order of the `Page` field of their first `Blocks` entry?)

If you're only seeing `8cc77710-e530-4208-836e-45043dc93411` appear in one of your input JSON files, then something weirder is going on, and it's probably a different bug in the parsing logic...
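For the option of fetching directly from Textract, the `NextToken` loop could look roughly like the sketch below. To keep it self-contained, the actual API call is injected as `fetchPage`, a hypothetical callback; with the AWS SDK v3 it would presumably wrap something like `client.send(new GetDocumentAnalysisCommand({ JobId, NextToken }))`. `AnalysisPage` and `collectAllPages` are likewise illustrative names, and the sketch assumes the job has already reached `SUCCEEDED`.

```typescript
// Only the response fields this loop relies on (assumed shape).
interface AnalysisPage {
  JobStatus?: string;
  NextToken?: string;
  Blocks?: unknown[];
}

// Hypothetical callback: fetches one page of results for a given token.
type FetchPage = (nextToken?: string) => Promise<AnalysisPage>;

// Keep requesting pages until the response no longer carries a NextToken.
// Pages are collected in the order the service returns them.
async function collectAllPages(fetchPage: FetchPage): Promise<AnalysisPage[]> {
  const pages: AnalysisPage[] = [];
  let nextToken: string | undefined = undefined;
  do {
    const page: AnalysisPage = await fetchPage(nextToken);
    pages.push(page);
    nextToken = page.NextToken;
  } while (nextToken);
  return pages;
}
```

Because each page is appended as it arrives, the resulting array is already in the order Textract produced it, so no later sorting step is needed.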
Simply sorting the file paths of the output directory did the trick! The complete new `TextractDocument` now loads successfully.
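For anyone landing here later, a minimal sketch of that kind of sort follows. It assumes the downloaded result files have numeric names (for example `9` and `10`, possibly with an extension) and that any dot-prefixed entries in the directory are not result files. `sortOutputFileNames` is a hypothetical helper name, not part of the parser library.

```typescript
// Hypothetical helper: order Textract output file names numerically,
// so that "9" comes before "10" (a plain string sort would reverse them).
function sortOutputFileNames(fileNames: string[]): string[] {
  return fileNames
    // Assumption: dot-prefixed entries are not result chunks, so skip them.
    .filter((name) => name.charAt(0) !== ".")
    // `numeric: true` compares digit runs by value rather than character by character.
    .sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
}
```

The sorted names can then be mapped to parsed JSON objects in place of the raw `readdirSync(outputDir)` result, before passing the array to `TextractDocument`.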
I appreciate your detailed response @athewsey. Just to not leave your questions unanswered:
▶️ Does this block ID appear in only one of your JSON files? Or multiple?
Multiple: the files are numerically adjacent (9 and 10), but not adjacent in the default order returned by `readdirSync`.
▶️ If multiple, could you check the order that you're picking up the files with readdirSync(outputDir).map(...)? I guess there might be a file referencing this block ID in a Relationships, before the file that actually defines a block with "Id": "8cc77710-e530-4208-836e-45043dc93411"?
This was the solution, correcting the order of the array of output file paths ✅
I'm using the NodeJS version, "amazon-textract-response-parser": "^0.4.1"
My process is:
1) `StartDocumentAnalysisCommand` with params
2) Poll for completion with `GetDocumentAnalysisCommand` (I realize the cost implications here, working on a POC)
3) Download the results from the `outputBucket`
4) Parse the downloaded output files and load them with `TextractDocument`, typecasting the array passed as per the suggestion
But I get this error:
What I take this to mean is that the output from the Textract operation hasn't maintained Block ID consistency across all the files created... though I did see this "In most cases" message in the `amazon-textract-response-parser` README:

Am I missing something in the Textract operation parameters that would fix those IDs? Or is there something else needed when instantiating the `TextractDocument`? Or do I need to pass it the raw, paginated response from `GetDocumentAnalysisCommand` in order for it to work? I thought that would be strange, considering there are mutation functions available with `amazon-textract-response-parser`.

Thanks in advance.