aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Unable to construct TextractDocument with multi-page output downloaded from S3 #182

Closed: glimmbo closed this issue 2 months ago

glimmbo commented 2 months ago

I'm using the NodeJS version, "amazon-textract-response-parser": "^0.4.1"

My process is: 1) StartDocumentAnalysisCommand with params

{
      DocumentLocation: {
        S3Object: {
          Bucket: inputBucket,
          Name: ApplicationPath,
        },
      },
      FeatureTypes: ["TABLES", "FORMS", "LAYOUT"],
      OutputConfig: {
        S3Bucket: outputBucket,
      },
  }

2) Poll for completion with GetDocumentAnalysisCommand (I realize the cost implications here; just working on a POC)

// Assumes textractClient is an AWS SDK v3 TextractClient and
// GetDocumentAnalysisCommand is imported from "@aws-sdk/client-textract"
const delay = (ms: number) => new Promise((res) => setTimeout(res, ms));

async function pollForCompletion({ JobId }: { JobId: string }) {
  const { JobStatus, StatusMessage } = await textractClient.send(
    new GetDocumentAnalysisCommand({
      JobId,
      MaxResults: 1000,
    })
  );

  // Poll every 15 seconds until the job leaves IN_PROGRESS
  if (JobStatus === "IN_PROGRESS") {
    console.log("...");
    await delay(15000);
    await pollForCompletion({ JobId });
  } else {
    console.log(`Status: ${JobStatus}`);
    console.log(`Message: ${StatusMessage}`);
  }
}
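The recursive poller can equally be written as a loop. Here is a sketch with the Textract call abstracted behind a callback (the `getStatus` and `waitForJob` names are illustrative, not part of any library):

```typescript
// Loop-based polling sketch. `getStatus` stands in for a call like
// textractClient.send(new GetDocumentAnalysisCommand({ JobId })) that
// resolves to the JobStatus string; `waitMs` is the polling interval.
async function waitForJob(
  getStatus: () => Promise<string | undefined>,
  waitMs = 15000
): Promise<string | undefined> {
  const delay = (ms: number) => new Promise((res) => setTimeout(res, ms));
  let status = await getStatus();
  while (status === "IN_PROGRESS") {
    await delay(waitMs);
    status = await getStatus();
  }
  return status; // e.g. "SUCCEEDED", "FAILED", or "PARTIAL_SUCCESS"
}
```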

3) Download the results from the outputBucket

async function getOutputFromS3({ JobId }: { JobId: string }) {
  const outputDir = `textract_output/${JobId}`;

  // NB: ListObjectsCommand returns at most 1,000 keys per call; paginate
  // (e.g. ListObjectsV2 with ContinuationToken) for larger result sets
  const { Contents = [] } = await s3Client.send(
    new ListObjectsCommand({ Bucket: outputBucket, Prefix: outputDir })
  );

  mkdirSync(`./${outputDir}`, { recursive: true });

  await Promise.all(
    Contents.map(async ({ Key }) => {
      // Textract also writes a ".s3_access_check" marker object; skip it
      if (!Key?.includes(".s3_access_check")) {
        console.log({ Key });

        const cmd = new GetObjectCommand({ Bucket: outputBucket, Key });
        const { Body } = await s3Client.send(cmd);

        const jsonString = await Body?.transformToString();

        if (jsonString?.length) writeFileSync(`./${Key}.json`, jsonString);
      }
    })
  );

  return { outputDir };
}

4) Parse the downloaded output files and load them with TextractDocument, typecasting the array passed as per the suggestion

function loadAllOutputFilesIntoTextractDocument({
  outputDir,
}: {
  outputDir: string;
}) {
  const collectedResponses = readdirSync(outputDir).map<
    ApiAsyncJobOuputSucceded[]
  >((filePath) =>
    JSON.parse(readFileSync(`${outputDir}/${filePath}`, { encoding: "utf-8" }))
  );

  return new TextractDocument(collectedResponses as unknown as ApiResponsePage);
}

But I get this error:

loadAllFilesIntoTextractDocument:
Error: Missing parser item for block ID 8cc77710-e530-4208-836e-45043dc93411
    at Page.getItemByBlockId (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:300:13)
    at FieldKeyGeneric.listWords (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:324:38)
    at FieldKeyGeneric.get text [as text] (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:341:19)
    at <removed>/node_modules/amazon-textract-response-parser/src/form.ts:317:34
    at Array.forEach (<anonymous>)
    at new FormGeneric (<removed>/node_modules/amazon-textract-response-parser/src/form.ts:314:15)
    at Page._parse (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:272:18)
    at new Page (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:227:10)
    at <removed>/node_modules/amazon-textract-response-parser/src/document.ts:1495:28
    at Array.forEach (<anonymous>)

What I take this to mean is that the output from the Textract operation hasn't maintained block ID consistency across all the files created... though I did see this "In most cases" note in the amazon-textract-response-parser README:

In most cases, providing an array of response objects is also supported (for use when a large Amazon Textract response was split/paginated).

Am I missing something in the Textract operation parameters that would fix those IDs? Or is there something else needed when instantiating the TextractDocument? Or do I need to pass it the raw, paginated responses from GetDocumentAnalysisCommand in order for it to work? I thought that would be strange considering there are mutation functions available in amazon-textract-response-parser.

Thanks in advance.

athewsey commented 2 months ago

Hi and thanks for exploring & raising this,

Now that you mention it, I'm not sure I've actually seen cases yet where the multiple output objects are read directly from S3... Projects I've seen so far have been fetching the results directly from Textract, paginating GetDocumentAnalysisCommand by NextToken once the job is SUCCEEDED, to produce the array of JSONs.

Under the hood, the error you're seeing is because TRP is trying to parse a block that references 8cc77710-e530-4208-836e-45043dc93411 before it has actually parsed that block: it looks like a form KEY_VALUE_SET block trying to reference a WORD.

▶️ Does this block ID appear in only one of your JSON files? Or multiple?

▶️ If multiple, could you check the order that you're picking up the files with readdirSync(outputDir).map(...)? I guess there might be a file referencing this block ID in a Relationships, before the file that actually defines a block with "Id": "8cc77710-e530-4208-836e-45043dc93411"?


If you're seeing references to this ID split across multiple files, and the files are being picked up out of order, I suspect the problem is TRP.js not correctly parsing docs when the response chunks are provided out of order.

In that case, I agree we should find a solution in the library but possible temporary solutions could include:

  1. Fetching the JSONs directly from Textract by iterating the GetDocumentAnalysis command until NextToken is not set.
    • FWIW, my understanding (see the pricing page) is that more calls to GetDocumentAnalysis shouldn't be a price problem, but a quota problem: You still only analyzed the document once, you're just retrieving the result. The main concern would be the TPS limit on this API.
    • I appreciate it's not ideal because the response is only available through API for a limited time, versus indefinite S3 storage
    • What I've done in the past is to produce my own consolidated JSON dict by merging the in-order paged responses from GetDocumentAnalysis, like this (in Python). Alternatively you could just save the raw response JSONs with filenames that sort alphabetically?
  2. Trying to sort the chunks yourself (e.g. I think most blocks report a Page number now? Maybe you could sort the chunks by ascending order of their first Block's Page field?)
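The merging step in option 1 could be sketched roughly as below. The `PagedResponse` shape and `mergePagedResponses` name are illustrative only, not the library's API; the idea is just to concatenate the Blocks of the in-order pages into one consolidated response:

```typescript
// Illustrative sketch: merge in-order paged GetDocumentAnalysis responses
// into one consolidated response by concatenating their Blocks arrays.
// The types here are simplified stand-ins, not library types.
interface PagedResponse {
  Blocks?: { Id?: string }[];
  NextToken?: string;
  [key: string]: unknown;
}

function mergePagedResponses(pages: PagedResponse[]): PagedResponse {
  const [first, ...rest] = pages;
  const merged: PagedResponse = { ...first, Blocks: [...(first.Blocks ?? [])] };
  for (const page of rest) {
    merged.Blocks!.push(...(page.Blocks ?? []));
  }
  delete merged.NextToken; // the consolidated result is no longer paginated
  return merged;
}
```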

If you're only seeing 8cc77710-e530-4208-836e-45043dc93411 appear in one of your input JSON files, then something weirder is going on & probably a different bug with parsing logic...

glimmbo commented 2 months ago

Simply sorting the file paths of the output directory did the trick! The complete new TextractDocument now loads successfully.
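For anyone hitting the same thing: readdirSync returns names in lexicographic order, which puts "10" before "2" when the output objects are numbered. A minimal sketch of a numeric sort (the `sortOutputFiles` name is illustrative, assuming the file names are plain integers):

```typescript
// Illustrative helper: sort Textract output file names numerically, since
// the default lexicographic order puts "10" before "2", "9", etc.
function sortOutputFiles(fileNames: string[]): string[] {
  return [...fileNames].sort((a, b) => Number(a) - Number(b));
}
```

e.g. replacing `readdirSync(outputDir)` with `sortOutputFiles(readdirSync(outputDir))` in the loader above.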

I appreciate your detailed response @athewsey. Just to not leave your questions unanswered:

▶️ Does this block ID appear in only one of your JSON files? Or multiple?

Multiple: in numerically adjacent files (9 and 10), but not adjacent in the default sort order of readdirSync

▶️ If multiple, could you check the order that you're picking up the files with readdirSync(outputDir).map(...)? I guess there might be a file referencing this block ID in a Relationships, before the file that actually defines a block with "Id": "8cc77710-e530-4208-836e-45043dc93411"?

This was the solution, correcting the order of the array of output file paths ✅