aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Instantiating a TextractDocument with a Textract result JSON blob incorrectly prints a "NextToken" warning #154

Closed jorelllinsangan closed 1 year ago

jorelllinsangan commented 1 year ago

We noticed in our project that the Textract response parser was logging a warning about possibly truncated content when instantiating a TextractDocument:

Provided Textract JSON contains a NextToken: Content may be truncated!

I took a closer look at the content we were trying to parse and saw that the NextToken key was present in the JSON blob but was simply set to null.

{
   ...
   "NextToken":null, 
   "StatusMessage":null,
   "Warnings":null
}

I found that the constructor of the TextractDocument simply checks if the key exists and logs the warning if it does.

    if ("NextToken" in this._dict) {
      console.warn(`Provided Textract JSON contains a NextToken: Content may be truncated!`);
    }

It should probably check both that the key exists and that its value is non-null.
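
As a minimal sketch of that check (the names `PagedResponseLike`, `hasNextToken` and `warnIfTruncated` are hypothetical and not part of the library, and this is not necessarily how the library's own fix is written), something like the following would treat a missing key and an explicit null the same way:

```typescript
// Hypothetical shape: only the field relevant to this issue is modelled.
interface PagedResponseLike {
  NextToken?: string | null;
}

// Hypothetical helper: true only when a real continuation token is present.
function hasNextToken(response: PagedResponseLike): boolean {
  // Loose `!= null` rejects both null and undefined, so an absent key and
  // `"NextToken": null` both count as "no further pages".
  return response.NextToken != null;
}

// Hypothetical helper: logs the existing warning only for a real token,
// and returns whether the warning was emitted.
function warnIfTruncated(response: PagedResponseLike): boolean {
  const truncated = hasNextToken(response);
  if (truncated) {
    console.warn("Provided Textract JSON contains a NextToken: Content may be truncated!");
  }
  return truncated;
}
```

With this, a blob like the one above (`"NextToken": null`) would not trigger the warning, while a response carrying an actual token string still would.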

athewsey commented 1 year ago

Hi & thanks for raising this!

As mentioned in the linked PR, there's an alpha version 0.3.1-alpha.1 now available on NPM with a draft fix. It would be great if you could try it out and let us know whether it resolves your issue?

If possible I'd also like to fix our TRP-side API type definitions at the same time, as null is unexpected here (per e.g. the GetDocumentAnalysis API doc). Could you confirm whether:

jorelllinsangan commented 1 year ago

Hi, thanks for looking into this!

  1. No. We don't have a data pipeline. The results are straight from Textract. We just specify our own location where the results should be written to.

  2. We actually chose not to use GetDocumentAnalysis. We manually download the analysis result from our S3 bucket.

athewsey commented 1 year ago

Thanks for the clarification, that's useful for finding examples.

I've just merged the linked PR and released v0.3.1 to NPM so closing this issue as I believe it's fixed - please do re-open if you find otherwise!