Support for Textract async uploads

meketadev commented 4 years ago

Describe the bug
After installing the Predictions plugin to identify text in documents and uploading a PDF, an error occurs: "Error: Unsupported document format".

To Reproduce
With a new Ionic project:
1. ionic start blank (add main and polyfill code)
2. amplify init
3. amplify add auth (defaults)
4. amplify add predictions -> identify -> Identify Text -> "Would you also like to identify documents?" Yes -> Auth and Guest users
5. Try to identify a PDF
6. See the error

import { Component } from '@angular/core';
import Predictions from '@aws-amplify/predictions';

@Component({
  selector: 'app-home',
  templateUrl: 'home.page.html',
  styleUrls: ['home.page.scss'],
})
export class HomePage {

  constructor() {}

  onFileChange(event) {
    const file = event.target.files[0];
    console.log(file)
    Predictions.identify({
      text: {
        source: {
          file
        },
        format: "ALL"
      }
    }).then(event => {
      console.log(event)
    })
  }

}

Expected behavior
Textract has an async function (start_document_analysis) for dealing with PDFs. I do not believe the current implementation allows for this. However, the "Would you also like to identify documents?" option reads like a question meant to allow for it.

Additional context
I'm not sure if this creates a Lambda function to handle the call. I can't seem to find the resource to switch to the async function.

elorzafe commented 4 years ago

@meketadev Predictions category currently supports synchronous operations for Amazon Textract.

The Amplify CLI question you mentioned, "Would you also like to identify documents?", enables Amazon Textract, which detects text in document-like images (for example, a picture of a document that contains many words) rather than a street sign or other images that contain only a few words.

manueliglesias commented 4 years ago

@meketadev

We'll be flagging this as a feature request. At the moment, Amplify's Predictions support for text recognition only works with the synchronous operations, as you pointed out.

boogietimeproductions commented 3 years ago

Any update on this request?

The wording of the documentation is somewhat confusing: https://docs.amplify.aws/lib/predictions/identify-text/q/platform/js

"Services used: Amazon Rekognition (default for plain text) and Amazon Textract (default for documents)"

lucasforbes commented 2 years ago

Any update on being able to upload PDFs via amplify Predictions?

DarylBeattie commented 1 year ago

I really wish this worked, but seeing as it looks like a very simple fix and it's been 3.5 years, I suppose we shouldn't hold our breath.

I guess the best course of action here is to update the documentation to say it isn't supported, so that others are not led down the wrong path.

nadetastic commented 1 year ago

Hi @DarylBeattie, @lucasforbes, following up on this issue.

I tried this out and I'm able to upload PDFs via Amplify Predictions, at least with v5.3.3 of amplify-js, and extract the data within the PDF.

Here's a snippet of what works for me:

try {
  const response = await Predictions.identify({
    text: {
      source: { file },
      format: "FORM", 
    }
  })
  console.log(response)
} catch(e){
  console.log(e)
}

I've set up a full sample at https://github.com/nadetastic/vite-amplified-react/tree/4913

I'm curious to see what errors you are facing when you try this.

DarylBeattie commented 1 year ago

Ah, sorry, but... it doesn't work for me at all. I'm using v5.3.10 of amplify-js, but in React Native. Also, you were using format "FORM"; I've tried all the formats and none work. You are also passing in a Blob from the web; I tried passing in a Blob as the source too, and that didn't work. I also tried passing in an S3 key and level, and that didn't work either.

The error I am facing is simply that I get this result:

{"err": [UnsupportedDocumentException: Request has unsupported document format]}

or, if passing in format: "PLAIN", I get this:

{"response": undefined}
InvalidImageFormatException: Request has invalid image format

(I believe here it's just assuming an image.)

nadetastic commented 1 year ago

OK, thanks for confirming that you are using React Native. There are some known issues with React Native and the Predictions category. One potential workaround is to upload the file as a byte array.

Could you take a look at this comment from a related issue and see if a similar workaround resolves this for you?

Specifically, try converting the PDF blob to a byte array before running Predictions.identify() on it:


await Predictions.identify({
    text: { 
        source: { bytes },
        format: "FORM",
    } 
})
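
For readers trying the suggestion above, the conversion step might look roughly like the sketch below. It is not taken from the linked comment: `BlobLike`, `blobToBytes`, and `looksLikePdf` are hypothetical names, and the sketch assumes the runtime implements `arrayBuffer()` on blobs, which may not hold on every React Native version. The signature check only rules out a corrupted conversion; it does not by itself address the UnsupportedDocumentException.

```typescript
// Minimal structural type so the sketch does not depend on a global Blob.
// Assumption: the runtime's Blob implements arrayBuffer().
interface BlobLike {
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Hypothetical helper: read the whole blob into a byte array.
async function blobToBytes(blob: BlobLike): Promise<Uint8Array> {
  return new Uint8Array(await blob.arrayBuffer());
}

// Hypothetical helper: true when the data starts with the PDF signature "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= signature.length &&
    signature.every((value, index) => bytes[index] === value)
  );
}

// Possible usage, mirroring the call shown in this thread (not executed here):
//   const bytes = await blobToBytes(pdfBlob);
//   if (!looksLikePdf(bytes)) console.warn("Converted data is not a PDF");
//   await Predictions.identify({ text: { source: { bytes }, format: "FORM" } });
```

Checking the first bytes before calling Predictions.identify() helps separate "the conversion mangled the file" from "the service rejected a valid PDF". Which byte types `source.bytes` accepts may vary by Amplify version, so that part is worth verifying against your installed typings.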

DarylBeattie commented 1 year ago

Okay, so I tried uploading my PDF as a byte array, after converting the Blob to a byte array, and I still get the following error:

[UnsupportedDocumentException: Request has unsupported document format]

ovigio commented 1 year ago

How large is the PDF you're uploading? An alternative/workaround would be to convert the pdf pages to images.

DarylBeattie commented 1 year ago

It's 63kb. I've also tried with other sizes up to 700kb, it's all the same. Other readers read them all just fine.

There are many alternatives/workarounds -- and I've already implemented one. But that's not the point: the point is that this doesn't work, and thus this amplify-issue exists and should be fixed.

ovigio commented 1 year ago

I wouldn't say it's exactly an issue, but more of a feature gap: Amplify does not support the asynchronous Textract API, which accepts PDF documents. I don't work on Amplify, but this would be a feature request to support two APIs, one that calls the start analysis API (StartDocumentAnalysis) and another that uses the job ID from that API to get the analysis (GetDocumentAnalysis).
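
The two-step flow described above can be sketched as follows. This is an illustration under stated assumptions, not Amplify's implementation: `AsyncTextractClient` is a hypothetical minimal interface whose request and response fields mirror the Textract StartDocumentAnalysis and GetDocumentAnalysis operations, and `analyzePdf`, `pollMs`, and `maxPolls` are made-up names. In a real Lambda you would back the interface with an actual Textract client.

```typescript
type JobStatus = "IN_PROGRESS" | "SUCCEEDED" | "FAILED" | "PARTIAL_SUCCESS";

interface AnalysisPage {
  JobStatus: JobStatus;
  Blocks?: unknown[];
  NextToken?: string;
}

// Hypothetical minimal client; field names follow the Textract operations.
interface AsyncTextractClient {
  startDocumentAnalysis(input: {
    DocumentLocation: { S3Object: { Bucket: string; Name: string } };
    FeatureTypes: string[];
  }): Promise<{ JobId: string }>;
  getDocumentAnalysis(input: {
    JobId: string;
    NextToken?: string;
  }): Promise<AnalysisPage>;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Start an analysis job for a PDF stored in S3, poll until the job leaves
// IN_PROGRESS, then collect the blocks from every result page.
async function analyzePdf(
  client: AsyncTextractClient,
  bucket: string,
  key: string,
  pollMs = 2000,
  maxPolls = 30
): Promise<unknown[]> {
  const { JobId } = await client.startDocumentAnalysis({
    DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["FORMS", "TABLES"],
  });

  let page = await client.getDocumentAnalysis({ JobId });
  for (let polls = 0; page.JobStatus === "IN_PROGRESS"; polls++) {
    if (polls >= maxPolls) throw new Error(`Job ${JobId} timed out`);
    await sleep(pollMs);
    page = await client.getDocumentAnalysis({ JobId });
  }
  if (page.JobStatus === "FAILED") throw new Error(`Job ${JobId} failed`);

  // Results are paginated; follow NextToken until it is absent.
  const blocks: unknown[] = [...(page.Blocks ?? [])];
  while (page.NextToken) {
    page = await client.getDocumentAnalysis({ JobId, NextToken: page.NextToken });
    blocks.push(...(page.Blocks ?? []));
  }
  return blocks;
}
```

Polling keeps the sketch short. Splitting the start and get steps into separate endpoints, as suggested above, avoids holding a request open while the job runs.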

I find that for my use case as well, I cannot directly use the Amplify Predictions module, because I need to add some rate limiting (usage) to that functionality. Having the client make calls to an expensive operation is a no-no for me without some form of usage control (rate limit).

So I end up just implementing a Lambda function that I hook up to an API that I can then use with Amplify's API module.

You could do something similar in your case: have a Lambda that starts the analysis and gets the document analysis, and expose both as API endpoints.
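
As a rough illustration of the usage control mentioned above, here is a minimal fixed-window limiter keyed by user ID. `FixedWindowLimiter` and `tryAcquire` are hypothetical names, and the in-memory map is an assumption made for brevity: Lambda instances do not share memory, so a real deployment would need a shared store or something like API Gateway usage plans.

```typescript
// Fixed-window limiter: at most `limit` calls per user in each window.
// In-memory only; state is lost on cold start and not shared across instances.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true and records the call if the user still has quota left.
  tryAcquire(userId: string): boolean {
    const time = this.now();
    let current = this.windows.get(userId);
    if (!current || time - current.start >= this.windowMs) {
      // Start a fresh window for this user.
      current = { start: time, count: 0 };
      this.windows.set(userId, current);
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Possible usage inside a handler (not executed here):
//   if (!limiter.tryAcquire(userId)) return { statusCode: 429, body: "Too many requests" };
```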

ovigio commented 1 year ago

Can you please share what work-around you have implemented (might help someone else)?

DarylBeattie commented 12 months ago

What I meant was that there are other ways to read PDFs. In my case, I do it in a Lambda function using a PDF reader included in the server library I'm using. Also, if you're using React Native (like I am), you might try reading the PDF on the client, i.e. on-device. Why pay for AWS compute services to read the PDF if you're running on your user's device anyway? Use their phone's powerful CPU to do the work for you.

For most people, if Textract doesn't work through Amplify (which it clearly does not in some cases), you could write a Lambda function to send your PDF to Textract and return the result. Basically, you could implement what Amplify should be doing for you but currently doesn't.

aws-amplify / amplify-js

Support for Textract async uploads #4913