aws-samples / amazon-textract-idp-cdk-constructs

MIT No Attribution

Is there a way for TextractGenericAsyncSfnTask to use "features" (e.g. FORMS)? #119

Open OperationalFallacy opened 6 months ago

OperationalFallacy commented 6 months ago

Here's a construct:

const textractAsyncTask = new TextractGenericAsyncSfnTask(
  this,
  "TextractAsync",
  {
    s3OutputBucket: this._outputBucket.bucketName,
    textractAPI: "GENERIC",
    s3TempOutputPrefix: s3TempOutputPrefix,
    integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,
    lambdaLogLevel: "DEBUG",
    taskTimeout: Timeout.duration(Duration.minutes(2)),
    input: TaskInput.fromObject({
      Token: JsonPath.taskToken,
      ExecutionId: JsonPath.stringAt("$$.Execution.Id"),
      Payload: JsonPath.entirePayload,
      FeatureTypes: ["TABLES", "FORMS"],
    }),
    resultPath: "$.textract_result",
  }
);

Looking at the props:

/** Which Textract API to call
 * ALL asynchronous Textract API calls are supported. Valid values are GENERIC | EXPENSE | LENDING.
 *
 * For GENERIC, when called without features (e. g. FORMS, TABLES, QUERIES, SIGNATURE), StartDetectText is called and only OCR is returned.
 * For GENERIC, when called with a feature (e. g. FORMS, TABLES, QUERIES, SIGNATURE),  StartAnalyzeDocument is called.
 * @default - GENERIC */
readonly textractAPI?: 'GENERIC' | 'EXPENSE' | 'LENDING';

But what does "when called with a feature" mean? I couldn't find a way to configure features through the construct props. Perhaps they are supposed to go somewhere else, such as a "manifest", but that isn't defined anywhere in the API. I tried reading the Lambda functions and gave up; the code there is not easy to follow.
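To make the question concrete, here is a minimal sketch of the dispatch that the doc comment seems to describe. It is not taken from the construct's Lambda code: `chooseAsyncOperation` and the type names are hypothetical, and only the four operation names are real asynchronous Textract operations (the doc comment's "StartDetectText" and "StartAnalyzeDocument" presumably refer to StartDocumentTextDetection and StartDocumentAnalysis).

```typescript
// Hypothetical sketch: function and type names are illustrative only and
// do NOT come from amazon-textract-idp-cdk-constructs.
type TextractApi = "GENERIC" | "EXPENSE" | "LENDING";
type Feature = "FORMS" | "TABLES" | "QUERIES" | "SIGNATURE";

// Names of the asynchronous Textract operations.
type AsyncOperation =
  | "StartDocumentTextDetection"
  | "StartDocumentAnalysis"
  | "StartExpenseAnalysis"
  | "StartLendingAnalysis";

// Pick the operation the way the textractAPI doc comment describes it:
// GENERIC without features means OCR only, GENERIC with features means analysis.
function chooseAsyncOperation(
  api: TextractApi,
  features: Feature[] = []
): AsyncOperation {
  if (api === "EXPENSE") {
    return "StartExpenseAnalysis";
  }
  if (api === "LENDING") {
    return "StartLendingAnalysis";
  }
  return features.length > 0
    ? "StartDocumentAnalysis"
    : "StartDocumentTextDetection";
}
```

If the construct works roughly like this, the open question is simply where the `features` list is supposed to come from, since no construct prop appears to carry it.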

OperationalFallacy commented 6 months ago

I found a few bugs. I assumed I could simply do this and it would work:

        input: TaskInput.fromObject({
          Token: JsonPath.taskToken,
          ExecutionId: JsonPath.stringAt("$$.Execution.Id"),
          Payload: JsonPath.entirePayload,
          textract_features: ["TABLES", "FORMS"],
        }),

I think decider_main.py ignores the features field, so I tried to deploy my own patched function.

Then I found that the TextractPOCDecider construct has a deciderFunction property for supplying a custom function, but the construct ignores it and always creates its own.
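For reference, this is the behavior I would expect from an optional function prop, as a minimal sketch. Everything here is a hypothetical stand-in: `FunctionLike`, `DeciderProps`, and `DeciderSketch` are not the aws-cdk-lib or construct types, and the default object stands in for whatever Lambda the construct creates on its own.

```typescript
// Hypothetical stand-ins: these are NOT the real aws-cdk-lib or
// amazon-textract-idp-cdk-constructs types.
interface FunctionLike {
  readonly name: string;
}

interface DeciderProps {
  /** Optional caller-supplied function, as the deciderFunction prop suggests. */
  readonly deciderFunction?: FunctionLike;
}

class DeciderSketch {
  readonly deciderFunction: FunctionLike;

  constructor(props: DeciderProps = {}) {
    // Honor the caller's function when one is given; only fall back to
    // creating a default when the prop is omitted.
    this.deciderFunction = props.deciderFunction ?? { name: "default-decider" };
  }
}
```

With that pattern, passing a patched function replaces the built-in one, and omitting the prop keeps the default.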

MichaelWalker-git commented 2 months ago

Hi @OperationalFallacy

Thank you for filing this bug. I apologize for the long delay. Would you mind making a pull request?