feat:Combine AsyncToJson and TextractGenericAsyncSfnTask

aws-samples / amazon-textract-idp-cdk-constructs

MIT No Attribution

feat:Combine AsyncToJson and TextractGenericAsyncSfnTask #27

Open schadem opened 1 year ago

schadem commented 1 year ago

On the other hand, why not make AsyncToJson an inherent function of TextractGenericSyncSfnTask instead of a separate Task? This way we always get consistent output. If we don’t want to delete the original JSONs, we can provide a prop retain_orig_resopnse = true/false that lets developers either retain or discard the original JSONs. IMO, we should probably implement the latter (so the output manifest in the previous point will always have output_path pointing to this consolidated JSON, plus the original output path if the retain_orig_resopnse prop is set to true). This applies to AnalyzeDoc & AnalyzeExpense Async, except that no post-processing is required for Expense since the JSON is always combined by Textract by default (afaik).

schadem commented 1 year ago

My first implementation actually had that, but with multi-page documents, the memory requirements for the Lambda to combine the JSON response become huge (a 250 MB JSON response for an 800-page document turns into a 1.5 GB memory requirement), which was quite a lot of memory for this function. So I moved it out to be more flexible on the memory config for the Lambda and to potentially route to different memory-optimized Lambdas based on the document size. Maybe creating a Construct that combines both would make sense.
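The size-based routing idea above could look roughly like the following. This is only a sketch: the function name `pick_memory_tier`, the tier values, and the linear scaling factor (derived from the single 250 MB to 1.5 GB data point in the comment) are assumptions for illustration, not part of the constructs.

```python
import math
from typing import Optional

# Assumption: peak memory grows roughly linearly with the size of the
# Textract JSON. The factor comes from the one data point in the thread
# (250 MB of JSON needing about 1.5 GB, taken here as 1500 MB).
EXPANSION_FACTOR = 1500 / 250

# Hypothetical memory settings (in MB) for separately configured Lambdas.
# 10240 MB is the maximum memory a Lambda function can be given.
MEMORY_TIERS_MB = [512, 2048, 10240]


def pick_memory_tier(json_size_mb: float) -> Optional[int]:
    """Return the smallest tier expected to fit, or None if none does."""
    needed_mb = math.ceil(json_size_mb * EXPANSION_FACTOR)
    for tier in MEMORY_TIERS_MB:
        if needed_mb <= tier:
            return tier
    return None  # too large for any tier; the workflow needs another path
```

A workflow could evaluate something like this before the combine step and branch to the matching Lambda, falling back to a different path when `None` is returned.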

anjanvb commented 1 year ago

I like the idea of a combined construct. If size is an issue, how about we do the AsyncToJson processing in the second Lambda, textract_async_sns_listener, which merely does logging and the callback to the Sfn workflow upon notification from Textract? We would still have two separate Lambdas, alleviating the memory concerns, and we would provide a memory prop on TextractGenericAsyncSfnTask that allows the user to set the memory for the textract_async_sns_listener Lambda.

We can either combine the logic of textract_async_sns_listener and async_to_json OR we can asynchronously call async_to_json from textract_async_sns_listener. I think I would prefer the former. The combined Lambda would call back to Sfn with the path for the final JSON for both AnalyzeDoc and AnalyzeExpense.
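For context, the combining step under discussion amounts to merging the paginated Textract responses into one JSON document. A minimal sketch, assuming each page is an already-parsed response dict with a `Blocks` list and an optional `NextToken`; the function name `merge_textract_pages` is hypothetical and this is not the actual async_to_json code:

```python
from typing import Any, Dict, List


def merge_textract_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge paginated Textract responses into a single response dict."""
    if not pages:
        raise ValueError("no paginated responses to merge")
    # Keep top-level fields (e.g. DocumentMetadata) from the first page,
    # dropping the pagination token and the per-page block list.
    merged = {
        key: value
        for key, value in pages[0].items()
        if key not in ("Blocks", "NextToken")
    }
    # Concatenate all blocks in page order. Every block is held in memory
    # at once, which is where the large memory requirement comes from.
    merged["Blocks"] = [
        block for page in pages for block in page.get("Blocks", [])
    ]
    return merged
```

Whichever Lambda runs this would need enough memory for the full merged document, regardless of whether it lives in the listener or in a separate function.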

schadem commented 1 year ago

The textract_async_sns_listener did have the memory issue. I could only combine the responses at the time of notification, and that SNS listener then became a big Lambda.

I was thinking of combining the two Constructs into one, combining TextractGenericAsyncSfnTask and AsyncToJson. Ultimately, customers may want to optimize cost and have different routing for different document sizes.

anjanvb commented 1 year ago

Got it. Well, I think in that case a combo Construct just adds to our maintenance overhead. If we simply keep them as two separate constructs, there's a lot more freedom for setting up custom workflows on an "as needed" basis.