aws-solutions-library-samples / guidance-for-low-code-intelligent-document-processing-on-aws

This Guidance provides best practices for building and deploying an intelligent document processing (IDP) architecture that scales with workload demands.
https://aws.amazon.com/solutions/guidance/low-code-intelligent-document-processing-on-aws/
MIT No Attribution

Multi-Document (Batch) Processing with Dynamo and Group Summary #21

Open dannellyz opened 1 year ago

dannellyz commented 1 year ago

@schadem continuing the convo from re:Post.

The goal is to mimic some of the batch upload functionality that is available via the console, with some additional magic from the IDP ecosystem. The desired example stack would have the following features:

The example case could be something like a Batch W2 Audit Stack: it takes in a list of W2s, processes them as a group via Textract, runs them through post-processing augmentation, creates a DynamoDB record for each W2 with its extracted features and generated augmentation, and then runs group-level analyses on their collective information, summarizing the findings into a single CSV.
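A minimal sketch of what the per-document step could look like, assuming a hypothetical `W2AuditResults` DynamoDB table and the synchronous Textract `AnalyzeDocument` call with the FORMS feature (a real stack would more likely use the async APIs behind Step Functions):

```python
# Hypothetical sketch of the per-W2 step: extract FORMS key/value pairs with
# Textract and write one DynamoDB item per document. The table name, key schema,
# and use of the synchronous AnalyzeDocument API are assumptions for illustration.
import boto3

textract = boto3.client("textract")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("W2AuditResults")  # hypothetical table


def _text_for_block(block, blocks_by_id):
    """Concatenate the WORD children of a KEY or VALUE block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
                elif child["BlockType"] == "SELECTION_ELEMENT":
                    words.append(child["SelectionStatus"])
    return " ".join(words)


def extract_kv_pairs(response):
    """Turn Textract FORMS output into a {key_text: value_text} dict."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    kv = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = _text_for_block(block, blocks_by_id)
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        value_text = _text_for_block(blocks_by_id[value_id], blocks_by_id)
            if key_text:
                kv[key_text] = value_text
    return kv


def process_w2(bucket: str, key: str) -> None:
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    kv_pairs = extract_kv_pairs(response)
    # One item per W2; post-processing augmentation results could be merged in here too.
    table.put_item(Item={"document_id": f"s3://{bucket}/{key}", "fields": kv_pairs})
```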

Other future example improvements may include:

schadem commented 1 year ago

sounds good, couple of questions:

The W2 example maps nicely to a Key/Value list, which is already implemented in the sample that imports into a relational database.

What we already have is the ability to generate .csv output for Queries, Forms, and Tables.
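For reference, a hedged sketch of that kind of CSV generation, assuming the `amazon-textract-prettyprinter` helper package and its `get_string` API (the exact enum members and signature may differ by version):

```python
# Sketch only: render Textract FORMS and TABLES output as CSV text with the
# amazon-textract-prettyprinter helper package. The import path, enum names,
# and get_string signature are assumptions based on that package's documentation.
import json

from textractprettyprinter.t_pretty_print import (
    Pretty_Print_Table_Format,
    Textract_Pretty_Print,
    get_string,
)

with open("textract_output.json") as f:  # hypothetical saved Textract response
    textract_json = json.load(f)

csv_text = get_string(
    textract_json=textract_json,
    output_type=[Textract_Pretty_Print.FORMS, Textract_Pretty_Print.TABLES],
    table_format=Pretty_Print_Table_Format.csv,
)

with open("extracted.csv", "w") as f:
    f.write(csv_text)
```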

dannellyz commented 1 year ago

@schadem Those points all make sense, and the CSV was just an example output for the summarized table data. I guess the main workflow difference from what already exists would just be the bulk input handling. So to that end, maybe the simplified IDP-specific focus would be

The generic custom Lambdas to fill in the gaps would be

schadem commented 1 year ago

Got it. The main difference is essentially that the documents are already on S3 and could be at different buckets/prefixes, and the KV pairs should end up in DDB. The current workflow samples trigger when a new object is put at one location. Yeah, I could build that.

Different options. The current implementation would also allow for a manifest file to be put into the DocumentUpload location. Essentially we would just need a script to create those manifest files, but also configure the permissions to read from those other buckets/prefixes: essentially create roles for those, or grant a wildcard "*" (which is obviously not best practice).
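A rough sketch of such a script, assuming a hypothetical manifest layout (`s3Path`, `featureTypes`) and a hypothetical DocumentUpload bucket/prefix; the real manifest schema expected by the workflow would have to be matched, and the pipeline's role would still need read access scoped to the listed source buckets/prefixes rather than a wildcard:

```python
# Sketch only: enumerate existing documents under several source buckets/prefixes
# and drop one manifest file per document into the DocumentUpload location so the
# existing trigger picks them up. The manifest field names (s3Path, featureTypes)
# and the destination bucket/prefix are hypothetical.
import json

import boto3

s3 = boto3.client("s3")

SOURCES = [
    ("my-source-bucket-a", "w2s/2023/"),   # hypothetical source locations
    ("my-source-bucket-b", "uploads/w2/"),
]
UPLOAD_BUCKET = "my-document-upload-bucket"  # hypothetical DocumentUpload bucket
UPLOAD_PREFIX = "manifests/"

paginator = s3.get_paginator("list_objects_v2")
for bucket, prefix in SOURCES:
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            manifest = {
                "s3Path": f"s3://{bucket}/{obj['Key']}",
                "featureTypes": ["FORMS"],
            }
            manifest_key = f"{UPLOAD_PREFIX}{obj['Key'].replace('/', '_')}.json"
            s3.put_object(
                Bucket=UPLOAD_BUCKET,
                Key=manifest_key,
                Body=json.dumps(manifest).encode("utf-8"),
            )
```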

Or we trigger from an SQS queue, for example, instead of the S3 PUT, and then submit all the S3 locations to that queue. That still requires the permissions, obviously.
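A minimal sketch of that SQS variant, assuming a hypothetical queue URL and a simple message body carrying the S3 location:

```python
# Sketch only: instead of relying on S3 PUT events, enqueue one SQS message per
# existing document; a queue-triggered Lambda would then start the Textract
# workflow for each message. The queue URL and message shape are hypothetical.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/idp-document-queue"  # hypothetical

documents = [
    "s3://my-source-bucket-a/w2s/2023/employee-001.pdf",  # hypothetical objects
    "s3://my-source-bucket-b/uploads/w2/employee-002.pdf",
]

for s3_path in documents:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3Path": s3_path}),
    )
```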

dannellyz commented 1 year ago

I think either of those options works well. The tie-in at the end would be the ability to interact with all of the returned JSON objects together, so maybe the DynamoDB step could move to the end.
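A rough sketch of what that final group-level step could look like, assuming the hypothetical `W2AuditResults` table from the earlier sketch, with one item per W2 and a `fields` map of extracted key/value pairs:

```python
# Sketch only: after every W2 has its own DynamoDB item, read them all back and
# summarize the collective results into a single CSV. The table name and item
# shape (document_id plus a "fields" map) are hypothetical.
import csv

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("W2AuditResults")  # hypothetical table

# Collect every item (a paginated Scan; fine for a batch-sized table).
items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# The union of all extracted field names becomes the CSV header.
field_names = sorted({name for item in items for name in item.get("fields", {})})

with open("w2_group_summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["document_id", *field_names])
    writer.writeheader()
    for item in items:
        writer.writerow({"document_id": item["document_id"], **item.get("fields", {})})
```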