aws-solutions-library-samples / guidance-for-low-code-intelligent-document-processing-on-aws

This Guidance provides best practices for building and deploying an intelligent document processing (IDP) architecture that scales with workload demands.
https://aws.amazon.com/solutions/guidance/low-code-intelligent-document-processing-on-aws/
MIT No Attribution

Multi-Document (Batch) Processing with Dynamo and Group Summary #21

Open dannellyz opened 1 year ago

dannellyz commented 1 year ago

@schadem continuing the convo from re:Post.

The goal is to mimic some of the batch upload functionality that is available via the console, with some additional magic from the IDP ecosystem. The desired example stack would have the following features:

The example case could be something like a Batch W2 Audit Stack: it takes in a list of W2s, processes them as a group via Textract, runs them through post-processing augmentation, creates a DynamoDB record for each W2 with its extracted features and generated augmentation, and then runs group-level analyses on their collective information, summarizing the findings into a single CSV.
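A minimal sketch of what the per-document step could look like, assuming a hypothetical `W2AuditResults` DynamoDB table and the synchronous Textract `AnalyzeDocument` call with the FORMS feature (a real stack would more likely use the async APIs behind Step Functions):

```python
# Hypothetical sketch of the per-W2 step: extract FORMS key/value pairs with
# Textract and write one DynamoDB item per document. The table name, key schema,
# and use of the synchronous AnalyzeDocument API are assumptions for illustration.
import boto3

textract = boto3.client("textract")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("W2AuditResults")  # hypothetical table


def _text_for_block(block, blocks_by_id):
    """Concatenate the WORD children of a KEY or VALUE block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
                elif child["BlockType"] == "SELECTION_ELEMENT":
                    words.append(child["SelectionStatus"])
    return " ".join(words)


def extract_kv_pairs(response):
    """Turn Textract FORMS output into a {key_text: value_text} dict."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    kv = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = _text_for_block(block, blocks_by_id)
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        value_text = _text_for_block(blocks_by_id[value_id], blocks_by_id)
            if key_text:
                kv[key_text] = value_text
    return kv


def process_w2(bucket: str, key: str) -> None:
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    kv_pairs = extract_kv_pairs(response)
    # One item per W2; post-processing augmentation results could be merged in here too.
    table.put_item(Item={"document_id": f"s3://{bucket}/{key}", "fields": kv_pairs})
```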

Other future example improvements may include:

schadem commented 1 year ago

sounds good, couple of questions:

The W2 example maps nicely to a Key/Value list, which is already implemented in the sample that imports into a relational database.

What we already have is the ability to generate .csv output for Queries, Forms, and Tables.
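For reference, a hedged sketch of that kind of CSV generation, assuming the `amazon-textract-prettyprinter` helper package and its `get_string` API (the exact enum members and signature may differ by version):

```python
# Sketch only: render Textract FORMS and TABLES output as CSV text with the
# amazon-textract-prettyprinter helper package. The import path, enum names,
# and get_string signature are assumptions based on that package's documentation.
import json

from textractprettyprinter.t_pretty_print import (
    Pretty_Print_Table_Format,
    Textract_Pretty_Print,
    get_string,
)

with open("textract_output.json") as f:  # hypothetical saved Textract response
    textract_json = json.load(f)

csv_text = get_string(
    textract_json=textract_json,
    output_type=[Textract_Pretty_Print.FORMS, Textract_Pretty_Print.TABLES],
    table_format=Pretty_Print_Table_Format.csv,
)

with open("extracted.csv", "w") as f:
    f.write(csv_text)
```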

dannellyz commented 1 year ago

@schadem Those points all make sense, and the CSV was just an example output for the summarized table data. I guess the main workflow difference from what already exists would just be the bulk input handling. So to that end, maybe the simplified IDP-specific focus would be

The generic custom Lambdas to fill in the gaps would be

schadem commented 1 year ago

Got it. The main difference is essentially that the documents are already on S3 and could be at different buckets/prefixes, and the KV pairs should end up in DDB. The current workflow samples trigger when a new object is put at one location. Yeah, I could build that.

Different options. The current implementation would also allow for a manifest file to be put into the DocumentUpload location. Essentially we would just need a script to create those manifest files, but also configure the permissions to read from those other buckets/prefixes: essentially create roles for those, or grant a wildcard "*" (which is obviously not best practice).
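A rough sketch of such a script, assuming a hypothetical manifest layout (`s3Path`, `featureTypes`) and a hypothetical DocumentUpload bucket/prefix; the real manifest schema expected by the workflow would have to be matched, and the pipeline's role would still need read access scoped to the listed source buckets/prefixes rather than a wildcard:

```python
# Sketch only: enumerate existing documents under several source buckets/prefixes
# and drop one manifest file per document into the DocumentUpload location so the
# existing trigger picks them up. The manifest field names (s3Path, featureTypes)
# and the destination bucket/prefix are hypothetical.
import json

import boto3

s3 = boto3.client("s3")

SOURCES = [
    ("my-source-bucket-a", "w2s/2023/"),   # hypothetical source locations
    ("my-source-bucket-b", "uploads/w2/"),
]
UPLOAD_BUCKET = "my-document-upload-bucket"  # hypothetical DocumentUpload bucket
UPLOAD_PREFIX = "manifests/"

paginator = s3.get_paginator("list_objects_v2")
for bucket, prefix in SOURCES:
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            manifest = {
                "s3Path": f"s3://{bucket}/{obj['Key']}",
                "featureTypes": ["FORMS"],
            }
            manifest_key = f"{UPLOAD_PREFIX}{obj['Key'].replace('/', '_')}.json"
            s3.put_object(
                Bucket=UPLOAD_BUCKET,
                Key=manifest_key,
                Body=json.dumps(manifest).encode("utf-8"),
            )
```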

Or we trigger from an SQS queue, for example, instead of the S3 PUT, and then submit all the S3 locations to that queue. That still requires the permissions, obviously.
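A minimal sketch of that SQS variant, assuming a hypothetical queue URL and a simple message body carrying the S3 location:

```python
# Sketch only: instead of relying on S3 PUT events, enqueue one SQS message per
# existing document; a queue-triggered Lambda would then start the Textract
# workflow for each message. The queue URL and message shape are hypothetical.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/idp-document-queue"  # hypothetical

documents = [
    "s3://my-source-bucket-a/w2s/2023/employee-001.pdf",  # hypothetical objects
    "s3://my-source-bucket-b/uploads/w2/employee-002.pdf",
]

for s3_path in documents:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3Path": s3_path}),
    )
```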

dannellyz commented 1 year ago

I think either of those options works well. The tie-in at the end would be the ability to interact with all of the returned JSON objects together, so maybe the DynamoDB step could move to the end.
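A rough sketch of what that final group-level step could look like, assuming the hypothetical `W2AuditResults` table from the earlier sketch, with one item per W2 and a `fields` map of extracted key/value pairs:

```python
# Sketch only: after every W2 has its own DynamoDB item, read them all back and
# summarize the collective results into a single CSV. The table name and item
# shape (document_id plus a "fields" map) are hypothetical.
import csv

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("W2AuditResults")  # hypothetical table

# Collect every item (a paginated Scan; fine for a batch-sized table).
items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# The union of all extracted field names becomes the CSV header.
field_names = sorted({name for item in items for name in item.get("fields", {})})

with open("w2_group_summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["document_id", *field_names])
    writer.writeheader()
    for item in items:
        writer.writerow({"document_id": item["document_id"], **item.get("fields", {})})
```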