aws-samples / amazon-textract-idp-cdk-constructs

MIT No Attribution

[Feature proposal] Lambda to convert Textract result (on S3?) to HTML (on S3?) #114

Open athewsey opened 9 months ago

athewsey commented 9 months ago

With Amazon Textract Response Parser for JavaScript/TypeScript, we are working on releasing functionality to convert a document to semantic HTML for LLMs, based on Amazon Textract's Layout analysis feature.

Since setting up end-to-end processing pipelines with Textract tends to require cloud infrastructure (e.g. SNS notifications, rate limit management, etc.), and this repo is a preferred place to host constructs for IDP, would it make sense for us to add a construct for a Lambda function (using TRP) that converts a Textract result JSON into an HTML file? Users could incorporate it into workflows such as document ingestion for RAG.
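
To make the idea concrete, here is a rough sketch of the core conversion step such a Lambda might perform. It is written in plain Python with only the standard library and is not TRP's actual implementation: the tag mapping `LAYOUT_TAGS` and the function names are hypothetical, and it assumes the `Blocks` list of a Layout-enabled Textract response already arrives in reading order. Reading the JSON from S3 and writing the HTML back are left out.

```python
from html import escape

# Hypothetical mapping from Textract LAYOUT block types to HTML tags.
# Other layout types (tables, figures, key-values) are ignored in this sketch.
LAYOUT_TAGS = {
    "LAYOUT_TITLE": "h1",
    "LAYOUT_SECTION_HEADER": "h2",
    "LAYOUT_TEXT": "p",
    "LAYOUT_HEADER": "header",
    "LAYOUT_FOOTER": "footer",
}


def child_ids(block):
    """Return the IDs listed in the block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def block_text(block, by_id):
    """Join the text of a layout block's LINE children."""
    children = [by_id[i] for i in child_ids(block) if i in by_id]
    return " ".join(c["Text"] for c in children if c.get("BlockType") == "LINE")


def textract_to_html(response):
    """Render the LAYOUT blocks of a Textract response as simple HTML."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    # Children of a LAYOUT_LIST are rendered as <li>, not as top-level <p>.
    nested = set()
    for b in blocks:
        if b.get("BlockType") == "LAYOUT_LIST":
            nested.update(child_ids(b))
    parts = []
    for b in blocks:
        btype = b.get("BlockType")
        if b["Id"] in nested:
            continue
        if btype == "LAYOUT_LIST":
            items = [by_id[i] for i in child_ids(b) if i in by_id]
            lis = "".join(f"<li>{escape(block_text(it, by_id))}</li>" for it in items)
            parts.append(f"<ul>{lis}</ul>")
        elif btype in LAYOUT_TAGS:
            tag = LAYOUT_TAGS[btype]
            parts.append(f"<{tag}>{escape(block_text(b, by_id))}</{tag}>")
    return "<html><body>\n" + "\n".join(parts) + "\n</body></html>"
```

A Lambda handler would wrap `textract_to_html` with the S3 download and upload; a real construct would presumably delegate the conversion to TRP rather than reimplement it like this.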

schadem commented 9 months ago

Good idea. Right now we have this: https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/blob/main/lambda/generatecsv/app/main.py#L152. When OUTPUT_TYPE is LINEARIZED, it outputs the text in reading order using the LAYOUT information. But the method was initially designed for CSV, and it grew too big with 'quick-fix' additions of transformations to other formats. We should extract that into its own package, imho. Open for discussion on how best to design that.
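
One possible shape for such a package, purely as a starting point for discussion: a small registry where each output format is its own function and a dispatcher picks one by name. Everything here is hypothetical (the `register` decorator, the `convert` function, and the `LINES` format name); only `LINEARIZED` comes from the existing Lambda, and this sketch does not reproduce its actual logic. It assumes the layout blocks in the response are already in reading order.

```python
from typing import Callable, Dict

Formatter = Callable[[dict], str]
FORMATTERS: Dict[str, Formatter] = {}  # hypothetical registry: output type -> function


def register(output_type: str):
    """Decorator that adds a formatter to the registry under an upper-cased name."""
    def decorator(func: Formatter) -> Formatter:
        FORMATTERS[output_type.upper()] = func
        return func
    return decorator


@register("LINES")
def to_lines(response: dict) -> str:
    """Plain text: one LINE block per output line, in response order."""
    return "\n".join(
        b["Text"] for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"
    )


@register("LINEARIZED")
def to_linearized(response: dict) -> str:
    """One paragraph per LAYOUT_* block, built from its LINE children."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    paragraphs = []
    for b in blocks:
        if not b.get("BlockType", "").startswith("LAYOUT_"):
            continue
        ids = [
            i
            for rel in b.get("Relationships") or []
            if rel.get("Type") == "CHILD"
            for i in rel.get("Ids", [])
        ]
        text = " ".join(
            by_id[i]["Text"]
            for i in ids
            if i in by_id and by_id[i].get("BlockType") == "LINE"
        )
        if text:  # skips containers such as LAYOUT_LIST, whose children are not LINEs
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


def convert(response: dict, output_type: str) -> str:
    """Dispatch to the formatter registered for output_type."""
    try:
        formatter = FORMATTERS[output_type.upper()]
    except KeyError:
        raise ValueError(f"Unsupported output type: {output_type}") from None
    return formatter(response)
```

With this layout, the Lambda would only need to read OUTPUT_TYPE and call `convert`, and adding HTML or CSV would mean registering another function instead of growing one method.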