Open athewsey opened 9 months ago
Good idea. Right now we have this: https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/blob/main/lambda/generatecsv/app/main.py#L152 When OUTPUT_TYPE is LINEARIZED, it outputs the text in reading order using the LAYOUT information. But the method was initially designed for CSV, and it grew too big through 'quick-fix' additions of transformations to other formats. We should extract that into its own package, imho. Open for discussion on how best to design that.
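For discussion, here is a minimal sketch of what a standalone linearization step could look like. It is not the code from the linked main.py. It assumes a Textract response dict produced with the LAYOUT feature, where `LAYOUT_*` blocks appear in reading order and reference their `LINE` blocks through `CHILD` relationships. The function name `linearize` is hypothetical.

```python
def linearize(textract_response: dict) -> str:
    """Join the text of LAYOUT_* blocks in the order they appear in the response.

    Assumption: LAYOUT blocks are listed in reading order and point to their
    LINE blocks via CHILD relationships.
    """
    blocks = textract_response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    paragraphs = []
    for block in blocks:
        if not block.get("BlockType", "").startswith("LAYOUT_"):
            continue
        # Collect the Ids of all CHILD blocks of this layout element.
        child_ids = [
            child_id
            for rel in (block.get("Relationships") or [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        # Keep only LINE children; container blocks such as LAYOUT_LIST are
        # skipped here because their nested layout blocks are visited on their own.
        lines = [
            by_id[child_id]["Text"]
            for child_id in child_ids
            if child_id in by_id and by_id[child_id].get("BlockType") == "LINE"
        ]
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)
```

Keeping this as a pure function of the response JSON would make it easy to move into its own package and reuse from any Lambda.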
With the Amazon Textract Response Parser (TRP) for JavaScript/TypeScript, we are working towards releasing functionality to convert a document to semantic HTML for LLMs, based on Amazon Textract's Layout analysis feature.
Setting up end-to-end processing pipelines with Textract tends to require cloud infrastructure (e.g. SNS notifications, rate limit management), and this repo is a preferred place to host constructs for IDP. Would it make sense for us to add a construct for a Lambda function (using TRP) that converts a Textract result JSON into an HTML file? Users could incorporate it into workflows such as document ingestion for RAG.
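To make the proposal concrete, here is a rough sketch of the kind of conversion such a Lambda could perform. It is written in plain Python with only the standard library rather than with TRP, so it does not show TRP's API. The tag mapping, the `textract_to_html` function, and the `event["textract"]` input shape are all assumptions for illustration.

```python
import html

# Hypothetical mapping from Textract layout block types to HTML tags.
TAG_BY_LAYOUT_TYPE = {
    "LAYOUT_TITLE": "h1",
    "LAYOUT_SECTION_HEADER": "h2",
    "LAYOUT_TEXT": "p",
    "LAYOUT_HEADER": "header",
    "LAYOUT_FOOTER": "footer",
}


def textract_to_html(textract_response: dict) -> str:
    """Render mapped LAYOUT_* blocks as simple semantic HTML, in response order."""
    blocks = textract_response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    parts = []
    for block in blocks:
        tag = TAG_BY_LAYOUT_TYPE.get(block.get("BlockType"))
        if tag is None:
            continue  # unmapped types (tables, figures, lists) are ignored in this sketch
        text = " ".join(
            by_id[child_id]["Text"]
            for rel in (block.get("Relationships") or [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
            if child_id in by_id and by_id[child_id].get("BlockType") == "LINE"
        )
        if text:
            # Escape the text so characters like & and < stay valid HTML.
            parts.append(f"<{tag}>{html.escape(text)}</{tag}>")
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(parts) + "\n</body></html>"


def handler(event, context):
    # Assumption: the Textract JSON is passed inline under event["textract"].
    # A real construct would more likely read it from S3 and write the HTML back.
    return {"html": textract_to_html(event["textract"])}
```

The resulting HTML string could then be chunked by heading for RAG ingestion.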