awslabs / project-lakechain

:zap: Cloud-native, AI-powered, document processing pipelines on AWS.
https://awslabs.github.io/project-lakechain/
Apache License 2.0

Feature request: Create Textract Middleware #46

Open HQarroum opened 1 month ago

HQarroum commented 1 month ago

Use case

Implement a middleware that exposes the Textract capabilities within a Lakechain document processing pipeline.

Solution/User Experience

Below is a tentative API design for this middleware.

Table data extraction.
Input(s): PDF, Images
Output(s): 'markdown' and/or 'text' and/or 'excel' and/or 'csv' and/or 'html'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    // Accepted values: 'markdown' | 'text' | 'excel' | 'csv' | 'html'.
    .withOutputType('markdown')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();
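
To illustrate what `withGroupOutput` could mean in practice, here is a minimal sketch, not the middleware's implementation. The `Table` type and the `toMarkdown` and `renderTables` helpers are hypothetical names invented for this example; a real implementation would derive the tables from the Textract response.

```typescript
// Hypothetical in-memory table shape (an assumption for this sketch).
type Table = { header: string[]; rows: string[][] };

// Render a single table as a Markdown table.
function toMarkdown(table: Table): string {
  const line = (cells: string[]) => `| ${cells.join(' | ')} |`;
  const separator = line(table.header.map(() => '---'));
  return [line(table.header), separator, ...table.rows.map((r) => line(r))].join('\n');
}

// Return the documents to emit: one per table, or a single document
// containing every table when `groupOutput` is true.
function renderTables(tables: Table[], groupOutput: boolean): string[] {
  const rendered = tables.map((t) => toMarkdown(t));
  if (!groupOutput) {
    return rendered;
  }
  return rendered.length > 0 ? [rendered.join('\n\n')] : [];
}

const tables: Table[] = [
  { header: ['Item', 'Price'], rows: [['Rent', '1200']] },
  { header: ['Name'], rows: [['Alice'], ['Bob']] },
];

const separate = renderTables(tables, false); // one document per table
const grouped = renderTables(tables, true);   // a single grouped document
```

With `false`, each table would become its own downstream document; with `true`, the tables would travel together as one document.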

Key-value pair extraction.
Input(s): PDF, Images
Output(s): 'json' | 'csv'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    // Accepted values: 'json' | 'csv'.
    .withOutputType('json')
    .build())
  .build();
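
As a rough sketch of how the two output types could differ, the snippet below serializes extracted pairs to JSON or CSV. The `KeyValue` shape, the `key,value` CSV header, and the `serialize` and `csvField` helpers are assumptions made for illustration; the issue does not specify the actual output layout.

```typescript
// Hypothetical shape for an extracted form field (an assumption).
type KeyValue = { key: string; value: string };

type KvOutputType = 'json' | 'csv';

// Quote a CSV field when it contains a comma, a quote, or a line break.
function csvField(field: string): string {
  return /[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
}

// Serialize the pairs according to the requested output type.
function serialize(pairs: KeyValue[], outputType: KvOutputType): string {
  if (outputType === 'json') {
    return JSON.stringify(pairs);
  }
  const rows = pairs.map((p) => `${csvField(p.key)},${csvField(p.value)}`);
  return ['key,value', ...rows].join('\n');
}

const pairs: KeyValue[] = [
  { key: 'Name', value: 'Doe, Jane' },
  { key: 'Rent', value: '1200' },
];

const asCsv = serialize(pairs, 'csv');
const asJson = serialize(pairs, 'json');
```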

Visualization task.
Input(s): PDF, Images
Output(s): One or more images

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis.
Input(s): PDF, Images
Output(s): CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID analysis.
Input(s): PDF, Images
Output(s): JSON, CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    // Accepted values: 'json' | 'csv'.
    .withOutputType('json')
    .build())
  .build();
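
The `'json' | 'csv'` notation in these tasks reads as a TypeScript union type listing the accepted values. A minimal sketch of how a builder could enforce it at compile time follows; `IdAnalysisTaskSketch`, `IdAnalysisTaskSketchBuilder`, and the `'json'` default are hypothetical and not part of the Lakechain API.

```typescript
// The output types accepted by the hypothetical task.
type IdOutputType = 'json' | 'csv';

// Hypothetical immutable task description produced by the builder.
class IdAnalysisTaskSketch {
  constructor(readonly outputType: IdOutputType) {}
}

// Hypothetical builder; each `with*` method returns `this` for chaining.
class IdAnalysisTaskSketchBuilder {
  private outputType: IdOutputType = 'json'; // assumed default

  withOutputType(outputType: IdOutputType): this {
    this.outputType = outputType;
    return this;
  }

  build(): IdAnalysisTaskSketch {
    return new IdAnalysisTaskSketch(this.outputType);
  }
}

const task = new IdAnalysisTaskSketchBuilder().withOutputType('csv').build();
const defaultTask = new IdAnalysisTaskSketchBuilder().build();

// The compiler rejects values outside the union:
// new IdAnalysisTaskSketchBuilder().withOutputType('xml');
```

Typing the parameter as a union lets invalid output types fail at compile time rather than at deployment.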

Layout analysis.
Input(s): PDF, Images
Output(s): PDF, Images + Metadata
Exports layout information as structured data in the document metadata.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();

Alternative solutions

No response

HQarroum commented 1 month ago

Adding support for optional custom text linearization functions, as per @mrtj's comment.

Text linearization function.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new TextLinearizationFunction.Builder()
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();
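
To make the effect of these prefixes and suffixes concrete, here is a minimal sketch of what a linearization step might produce. The `LinearizationOptions` interface, the `Field` type, the `linearize` function, and the one-pair-per-line layout are all assumptions for illustration, not the proposed implementation.

```typescript
// Hypothetical option bag mirroring the proposed builder methods.
interface LinearizationOptions {
  keyPrefix: string;
  keySuffix: string;
  valuePrefix: string;
  valueSuffix: string;
}

// Hypothetical shape for an extracted key-value pair.
type Field = { key: string; value: string };

// Flatten key-value pairs into text, one pair per line, wrapping
// each key and value in the configured prefix and suffix.
function linearize(fields: Field[], o: LinearizationOptions): string {
  return fields
    .map((f) => `${o.keyPrefix}${f.key}${o.keySuffix} ${o.valuePrefix}${f.value}${o.valueSuffix}`)
    .join('\n');
}

const text = linearize(
  [
    { key: 'Tenant', value: 'Jane Doe' },
    { key: 'Rent', value: '1200' },
  ],
  {
    keyPrefix: '<key>',
    keySuffix: '</key>',
    valuePrefix: '<value>',
    valueSuffix: '</value>',
  },
);
```

Marking keys and values explicitly in the text can make the structure easier for a downstream consumer, such as a language model, to recover.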

HTML linearization function.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new HtmlLinearizationFunction.Builder()
      .withTableCellHeaderPrefix('<td>')
      .withTableCellHeaderSuffix('</td>')
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();

@mrtj does this look good to you?