Open HQarroum opened 3 months ago
Adding support for optional custom text linearization functions, as per @mrtj's comment.
Text linearization function.
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new TextLinearizationFunction.Builder()
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();
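To illustrate what the key and value markers above could produce, here is a minimal standalone sketch. It is not the Lakechain implementation: the `KeyValueMarkers` and `KeyValuePair` types and the `linearizeKeyValues` helper are hypothetical names that only mirror the builder options, and the sample pairs are made up for the example.

```typescript
// Hypothetical option bag mirroring the builder methods
// (withKeyPrefix, withKeySuffix, withValuePrefix, withValueSuffix).
// This is an assumption for illustration, not the Lakechain API.
interface KeyValueMarkers {
  keyPrefix: string;
  keySuffix: string;
  valuePrefix: string;
  valueSuffix: string;
}

// A key-value pair as a form extraction step might return it (assumed shape).
interface KeyValuePair {
  key: string;
  value: string;
}

// Wraps each key and each value in its markers, one pair per line.
function linearizeKeyValues(pairs: KeyValuePair[], markers: KeyValueMarkers): string {
  return pairs
    .map(({ key, value }) =>
      `${markers.keyPrefix}${key}${markers.keySuffix}` +
      `${markers.valuePrefix}${value}${markers.valueSuffix}`)
    .join('\n');
}

const markers: KeyValueMarkers = {
  keyPrefix: '<key>',
  keySuffix: '</key>',
  valuePrefix: '<value>',
  valueSuffix: '</value>',
};

// Sample data, invented for the example.
const output = linearizeKeyValues(
  [
    { key: 'Name', value: 'Jane Doe' },
    { key: 'Date', value: '2024-01-01' },
  ],
  markers,
);

console.log(output);
```

With markers like these, a downstream consumer such as an LLM prompt can tell keys from values in the flattened text without needing the original document layout.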
HTML linearization function.
const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new HtmlLinearizationFunction.Builder()
      .withTableCellHeaderPrefix('<td>')
      .withTableCellHeaderSuffix('</td>')
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();
@mrtj does this look good to you?
Use case
Implement a middleware that exposes the Textract capabilities within a Lakechain document processing pipeline.
Solution/User Experience
Below is a tentative API design for this middleware.
Alternative solutions
No response