The main goal of this release (per #164) was to add Textract Layout support and particularly to support rendering documents to semantic markup/markdown, for use with text-only GenAI foundation models like Titan/Claude/Llama/etc on Amazon Bedrock.
Unfortunately, that turned out to be more work than expected, and I got distracted by a couple of other outstanding tasks 😅
Positioning this as a major release because of the possibility of breaking changes in some edge cases, but tentatively expecting a very easy upgrade for regular users.
(Details are also reflected in the draft `CHANGELOG.md`)
- Add support for Textract Layout analysis and rendering documents to semantic HTML
- Add/fix support for `TABLE_FOOTER` and `TABLE_TITLE` elements that weren't fully linked through before
- Add support for `SIGNATURE` detection results
- Add some missing API models and utility functions to the top-level exports of the library, and improve docstrings on top-level exports generally
- Significantly refactor mixins & base classes to try to reduce the fragile class hierarchy and internal tracking state
  - More use of mixins, rather than abstract base classes
  - `WithContent` mixin more generalized to support the Signature and Layout features
  - `Page`s now keep a register of parsed objects by block ID, and this is used to reduce state in collection classes. For example, rather than `Line` keeping its own internal list of `_words`, it can just re-traverse the child relationships on its block and fetch the parsed objects on demand.
  - BREAKING(?): As part of the refactor, `CellBase` became pretty redundant, so it is no longer exported... but I don't believe this would affect users anyway
  - BREAKING(?): This different approach to state tracking introduces minor differences in when and how warnings and errors are triggered for invalid or incomplete Textract JSON (e.g. relationships to missing block IDs, or unexpected block types). I believe these will only affect edge cases, since most users should be working with actual valid API results.
- Fixed `Table.nCells` to report the number of separate cells after merges are considered, not just the number of sub-cells (== nRows * nColumns), which is not useful
- Support alternative `KEY` and `VALUE` blocks for Forms K-V data, observed in place of the typical `KEY_VALUE_SET` blocks for some test data files
  - Was this a temporary API issue? A change going forward? Not quite sure... As far as I was aware, the API should still be outputting `KEY_VALUE_SET`.
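To illustrate the state-tracking change described above, here is a minimal sketch of a page-level register keyed by block ID, with a line that resolves its words on demand instead of caching them. The names `BlockLike`, `WordSketch`, `PageSketch`, `LineSketch`, `register`, `getItemByBlockId` and `listWords` are hypothetical and chosen for illustration only; they are not the library's actual classes or signatures.

```typescript
// Minimal shape of a Textract block: only the fields this sketch needs.
interface BlockLike {
  Id: string;
  BlockType: string;
  Text?: string;
  Relationships?: { Type: string; Ids: string[] }[];
}

// Hypothetical parsed wrapper for a WORD block.
class WordSketch {
  readonly block: BlockLike;
  constructor(block: BlockLike) {
    this.block = block;
  }
  get text(): string {
    return this.block.Text || "";
  }
}

// Hypothetical page holding the register of parsed objects, keyed by block ID.
class PageSketch {
  private readonly parsedById = new Map<string, WordSketch | LineSketch>();

  register(item: WordSketch | LineSketch): void {
    this.parsedById.set(item.block.Id, item);
  }

  getItemByBlockId(id: string): WordSketch | LineSketch {
    const item = this.parsedById.get(id);
    if (!item) {
      throw new Error(`No parsed object registered for block ID ${id}`);
    }
    return item;
  }
}

// Hypothetical line that stores no word list of its own.
class LineSketch {
  readonly block: BlockLike;
  private readonly page: PageSketch;
  constructor(block: BlockLike, page: PageSketch) {
    this.block = block;
    this.page = page;
  }

  // Re-traverse the CHILD relationships and look each word up in the register.
  listWords(): WordSketch[] {
    const words: WordSketch[] = [];
    for (const rel of this.block.Relationships || []) {
      if (rel.Type !== "CHILD") continue;
      for (const id of rel.Ids) {
        const item = this.page.getItemByBlockId(id);
        if (item instanceof WordSketch) words.push(item);
      }
    }
    return words;
  }
}

// Usage: register two words, then resolve them through the line.
const page = new PageSketch();
page.register(new WordSketch({ Id: "w1", BlockType: "WORD", Text: "Hello" }));
page.register(new WordSketch({ Id: "w2", BlockType: "WORD", Text: "world" }));
const line = new LineSketch(
  { Id: "l1", BlockType: "LINE", Relationships: [{ Type: "CHILD", Ids: ["w1", "w2"] }] },
  page
);
const lineText = line.listWords().map((w) => w.text).join(" ");
```

In this sketch a relationship pointing at a missing block ID only fails when `listWords()` is called, not when the line is constructed. That is one way the timing of errors for invalid JSON could shift under a design like this.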
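The `Table.nCells` fix can be pictured with a small standalone function. This is a sketch only: it assumes a table block lists its sub-cells under `CHILD` relationships and its merged cells under `MERGED_CELL` relationships, with each merged cell listing the sub-cells it covers as `CHILD`. The names `TableBlockLike` and `countCellsAfterMerges` are hypothetical, not the library's implementation.

```typescript
// Minimal block shape for this sketch (assumed, not the library's API model).
interface TableBlockLike {
  Id: string;
  BlockType: string;
  Relationships?: { Type: string; Ids: string[] }[];
}

// Collect the IDs from every relationship of the given type on a block.
function idsOfType(block: TableBlockLike, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships || []) {
    if (rel.Type === type) ids.push(...rel.Ids);
  }
  return ids;
}

// Count cells as a reader would see them: each merged cell counts once,
// and the sub-cells it covers are not counted separately.
function countCellsAfterMerges(
  table: TableBlockLike,
  blocksById: Map<string, TableBlockLike>
): number {
  const subCellIds = idsOfType(table, "CHILD");
  const mergedIds = idsOfType(table, "MERGED_CELL");
  const covered = new Set<string>();
  for (const mergedId of mergedIds) {
    const merged = blocksById.get(mergedId);
    if (!merged) continue; // Tolerate dangling references in this sketch.
    for (const id of idsOfType(merged, "CHILD")) covered.add(id);
  }
  const unmerged = subCellIds.filter((id) => !covered.has(id)).length;
  return unmerged + mergedIds.length;
}

// Usage: a 2x2 table whose top row is merged into a single cell.
const mergedTop: TableBlockLike = {
  Id: "m1",
  BlockType: "MERGED_CELL",
  Relationships: [{ Type: "CHILD", Ids: ["c1", "c2"] }],
};
const table: TableBlockLike = {
  Id: "t1",
  BlockType: "TABLE",
  Relationships: [
    { Type: "CHILD", Ids: ["c1", "c2", "c3", "c4"] },
    { Type: "MERGED_CELL", Ids: ["m1"] },
  ],
};
const blocksById = new Map<string, TableBlockLike>([["m1", mergedTop]]);
const cellCount = countCellsAfterMerges(table, blocksById); // 3, versus 2 * 2 = 4 sub-cells
```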
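For the alternative Forms block types, one possible way to treat both shapes uniformly is sketched below. It assumes that `KEY_VALUE_SET` blocks carry an `EntityTypes` array containing `KEY` or `VALUE`, while the alternative blocks use `KEY` or `VALUE` directly as their `BlockType`. `KvBlockLike` and `kvRole` are hypothetical names, and this is not necessarily how the library resolves them.

```typescript
// Minimal block shape for this sketch (assumed fields only).
interface KvBlockLike {
  Id: string;
  BlockType: string;
  EntityTypes?: string[];
}

type KvRole = "KEY" | "VALUE" | null;

// Decide whether a block acts as a form key, a form value, or neither.
function kvRole(block: KvBlockLike): KvRole {
  // Alternative shape: the role is the block type itself.
  if (block.BlockType === "KEY") return "KEY";
  if (block.BlockType === "VALUE") return "VALUE";
  // Typical shape: the role is carried in EntityTypes.
  if (block.BlockType === "KEY_VALUE_SET") {
    const types = block.EntityTypes || [];
    if (types.indexOf("KEY") >= 0) return "KEY";
    if (types.indexOf("VALUE") >= 0) return "VALUE";
  }
  return null;
}

// Usage: both shapes resolve to the same role.
const typicalKey = kvRole({ Id: "k1", BlockType: "KEY_VALUE_SET", EntityTypes: ["KEY"] });
const alternativeKey = kvRole({ Id: "k2", BlockType: "KEY" });
```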
Testing done:
- Unit tests expanded to (mostly...) cover new features. Haven't gone overboard on Layout testing yet as there's potential to shift it around based on alpha feedback.
- Extended postbuild IIFE test to validate that expected submodules are accessible from the global `trp` object.
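A check of the kind described in the last item could look roughly like the sketch below. It walks dotted paths on an object and reports any that are missing. The helper `findMissingPaths` and the submodule names `moduleA`, `moduleB` and `moduleC` are placeholders for illustration; the actual test and the real submodule names are not shown in this description.

```typescript
// Return every dotted path that cannot be resolved from the given root object.
function findMissingPaths(root: unknown, expectedPaths: string[]): string[] {
  const missing: string[] = [];
  for (const path of expectedPaths) {
    let current: unknown = root;
    for (const key of path.split(".")) {
      const canHoldKeys =
        current !== null && (typeof current === "object" || typeof current === "function");
      if (!canHoldKeys || !(key in (current as object))) {
        missing.push(path);
        break;
      }
      current = (current as Record<string, unknown>)[key];
    }
  }
  return missing;
}

// Usage with a stand-in for the global scope (placeholder submodule names).
const fakeGlobal = {
  trp: {
    moduleA: {},
    moduleB: { helper: function (): void {} },
  },
};
const missingPaths = findMissingPaths(fakeGlobal, [
  "trp.moduleA",
  "trp.moduleB.helper",
  "trp.moduleC",
]);
```

Here `missingPaths` contains only `trp.moduleC`, so a test built on this idea would fail and name the export that is not reachable.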
Issue #, if available: #164, #171
Initial alpha release available at amazon-textract-response-parser v0.4.0-alpha.3 on NPM - please share your feedback and we'll aim to push to stable release soon!
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.