aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

[Enhancement] Merge layout-aware and generative model components #28

Open athewsey opened 1 year ago

athewsey commented 1 year ago

As of #26, users can train generative models to normalize entity text after extraction, for example to standardize date or currency formats or to correct common OCR error patterns.

This is not ideal, though: the normalization model sees only the extracted text itself, without the surrounding context (which could, for example, give locale cues about whether a date is more likely to be MM/DD/YYYY or DD/MM/YYYY).
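To make the ambiguity concrete, here is a minimal sketch using only the Python standard library. It is not code from this repository, and `candidate_dates` is a hypothetical helper name chosen for illustration. It shows that a string such as `03/04/2021` parses validly under both conventions, so a model that sees only the extracted text has no basis for choosing between them.

```python
from datetime import datetime

# Hypothetical helper (not part of this repo): list every ISO date that a
# slash-separated string could denote under the two competing conventions.
def candidate_dates(text):
    results = {}
    for label, fmt in (("MM/DD/YYYY", "%m/%d/%Y"), ("DD/MM/YYYY", "%d/%m/%Y")):
        try:
            results[label] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # The text is not a valid date under this convention; skip it.
            pass
    return results

# Both readings are valid, so the text alone cannot settle the format.
ambiguous = candidate_dates("03/04/2021")

# Only one reading is valid here, because there is no 25th month.
unambiguous = candidate_dates("25/04/2021")
```

Only when the day value exceeds 12 does the text disambiguate itself; otherwise the cue has to come from elsewhere in the document, which is what the layout-aware model has access to and the standalone normalization model does not.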

It would be better if we could merge a generative output directly onto the layout-aware model and fine-tune it for normalized extraction in a single step.