This PR introduces trainable sequence-to-sequence models for generative tasks like OCR error correction or field re-formatting (e.g. date normalization).
Enable new 'seq2seq' task type for the training script
Add synthetic data demo on date format normalization with T5/ByT5 (e.g. prompt Convert dates to YYYY-MM-DD: Dec 31 1999 to yield 1999-12-31)
Simplify the output format of the custom (boxes + OCR transcript reviews) Ground Truth task UI, and PoC training the seq2seq model with SMGT annotations instead of plain src/target text (Generates prompts from class name, raw text, corrected text e.g. Normalize Agreement Effective Date: Dec 31 1999 to yield 1999-12-31, or however labelers have standardised the field)
Some important caveats:
Since HF LayoutLM-family models don't have a generator stack, this MVP experiments only with plain-text models (e.g. ByT5) and does not pull through raw OCR word bounding boxes / page images. Maybe users could hack together a model by splicing LayoutLM together with a pre-trained generator and fine-tuning? Or perhaps it wouldn't work without pre-training more extensively / from-scratch.
If you had a layout-aware/multi-modal model, you'd probably want to prompt it with more of the page context - not just treat entity extraction and text normalization as two separate tasks.
The date normalization demo generates 1k+ synthetic examples to achieve the ~94% accuracy it does... The field normalization seq2seq task seems likely to be quite data-hungry compared to the effort of annotating pages with transcription reviews?
Testing done:
Re-verified notebooks on an existing environment (didn't re-build from scratch)
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Issue #, if available: N/A
Description of changes:
This PR introduces trainable sequence-to-sequence models for generative tasks like OCR error correction or field re-formatting (e.g. date normalization).
Convert dates to YYYY-MM-DD: Dec 31 1999
to yield1999-12-31
)Normalize Agreement Effective Date: Dec 31 1999
to yield1999-12-31
, or however labelers have standardised the field)Some important caveats:
Testing done:
Re-verified notebooks on an existing environment (didn't re-build from scratch)
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.