RFC0138: Training a T5-like Model for Post-Correction of OCR Output
Named Concepts
Explain any new concepts introduced in this request.
Summary
Write a brief summary about the task you are going to complete.
Dependencies
Include all the dependencies you are going to use while implementing.
TensorFlow
NLTK
SpaCy
FastAPI
Infrastructures
Include all the infrastructure required for running the task, such as S3 bucket, EC2 server, etc.
Design Illustrations
Include all pictorial representations of your implementation, such as flowcharts, ER diagrams, etc.
Justification
Add justification for the strategies you have proposed above.
The rule-based model offers deterministic corrections for straightforward cases and is adept at addressing common error patterns, including errors tied to specific characters and certain font styles. This is particularly beneficial in cases where T5 may struggle to produce a correction because of the complexity of the Tibetan script.
T5 (Text-To-Text Transfer Transformer)
Text-To-Text Framework
T5 can be fine-tuned to treat OCR correction as a text-to-text task.
The rule-based model and the T5 model run concurrently in a parallelised architecture.
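The parallel arrangement described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the two rules in `RULES`, the function names, and the `model_correct` placeholder (which stands in for the fine-tuned T5 model and returns its input unchanged) are all hypothetical. The RFC does not specify how the two outputs are merged, so the sketch simply returns both.

```python
import re
from concurrent.futures import ThreadPoolExecutor

TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# Hypothetical example rules (not taken from the RFC): each pairs a
# compiled pattern for a suspected OCR error with its replacement.
RULES = [
    (re.compile(TSHEG + "{2,}"), TSHEG),  # collapse repeated tsheg marks
    (re.compile(r"[ ]{2,}"), " "),        # collapse repeated spaces
]


def rule_based_correct(text: str) -> str:
    """Apply every rule in order; the result is fully deterministic."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def model_correct(text: str) -> str:
    """Placeholder for the fine-tuned T5 model (assumption).

    A real implementation would tokenize the text, run generation,
    and decode the output. Here the input is returned unchanged.
    """
    return text


def correct_in_parallel(text: str) -> dict:
    """Run both correctors concurrently and return both outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rule_future = pool.submit(rule_based_correct, text)
        model_future = pool.submit(model_correct, text)
        return {
            "rule_based": rule_future.result(),
            "model": model_future.result(),
        }
```

Returning both candidates keeps the merge policy (for example, preferring the rule-based output when a rule fires) as a separate decision that can be tested on its own.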
Why was the currently proposed design selected over alternatives?
What would be the impact of going with one of the alternative approaches?
Testing
Describe the kind of testing procedures that are needed as part of fulfilling this request.
Integration testing
Rule-Based Model Testing
T5 Model Testing
Performance Metrics Evaluation
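The RFC does not name a specific metric. Character error rate (CER) is one common choice for OCR post-correction, and a minimal sketch of it is shown here as an assumption rather than the chosen method. The function names are illustrative. Note that this version counts Unicode code points, so a stacked Tibetan syllable contributes several characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance divided by the reference length, in code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(prediction, reference) / len(reference)
```

Comparing the CER of the raw OCR output against the CER of the corrected output on the same reference gives a direct measure of whether the correction step helps.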
Implementation Steps
List all the steps involved during implementation.
Prepare dataset
Preprocess
Analysis (optional)
Define rules to identify potential errors
Fine-tune T5 model on labelled OCR text
Rule-based system
Evaluate model
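The dataset preparation and preprocessing steps above might look something like the following sketch, which turns (OCR text, corrected text) pairs into text-to-text examples. The task prefix string, the `Example` type, and the function name are assumptions made for illustration. The RFC does not specify a data format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical task prefix; T5-style models are commonly trained with a
# short prefix that names the task, but this exact string is an assumption.
TASK_PREFIX = "correct ocr: "


@dataclass(frozen=True)
class Example:
    source: str  # model input: prefix + noisy OCR text
    target: str  # expected output: the corrected (labelled) text


def build_examples(pairs: Iterable[Tuple[str, str]]) -> List[Example]:
    """Convert (ocr_text, gold_text) pairs into text-to-text examples.

    Pairs with an empty side after stripping whitespace are skipped,
    since they carry no usable training signal.
    """
    examples = []
    for ocr_text, gold_text in pairs:
        ocr_text = ocr_text.strip()
        gold_text = gold_text.strip()
        if not ocr_text or not gold_text:
            continue
        examples.append(Example(TASK_PREFIX + ocr_text, gold_text))
    return examples
```

The resulting `source` and `target` strings would then be tokenized and passed to whichever fine-tuning loop is used.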
[ ] PR 1
Estimated time: Data collection (2 weeks)
Actual time:
[ ] PR 2
Estimated time: Fine-tuning T5 model (1 week)
Actual time:
[ ] PR 3
Estimated time: Validation on unseen data, accuracy improvements, and deployment (1 week)
Actual time:
Reviewed By
Who has reviewed the RFC?