OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

RFC0138: Training a T5-like model for post-correction of OCR output #419

Open Chotso opened 8 months ago

Chotso commented 8 months ago

RFC0138: Training a T5-like model for post-correction of OCR output

Named Concepts

Explain any new concepts introduced in this request.

Summary

Write a brief summary about the task you are going to complete.

Dependencies

Include all the dependencies you are going to use while implementing.

Infrastructures

Include all the infrastructure required for running the task, such as S3 bucket, EC2 server, etc.

Design Illustrations

Include all the pictorial representation of your implementation, such as flowchart, ER diagram, etc.

[Figure: OCR correction workflow]

Justification

Add justification for the strategies you have proposed above.

Why was the currently proposed design selected over alternatives?

What would be the impact of going with one of the alternative approaches?

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Implementation Steps

List all the steps involved during implementation.

  1. Prepare dataset
  2. Preprocess
  3. Analysis (optional)
    • predefined rules to identify potential errors
  4. Fine-tune T5 model on OCR text (labelled text)
  5. Rule-based system
  6. Evaluate model
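
The steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the implementation proposed in this RFC. It covers pairing OCR lines with their labels (steps 1-2), flagging suspicious lines with rules (step 3), and scoring output with character error rate (step 6). Everything named here is an assumption made for illustration: the `correct ocr: ` task prefix, the two example rules, the helper names, and the choice of character error rate as the metric. The fine-tuning step itself is left out because it depends on the model and library that get chosen.

```python
import json
import re

# Hypothetical task prefix; T5-style models are commonly prompted with a text prefix.
TASK_PREFIX = "correct ocr: "

# Hypothetical rules for step 3; real rules would come from error analysis.
RULES = {
    # Latin letters or ASCII digits inside text expected to be Tibetan script.
    "latin_or_digit": re.compile(r"[A-Za-z0-9]"),
    # Two or more consecutive tsheg marks (U+0F0B).
    "doubled_tsheg": re.compile("\u0f0b{2,}"),
}


def make_example(ocr_text, gold_text):
    """Steps 1-2: pair one OCR line with its corrected label."""
    return {"input": TASK_PREFIX + ocr_text.strip(), "target": gold_text.strip()}


def to_jsonl(examples):
    """Serialize examples as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


def flag_errors(text):
    """Step 3: return the names of the rules that match the text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(prediction, reference):
    """Step 6: character error rate = edit distance / reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return edit_distance(prediction, reference) / len(reference)
```

The same `cer` function can be run on the raw OCR text and on the model output against the same labelled reference, so the two scores show whether post-correction actually helps. Note that it counts Unicode code points, so a stacked Tibetan syllable contributes several characters; a syllable-level metric may be worth adding alongside it.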

Reviewed By

Who has reviewed the RFC?