ta4tsering commented 4 months ago

RFC133: Creating a benchmark dataset for OCR

Named Concepts

Summary

Create a benchmark dataset for OCR from the already-transcribed line images by randomly sampling line images from each batch or work.
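A minimal sketch of the per-batch random sampling described above. It is not the actual implementation: the record layout `(batch_id, image_path, transcript)`, the function name `sample_benchmark`, and the `per_batch` and `seed` parameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_benchmark(records, per_batch, seed=0):
    """Randomly pick up to `per_batch` transcribed lines from each batch.

    `records` is an iterable of (batch_id, image_path, transcript) tuples.
    This layout is hypothetical; adapt it to however the transcribed
    line images are actually catalogued.
    """
    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible

    # Group the transcribed lines by the batch (or work) they belong to.
    by_batch = defaultdict(list)
    for batch_id, image_path, transcript in records:
        by_batch[batch_id].append((image_path, transcript))

    benchmark = []
    for batch_id in sorted(by_batch):  # sorted for a stable iteration order
        lines = by_batch[batch_id]
        k = min(per_batch, len(lines))  # small batches contribute all their lines
        for image_path, transcript in rng.sample(lines, k):
            benchmark.append((batch_id, image_path, transcript))
    return benchmark
```

Sampling the same number of lines from every batch keeps a single large batch from dominating the benchmark, and the fixed seed means the same selection can be regenerated later.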

Dependencies

Include all the dependencies you are going to use while implementing.

Infrastructures

Include all the infrastructure required for running the task, such as S3 bucket, EC2 server, etc.

Design Illustrations

Justification

This is the best method for now, as it will be much quicker to draw the benchmark data from the already-transcribed data than to have new images transcribed from scratch.

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Implementation Steps

[x] OpenPecha/Create_OCR_benchmark_data#1 Estimated time: 1 hr. Actual time: 1 hr
[x] OpenPecha/Create_OCR_benchmark_data#2 Estimated time: 4 hr. Actual time: 4 hr
[ ] OpenPecha/Create_OCR_benchmark_data#3 Estimated time: 2 hr. Actual time:
[ ] OpenPecha/Create_OCR_benchmark_data#4 Estimated time: 6 hr. Actual time:
[ ] OpenPecha/Create_OCR_benchmark_data#5 Estimated time: 6 hr. Actual time:
[ ] OpenPecha/Create_OCR_benchmark_data#6 Estimated time: 8 hr. Actual time:

Reviewed By

@kaldan007

kaldan007 commented 4 months ago

@ta4tsering we need to create a separate benchmark for each writing style. We can't mix Ume into the Uchen benchmark, as it wouldn't be fair. Other than that, it looks good to me.

ta4tsering commented 3 months ago

Okay, will make it for Uchen only.
