OCR0032: Consolidation of Modern Print data

OpenPecha / OCR-data-consolidation

MIT License

OCR0032: Consolidation of Modern Print data #2

Open ta4tsering opened 3 months ago

ta4tsering commented 3 months ago

Description: For the modern print data, we have the Norbuketaka data and the Google Books data. Both need to be added to the S3 bucket where the Woodblock data has already been uploaded.

Completion Criteria: Both the Google Books data and the Norbuketaka data are uploaded.

Subtasks:

[x] Download the Google Books data and convert the TIFF images to JPEG
[x] Upload the Google Books data to OCR/Training_Images on S3
[x] Create the CSV files and upload them to Hugging Face
- [x] Google Books
- [x] Norbuketaka
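
The conversion and upload-path steps in the checklist could be sketched roughly as follows. This is only an illustration, not the script used for this issue: it assumes Pillow is installed for the conversion, and the `Google_Books` subfolder, the function names, and the file name in the usage line are hypothetical. Only the `OCR/Training_Images` prefix comes from the checklist.

```python
from pathlib import Path, PurePosixPath

# Prefix named in the subtask list; everything below it is an assumed layout.
S3_PREFIX = "OCR/Training_Images"


def s3_key_for(tif_path: Path, collection: str = "Google_Books") -> str:
    """Return the S3 key for the JPEG derived from a TIFF file.

    The per-collection subfolder is an assumption made for this sketch.
    """
    return str(PurePosixPath(S3_PREFIX) / collection / (tif_path.stem + ".jpg"))


def convert_tif_to_jpg(tif_path: Path, out_dir: Path, quality: int = 90) -> Path:
    """Convert one TIFF image to JPEG and return the path of the new file."""
    # Imported lazily so the key-planning helper works without Pillow installed.
    from PIL import Image

    out_dir.mkdir(parents=True, exist_ok=True)
    jpg_path = out_dir / (tif_path.stem + ".jpg")
    with Image.open(tif_path) as img:
        # JPEG has no alpha channel or 1-bit mode, so normalise to RGB first.
        img.convert("RGB").save(jpg_path, "JPEG", quality=quality)
    return jpg_path


# Hypothetical file name, used only to show the resulting key.
print(s3_key_for(Path("scans/page_0001.tif")))
```

The actual upload would then send each converted file to the key returned by `s3_key_for` with whichever S3 client is in use.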

ta4tsering commented 3 months ago

The Google Books images are being uploaded, and the CSV for Google Books has been created. For the script type and the print_method, I used the work_id to fetch BDRC's TTL file and parsed it.
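
A minimal sketch of that metadata step, assuming the TTL text for a work has already been fetched. The predicate names (`bdo:script`, `bdo:printMethod`), the sample values, the work ID, and the CSV column names are illustrative assumptions, not taken from the actual script. The regex lookup is a simplification that stands in for a real Turtle parser.

```python
import csv
import io
import re

# Illustrative TTL snippet; identifiers and values are assumptions for this sketch.
SAMPLE_TTL = """
bdr:MW0000 a bdo:Instance ;
    bdo:printMethod bdr:PrintMethod_Modern ;
    bdo:script bdr:ScriptTibt .
"""

# Assumed CSV columns; the published datasets may use different names.
FIELDS = ["image", "work_id", "script", "print_method"]


def first_object(ttl_text, predicate):
    """Return the first object following `predicate`, or None if absent.

    A plain regex lookup, not a full Turtle parse; it only handles simple
    `predicate object ;` statements like the ones in SAMPLE_TTL.
    """
    match = re.search(re.escape(predicate) + r"\s+([^\s;,.]+)", ttl_text)
    return match.group(1) if match else None


def build_row(image_name, work_id, ttl_text):
    """Combine an image name with metadata pulled from the work's TTL."""
    return {
        "image": image_name,
        "work_id": work_id,
        "script": first_object(ttl_text, "bdo:script"),
        "print_method": first_object(ttl_text, "bdo:printMethod"),
    }


def to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = build_row("page_0001.jpg", "W0000", SAMPLE_TTL)
print(to_csv([row]))
```

For production use, a proper RDF library would be more robust than the regex, since Turtle allows multi-line statements, comments, and full IRIs that this lookup does not handle.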

ta4tsering commented 3 months ago

Google Books dataset: https://huggingface.co/datasets/ta4tsering/Google_Books_datasets

Norbuketaka dataset: https://huggingface.co/datasets/ta4tsering/Norbuketaka_datasets