OCR0028: Formatting line segmentation data

jim-gyas commented 3 weeks ago

Description: The training data we currently have for line segmentation is too limited, so we need to gather more of it from our existing data.

Resources:

Line segmentations from the HTR team on AWS
Line segmentation from Google Books data (Google Books OCR output and images)
Transkribus data for two works

Related Card: Google Books data creation

Completion Criteria: All the training data for line segmentation is in a single format. Possible formats:

JSONL format
XML format

Implementation plan

Sub Task

[x] 1. Set Up Credentials (Configure AWS, Google Drive, and Transkribus access credentials.)
[x] 2. Download Data (Download line segmentation data from AWS (HTR team), Google Drive (Google Books OCR), and Transkribus.)
[x] 3. Extract Files (Extract HTML files from downloaded archives.)
[x] 4. Parse HTML Files (Write scripts to parse HTML files and extract line segmentation data for Google Books OCR.)
[x] 5. Convert HTML Data to Single Format (Convert parsed HTML data into JSONL or XML format for Google Books OCR.)
[x] 6. Parse XML Files (Write scripts to parse XML files and extract line segmentation data for Transkribus and HTR team data.)
[x] 7. Convert XML Data to Single Format (Convert parsed XML data into JSONL or XML format for Transkribus and HTR team data.)
[ ] 8. Normalize and Merge Data (Normalize data for consistency and merge datasets from all sources.)
[ ] 9. Validate and Store Final Dataset (Validate the merged dataset and store it in a designated folder.)
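
Steps 6 and 7 could be sketched roughly as follows for PAGE XML, the format Transkribus commonly exports. This is a minimal sketch, not the scripts used for this card: the function name `page_xml_to_record`, the sample document, and the output keys are assumptions (the keys mirror the `id` / `image` / `spans` / `points` layout of the HTR team record posted further down in this thread), and the HTR team XML may follow a different schema.

```python
import json
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_record(xml_text, image_url=""):
    """Turn one PAGE XML document into a dict with one span per TextLine."""
    root = ET.fromstring(xml_text)
    record = {"id": "", "image": image_url, "spans": []}
    for elem in root.iter():
        name = _local(elem.tag)
        if name == "Page":
            record["id"] = elem.get("imageFilename", "")
        elif name == "TextLine":
            coords = next((c for c in elem if _local(c.tag) == "Coords"), None)
            if coords is None or not coords.get("points"):
                continue  # skip lines that carry no polygon
            points = [
                [int(v) for v in pair.split(",")]
                for pair in coords.get("points").split()
            ]
            record["spans"].append({"id": elem.get("id", ""), "points": points})
    return record


# Hypothetical one-line page, used only to exercise the function.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Coords points="701,950 701,955 807,955 807,950"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

record = page_xml_to_record(SAMPLE)
print(json.dumps(record))  # one JSONL line per page
```

Matching on local tag names rather than a fixed namespace keeps the sketch usable across PAGE schema versions, at the cost of not validating the document against any of them.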

jim-gyas commented 2 weeks ago

@ta4tsering and @kaldan007, Transkribus has several data collections available. Based on my review, here are my findings:

The collections marked with a check (✓) are confirmed and will be parsed. The collections marked with a circle (○) are under consideration because of doubts about their language suitability. The collections marked with a cross (✗) cannot be accessed because they return a 404 error. Please see the attached screenshot for a visual reference of the collections.

kaldan007 commented 2 weeks ago

@jim-gyas you can ignore all the doubtful ones except derge-kangyur

jim-gyas commented 5 days ago

Example of Google Books XML Data Format for Line Segmentation.

jim-gyas commented 5 days ago

Example of Google Books JSONL Data Format for Line Segmentation

jim-gyas commented 4 days ago

Example of HTR Team XML Data Format for Line Segmentation

`<?xml version="1.0" ?>

HTR Team

`

jim-gyas commented 4 days ago

Example of HTR Team JSONL Data Format for Line Segmentation

{"id": "Correction-7_IMG_4305.jpg", "image": "https://s3.amazonaws.com/monlam.ai.ocr/line_segmentations/Images/Correction-7_IMG_4305.jpg", "spans": [{"id": "b6b09c07-5c1a-495f-86d0-7d2b7b8b7284", "height": 5, "width": 106, "center": [754.0, 952.5], "points": [[701, 950], [701, 955], [807, 955], [807, 950]]}, {"id": "d4b3469e-76ab-4d44-a4be-5b3bfa007b36", "height": -120, "width": 217, "center": [791.5, 1107.0], "points": [[683, 1167], [683, 1047], [900, 1047], [900, 1167]]}, {"id": "1f8387f9-f213-42c9-bae3-3c77cc318128", "height": -185, "width": 295, "center": [844.5, 1416.5], "points": [[697, 1509], [697, 1324], [992, 1324], [992, 1509]]}, {"id": "6a1fcfa2-6f9c-4ff7-b4d1-750c6094e421", "height": 9, "width": 97, "center": [1317.5, 3026.5], "points": [[1269, 3022], [1269, 3031], [1366, 3031], [1366, 3022]]}]}
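
In the record above, two spans carry a negative `height` (-120 and -185); the sign seems to follow the order in which the polygon points are listed. One possible normalization for step 8 is to recompute `width`, `height`, and `center` from the bounding box of `points`. A minimal sketch, assuming the field names shown in the record; `normalize_span` is a hypothetical helper, not part of the existing scripts.

```python
def normalize_span(span):
    """Return a copy of span with non-negative width/height and a recomputed center."""
    xs = [x for x, _ in span["points"]]
    ys = [y for _, y in span["points"]]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    fixed = dict(span)  # leave the input untouched
    fixed["width"] = x_max - x_min
    fixed["height"] = y_max - y_min
    fixed["center"] = [(x_min + x_max) / 2, (y_min + y_max) / 2]
    return fixed


# Second span of the record above, copied verbatim.
span = {
    "id": "d4b3469e-76ab-4d44-a4be-5b3bfa007b36",
    "height": -120,
    "width": 217,
    "center": [791.5, 1107.0],
    "points": [[683, 1167], [683, 1047], [900, 1047], [900, 1167]],
}
normalized = normalize_span(span)
print(normalized["height"], normalized["width"], normalized["center"])
# 120 217 [791.5, 1107.0]
```

Deriving the values from `points` rather than flipping the sign keeps the three derived fields consistent with the polygon even if a source computed them differently.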

jim-gyas commented 3 days ago

The number of image files equals the number of XML files in the HTR team data.
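
Equal counts do not by themselves show that every image has its own XML file, so a name-based check may be worth adding. A rough sketch, assuming each image and its XML file share a base name and sit in two local folders; the folder layout, the extension set, and the helper name `unpaired_files` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed extensions


def unpaired_files(image_dir, xml_dir):
    """Return (images without XML, XML without image) as sorted base names."""
    images = {
        p.stem for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_EXTS
    }
    xmls = {p.stem for p in Path(xml_dir).iterdir() if p.suffix.lower() == ".xml"}
    return sorted(images - xmls), sorted(xmls - images)


# Small demonstration on throwaway folders with made-up extra files.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "Images"
    xml_dir = Path(tmp) / "XML"
    image_dir.mkdir()
    xml_dir.mkdir()
    (image_dir / "Correction-7_IMG_4305.jpg").touch()
    (xml_dir / "Correction-7_IMG_4305.xml").touch()
    (image_dir / "only_image.jpg").touch()
    (xml_dir / "only_xml.xml").touch()
    missing_xml, missing_image = unpaired_files(image_dir, xml_dir)

print(missing_xml, missing_image)
```

In this demonstration both folders hold two files, so the counts match even though one image and one XML file have no partner.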

ta4tsering commented 8 hours ago

currently working on Transkribus data, will be done tomorrow.

OpenPecha / Formatting_line_segmentation