OpenPecha / Formatting_line_segmentation

MIT License

OCR0028: Formatting line segmentation data #1

Open jim-gyas opened 3 weeks ago

jim-gyas commented 3 weeks ago

Description: We currently have too little training data for line segmentation, so we need to gather more line segmentation training data from our existing data.

Resources:

  1. Line segmentations from HTR team on aws
  2. Line segmentation from Google Books data (Google Books OCR output and images)
  3. Transkribus data of two works

Related Card: Google Books data creation

Completion Criteria: All the training data for line segmentation is available in a single format. Possible formats:

  1. JSONL format
  2. XML format

Implementation plan

Image

Sub Task

jim-gyas commented 2 weeks ago

@ta4tsering and @kaldan007, there are several data collections available in Transkribus. Based on my review, here are my findings:

The collections marked with a check (✓) are confirmed and will be parsed. The collections marked with a circle (○) are under consideration because their language may not be suitable. The collections marked with a cross (✗) cannot be accessed because they return a 404 error. Please see the attached screenshot for a visual reference of the collections.

Image

kaldan007 commented 2 weeks ago

@jim-gyas you can ignore all the doubtful collections except derge-kangyur

jim-gyas commented 5 days ago

Example of Google Books XML Data Format for Line Segmentation

Image

jim-gyas commented 5 days ago

Example of Google Books JSONL Data Format for Line Segmentation

Image

jim-gyas commented 4 days ago

Example of HTR Team XML Data Format for Line Segmentation

`<?xml version="1.0" ?>

HTR Team

`

jim-gyas commented 4 days ago

Example of HTR Team JSONL Data Format for Line Segmentation

{"id": "Correction-7_IMG_4305.jpg", "image": "https://s3.amazonaws.com/monlam.ai.ocr/line_segmentations/Images/Correction-7_IMG_4305.jpg", "spans": [{"id": "b6b09c07-5c1a-495f-86d0-7d2b7b8b7284", "height": 5, "width": 106, "center": [754.0, 952.5], "points": [[701, 950], [701, 955], [807, 955], [807, 950]]}, {"id": "d4b3469e-76ab-4d44-a4be-5b3bfa007b36", "height": -120, "width": 217, "center": [791.5, 1107.0], "points": [[683, 1167], [683, 1047], [900, 1047], [900, 1167]]}, {"id": "1f8387f9-f213-42c9-bae3-3c77cc318128", "height": -185, "width": 295, "center": [844.5, 1416.5], "points": [[697, 1509], [697, 1324], [992, 1324], [992, 1509]]}, {"id": "6a1fcfa2-6f9c-4ff7-b4d1-750c6094e421", "height": 9, "width": 97, "center": [1317.5, 3026.5], "points": [[1269, 3022], [1269, 3031], [1366, 3031], [1366, 3022]]}]}
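Some spans in the record above carry a negative `height` (for example -120 and -185), which seems to depend on the order in which the four `points` were stored. Below is a minimal sketch of how one JSONL line could be read and each span normalized to a box with non-negative size. The function names `normalize_span` and `load_record` and the output keys `left` and `top` are hypothetical, not part of the existing data format, and the sample line is shortened to a single span taken from the record above.

```python
import json


def normalize_span(span):
    """Rebuild an axis-aligned box from the four corner points.

    Assumption: `points` always describes a rectangle, so the min/max of the
    x and y coordinates give its bounds regardless of corner order.
    """
    xs = [x for x, _ in span["points"]]
    ys = [y for _, y in span["points"]]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    return {
        "id": span["id"],
        "left": left,
        "top": top,
        "width": right - left,    # always >= 0
        "height": bottom - top,   # always >= 0
        "center": [(left + right) / 2, (top + bottom) / 2],
    }


def load_record(line):
    """Parse one JSONL line and normalize every span in it."""
    record = json.loads(line)
    record["spans"] = [normalize_span(s) for s in record["spans"]]
    return record


# Shortened sample: one span copied from the record above.
SAMPLE_LINE = (
    '{"id": "Correction-7_IMG_4305.jpg", "spans": [{"id": '
    '"d4b3469e-76ab-4d44-a4be-5b3bfa007b36", "height": -120, "width": 217, '
    '"center": [791.5, 1107.0], "points": [[683, 1167], [683, 1047], '
    '[900, 1047], [900, 1167]]}]}'
)

if __name__ == "__main__":
    print(load_record(SAMPLE_LINE)["spans"][0])
```

Recomputing the size from `points` instead of trusting the stored `height` keeps the normalization independent of how each source ordered its corners.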

jim-gyas commented 3 days ago

The number of image files equals the number of XML files in the HTR Team data

Image
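Equal counts do not by themselves show that every image has its own XML file, so a per-file check may be useful. The sketch below assumes that an image and its XML file share the same base name (for example `page_01.jpg` and `page_01.xml`), which has not been confirmed for the HTR Team data. The function name `unmatched_files` and the directory paths in the usage comment are hypothetical.

```python
from pathlib import Path


def unmatched_files(image_dir, xml_dir, image_suffixes=(".jpg", ".jpeg", ".png")):
    """Return (images without XML, XML files without image) as sorted base names.

    Assumption: files are paired by base name, ignoring the extension.
    """
    image_stems = {
        p.stem for p in Path(image_dir).iterdir()
        if p.suffix.lower() in image_suffixes
    }
    xml_stems = {
        p.stem for p in Path(xml_dir).iterdir()
        if p.suffix.lower() == ".xml"
    }
    return sorted(image_stems - xml_stems), sorted(xml_stems - image_stems)


# Usage with hypothetical local paths:
# missing_xml, missing_image = unmatched_files("htr_team/images", "htr_team/xml")
# print(len(missing_xml), len(missing_image))
```

If both returned lists are empty, the counts match and every file has a partner.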

ta4tsering commented 8 hours ago

Currently working on the Transkribus data; it will be done tomorrow.