Hello!
Thank you for your contribution!
I adapted the description to follow HTR-United's schema (and took the liberty of computing a little more metadata). Can you confirm that the metadata below is correct?
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_text
url: https://github.com/OCR-D/gt_structure_text
authors:
  - name: Matthias
    surname: Boenig
    orcid: 0000-0003-4615-4753
    roles:
      - transcriber
      - aligner
      - project-manager
      - quality-control
      - digitization
      - support
institutions: []
description: >-
  The OCR-D Ground Truth text and structure corpus was created between 2015
  and 2017. Since 2017, this corpus has been further curated and supplemented
  with metadata where appropriate. The corpus includes PAGE XML files with
  annotations of the text and structure. The data is based
  on transcription data stored in the German Text Archive (DTA)
  (https://www.deutschestextarchiv.de/).
project-name: OCR-D
project-website: https://ocr-d.de/
language:
  - eng
  - fra
  - deu
  - heb
  - lat
production-software: Aletheia
automatically-aligned: false
script:
  - iso: Latn
  - iso: Goth
script-type: only-typed
time:
  notAfter: '1900'
  notBefore: '1500'
hands:
  count: less-than-11
  precision: exact
license:
  name: CC-BY-SA 4.0
  url: https://creativecommons.org/licenses/by-sa/4.0/
format: Page-XML
volume:
  - count: 640976
    metric: characters
  - count: 217
    metric: files
  - count: 6608
    metric: lines
  - count: 1647
    metric: regions
citation-file-link: https://raw.githubusercontent.com/OCR-D/gt_structure_text/main/CITATION.cff
transcription-guidelines: OCR-D Ground Truth Guidelines https://ocr-d.de/en/gt-guidelines/trans/
characters:
  members:
    - e
    - t
    - /
    - a
    - c
    - '0'
    - n
    - r
    - m
    - h
    - p
    - s
    - o
    - g
    - '1'
    - '2'
    - f
    - '7'
    - '9'
    - E
    - .
    - i
    - '-'
    - '5'
    - '4'
    - d
    - <
    - l
    - '{'
    - ':'
    - P
    - A
    - G
    - '}'
    - U
    - x
    - '>'
    - '3'
    - '8'
    - '6'
    - b
  mode: NFD
Hello,
Thanks for the additions. However, I do not understand the characters and members section.
The characters/members entry is simply the character set present in the ground truth. It was generated with Chocomufin (https://github.com/alix-tz/gt_structure_text/actions/runs/7916917450/job/21611834392).
Also, I am not sure the choice of "Latin" and "Gothic" is correct to describe the script. In the ISO norm, "Gothic" refers to a different script (https://en.m.wikipedia.org/wiki/Gothic_alphabet). I remember discussions with Tobias Hodel which led us to include a script specifier for cases where the script is Latin, complemented with a specifier such as "fraktur" (see https://github.com/HTR-United/schema/issues/4, and also note that I am not a specialist of this font specifically).
Oh, I just realized that Chocomufin doesn't support PAGE XML yet. My bad, that explains the weird character set. Let's leave the character set out of the entry for now. Is the rest OK for you?
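In case it is useful in the meantime, here is a rough sketch of how the character set could be collected directly from PAGE XML with the Python standard library. This is not what Chocomufin does; the function names and the tiny sample document are made up for illustration, and the sketch assumes the transcription sits in the line-level TextEquiv/Unicode elements. It normalizes to NFD to match the mode used in the entry above.

```python
import unicodedata
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_xml_charset(xml_text, form="NFD"):
    """Return the sorted non-whitespace characters found in TextLine transcriptions."""
    root = ET.fromstring(xml_text)
    chars = set()
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Read only the TextEquiv directly under the line, so region-level
        # or word-level copies of the same text are not read as well.
        for equiv in line:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            for uni in equiv:
                if local_name(uni.tag) == "Unicode" and uni.text:
                    chars.update(unicodedata.normalize(form, uni.text))
    return sorted(c for c in chars if not c.isspace())


# Hypothetical, stripped-down sample (a real PAGE file also carries Coords etc.).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.tif" imageWidth="10" imageHeight="10">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>\u00dcber</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>\u00dcber</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

print(page_xml_charset(SAMPLE))
```

With NFD, the precomposed "Ü" in the sample is split into "U" plus a combining diaeresis, so both show up as separate members. Running the function over all files and taking the union of the results would give a candidate list for characters/members.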
Hi @alix-tz, sorry for the long wait. The metadata set is OK without the character set from Chocomufin. You can see my proposal for character documentation and mapping to the transcription and structure levels at: https://ocr-d.de/gt_structure_text/overview-level Thank you.
Hi @alix-tz, I'm just wondering why the data set has not yet been included in the catalog. Are there still errors? Should I make corrections? I hope that the record will be added soon.
Best regards from Berlin, tboenig
@tboenig I suggest preparing a pull request. On your fork, create a directory ocr-d under https://github.com/HTR-United/htr-united/tree/master/catalog and put your metadata into a file gt_structure_text.yml there (assuming we will later put many more files from other OCR-D datasets under that same directory). Then commit on a new branch and open a PR from that branch against upstream.
linked to #141
Hello! Thank you for your ground truth repository and catalog.
Regards tboenig
Here is our dataset YAML file: