HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Adding dataset gt_structure_text #137

Closed tboenig closed 6 months ago

tboenig commented 7 months ago

Hello! Thank you for your ground truth repository and catalog.

Regards, tboenig

Here is our dataset YAML file:

schema: https://tboenig.github.io/gt-metadata/schema/2023-10-25/schema.json
title: gt_structure_text
url: https://github.com/OCR-D/gt_structure_text
authors:
  - name: Matthias
    surname: Boenig
    orcid: 0000-0003-4615-4753
    roles:
      - transcriber
      - aligner
      - project-manager
      - quality-control
      - digitization
      - support
description: >-
  The OCR-D Ground Truth text and structure corpus was created between 2015
  and 2017. In the years since 2017, this corpus has been further curated and
  supplemented with metadata where appropriate. The corpus includes PAGE XML
  files with annotations of the text and structure. The data is based
  on transcription data stored in the German Text Archive (DTA)
  (https://www.deutschestextarchiv.de/).
project-name: OCR-D
project-website: https://ocr-d.de/
language:
  - eng
  - fra
  - deu
  - heb
  - lat
production-software: Aletheia
script:
  - Latn
  - Goth
script-type: print
time:
  notBefore: '1500'
  notAfter: '1900'
hands:
  count: '3'
  level: levelmix
license:
  - name: CC-BY-SA 4.0
    url: https://creativecommons.org/licenses/by-sa/4.0/
gtType: data_structure_and_text
format: Page-XML
citation-file-link: https://github.com/OCR-D/gt_structure_text/blob/main/CITATION.cff
transcription-guidelines: OCR-D Ground Truth Guidelines https://ocr-d.de/en/gt-guidelines/trans/

alix-tz commented 7 months ago

Hello!

Thank you for your contribution!

I adapted the description to follow HTR-United's schema (and took the liberty to compute a little more metadata). Can you confirm the metadata in the following is correct?

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: gt_structure_text
url: https://github.com/OCR-D/gt_structure_text
authors:
- name: Matthias
  surname: Boenig
  orcid: 0000-0003-4615-4753
  roles:
  - transcriber
  - aligner
  - project-manager
  - quality-control
  - digitization
  - support
institutions: []
description: >-
  The OCR-D Ground Truth text and structure corpus was created between
  2015 and 2017. In the years since 2017, this corpus has been further curated and
  supplemented with metadata where appropriate. The corpus includes PAGE XML
  files with annotations of the text and structure. The data is based
  on transcription data stored in the German Text Archive (DTA)
  (https://www.deutschestextarchiv.de/).
project-name: OCR-D
project-website: https://ocr-d.de/
language:
- eng
- fra
- deu
- heb
- lat
production-software: Aletheia
automatically-aligned: false
script:
- iso: Latn
- iso: Goth
script-type: only-typed
time:
  notAfter: '1900'
  notBefore: '1500'
hands:
  count: less-than-11
  precision: exact
license:
  name: CC-BY-SA 4.0
  url: https://creativecommons.org/licenses/by-sa/4.0/
format: Page-XML
volume:
- count: 640976
  metric: characters
- count: 217
  metric: files
- count: 6608
  metric: lines
- count: 1647
  metric: regions
citation-file-link: https://raw.githubusercontent.com/OCR-D/gt_structure_text/main/CITATION.cff
transcription-guidelines: OCR-D Ground Truth Guidelines https://ocr-d.de/en/gt-guidelines/trans/
characters:
  members:
  - e
  - t
  - /
  - a
  - c
  - '0'
  - n
  - r
  - m
  - h
  - p
  - s
  - o
  - g
  - '1'
  - '2'
  - f
  - '7'
  - '9'
  - E
  - .
  - i
  - '-'
  - '5'
  - '4'
  - d
  - <
  - l
  - '{'
  - ':'
  - P
  - A
  - G
  - '}'
  - U
  - x
  - '>'
  - '3'
  - '8'
  - '6'
  - b
  mode: NFD
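
For readers comparing the two YAML blocks in this thread, the visible field changes can be sketched in Python as follows. This only illustrates the differences shown above and is not the tooling that was actually used; `adapt_entry` is a hypothetical helper, the `< 11` threshold is an assumption read off the bucket name `less-than-11`, and fields such as `hands.level`, `institutions`, and `volume` are left untouched.

```python
def adapt_entry(entry: dict) -> dict:
    """Apply the field changes visible between the two YAML blocks above."""
    adapted = dict(entry)  # shallow copy; the input mapping is not modified

    # script: plain ISO 15924 codes become a list of {"iso": code} mappings
    if "script" in entry:
        adapted["script"] = [{"iso": code} for code in entry["script"]]

    # script-type: "print" was rewritten as "only-typed" in this thread
    if entry.get("script-type") == "print":
        adapted["script-type"] = "only-typed"

    # license: a one-item list becomes a single mapping
    licenses = entry.get("license")
    if isinstance(licenses, list) and len(licenses) == 1:
        adapted["license"] = licenses[0]

    # hands.count: the exact count '3' was replaced by "less-than-11"
    # (assumption: any count below 11 falls into that bucket)
    if "hands" in entry:
        hands = dict(entry["hands"])
        count = str(hands.get("count", ""))
        if count.isdigit() and int(count) < 11:
            hands["count"] = "less-than-11"
        adapted["hands"] = hands

    return adapted


submitted = {
    "script": ["Latn", "Goth"],
    "script-type": "print",
    "hands": {"count": "3", "level": "levelmix"},
    "license": [{"name": "CC-BY-SA 4.0",
                 "url": "https://creativecommons.org/licenses/by-sa/4.0/"}],
}
adapted = adapt_entry(submitted)
print(adapted["script"])  # [{'iso': 'Latn'}, {'iso': 'Goth'}]
```

The helper copies rather than mutates, so the submitted mapping stays available for comparison.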

tboenig commented 7 months ago

Hello,

Thanks for the additions. However, I do not understand the characters/members section.

alix-tz commented 7 months ago

The characters/members entry is simply the character set present in the ground truth. It was generated with Chocomufin (https://github.com/alix-tz/gt_structure_text/actions/runs/7916917450/job/21611834392).

alix-tz commented 7 months ago

Also, I am not sure the choice of "Latin" and "Gothic" is correct to describe the script. In the ISO norm, "Gothic" refers to a different type of script (https://en.m.wikipedia.org/wiki/Gothic_alphabet). I remember discussions with Tobias Hodel which led us to include a specifier for script, for cases where the script would be Latin, complemented with a specifier such as "fraktur" (see https://github.com/HTR-United/schema/issues/4, and also note that I am not a specialist of this font specifically).
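
To make the distinction concrete, here is a small Python lookup of the ISO 15924 codes involved. The three code-to-name pairs are taken from the ISO 15924 code list; `describe_scripts` is a hypothetical helper, not part of HTR-United's tooling, and whether `Latf` or `Latn` plus a specifier is the better fit for this dataset is left open.

```python
# A few ISO 15924 script codes relevant to this discussion.
ISO_15924_NAMES = {
    "Latn": "Latin",
    "Latf": "Latin (Fraktur variant)",
    "Goth": "Gothic",  # the Gothic alphabet, not blackletter type
}


def describe_scripts(codes):
    """Return the ISO 15924 name for each code, or a marker if not listed here."""
    return [ISO_15924_NAMES.get(code, "unknown code") for code in codes]


print(describe_scripts(["Latn", "Goth"]))  # ['Latin', 'Gothic']
```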

alix-tz commented 7 months ago

Oh, I just realized that Chocomufin doesn't support PAGE XML yet. My bad, that explains the weird character set. Let's leave the character set out of the entry for now. Is the rest OK for you?
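
For reference, a character set for PAGE XML has to be collected from the transcription text rather than from the markup. A minimal Python sketch, assuming the transcription sits in `Unicode` elements as in the PAGE format; this is not Chocomufin's implementation, and `page_xml_charset` and the sample document are hypothetical:

```python
import unicodedata
import xml.etree.ElementTree as ET


def page_xml_charset(xml_text: str, form: str = "NFD") -> set:
    """Collect the normalized characters found in all <Unicode> elements."""
    root = ET.fromstring(xml_text)
    chars = set()
    for element in root.iter():
        # Strip the namespace so any PAGE schema version is matched
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Unicode" and element.text:
            chars.update(unicodedata.normalize(form, element.text))
    return chars


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>für</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(sorted(page_xml_charset(SAMPLE)))
```

With NFD normalization, as in the `mode: NFD` entry above, "ü" is counted as "u" plus a combining diaeresis. Because the result is a set, text repeated at region, line, and word level does not distort it.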

tboenig commented 7 months ago

Hi @alix-tz, sorry for the long wait. The metadata set is OK without the character set from Chocomufin. You can see my proposal for character documentation and its mapping to the transcription and structure levels at: https://ocr-d.de/gt_structure_text/overview-level Thank you.

tboenig commented 7 months ago

Hi @alix-tz, I'm just wondering why the data set has not yet been included in the catalog. Are there still errors? Should I make corrections? I hope the record will be added soon.

Best regards from Berlin, tboenig

bertsky commented 7 months ago

@tboenig I suggest preparing a pull request. On your fork, create a directory ocr-d under https://github.com/HTR-United/htr-united/tree/master/catalog and put your metadata into a file gt_structure_text.yml there (assuming many more files from other OCR-D datasets will go into that same directory later). Then commit on a new branch and open a PR from that branch against upstream.
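
The file-placement step can be sketched in Python as follows. This is only an illustration: `place_catalog_entry` is a hypothetical helper, the repository root is whatever local path your fork is cloned to, and creating the branch, committing, and opening the PR remain ordinary git and GitHub steps that are not covered here.

```python
import tempfile
from pathlib import Path


def place_catalog_entry(repo_root, metadata_yaml: str) -> Path:
    """Write the metadata to catalog/ocr-d/gt_structure_text.yml under repo_root."""
    target = Path(repo_root) / "catalog" / "ocr-d" / "gt_structure_text.yml"
    target.parent.mkdir(parents=True, exist_ok=True)  # creates ocr-d if missing
    target.write_text(metadata_yaml, encoding="utf-8")
    return target


# Demonstration against a temporary directory instead of a real clone
with tempfile.TemporaryDirectory() as tmp:
    path = place_catalog_entry(tmp, "title: gt_structure_text\n")
    print(path.relative_to(tmp).as_posix())  # catalog/ocr-d/gt_structure_text.yml
```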

alix-tz commented 6 months ago

linked to #141