HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Adding dataset ARletta #148

Closed: lithlefranc closed this issue 2 weeks ago

lithlefranc commented 1 month ago

Hello!

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: ARletta
url: https://zenodo.org/records/11191457
authors:
  - name: Lith
    surname: Lefranc
  - name: Ilja
    surname: Van Damme
  - name: Thibault
    surname: Clérice
  - name: Mike
    surname: Kestemont
institutions:
  - name: University of Antwerp
  - name: National Institute for Research in Digital Science and Technology, Paris
description: Open-source handwritten text recognition models for historic Dutch
project-name: Bias in History
project-website: https://www.bias-in-history.eu/
language:
  - nld
  - fra
production-software: eScriptorium + Kraken
automatically-aligned: false
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1600'
  notAfter: '1940'
hands:
  count: more-than-10
  precision: estimated
license:
  name: CC-BY-SA 4.0
  url: https://creativecommons.org/licenses/by-sa/4.0/
format: Page-XML
volume:
  - metric: lines
    count: 431359
  - metric: regions
    count: 44536
  - metric: pages
    count: 10267
  - metric: characters
    count: 14253206
transcription-guidelines: |
  Diplomatic transcription: all of the text was transcribed verbatim, preserving all of its original features:
  - orthography: preserve original spelling
  - abbreviations: do not expand abbreviations
  - capitalization: retain original use of uppercase and lowercase letters
  - punctuation: transcribe punctuation marks exactly as they appear, even if they are unconventional by modern standards
  - special characters: include any special characters or symbols as they appear
  - formatting: maintain original formatting such as underlining or strikethrough
  - errors and corrections: include all errors and corrections found in the text
  - non-interpretative: avoid interpreting or modernizing the text
  - use the '@' symbol for characters you cannot read and tag them as 'unclear' at baseline level
  - tag marginal text as 'marginalia' and main body text as 'paragraph' at region level
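
For anyone reviewing a card like this one, a few quick sanity checks can be run before it goes into the catalog. The sketch below is only an illustration and not part of the HTR-United tooling: the helper names (`volume_ratios`, `time_range_is_ordered`, `is_absolute_url`) are hypothetical, the values are copied by hand from the card rather than parsed from the YAML, and the idea that the `url` field should be an absolute URL is an assumption rather than something checked against the schema.

```python
from urllib.parse import urlparse

# Values copied by hand from the dataset card (no YAML parser is used,
# so this sketch runs with the standard library only).
VOLUME = {
    "lines": 431359,
    "regions": 44536,
    "pages": 10267,
    "characters": 14253206,
}
TIME = {"notBefore": "1600", "notAfter": "1940"}


def volume_ratios(volume):
    """Return average densities derived from the raw counts."""
    return {
        "characters_per_line": volume["characters"] / volume["lines"],
        "lines_per_page": volume["lines"] / volume["pages"],
        "regions_per_page": volume["regions"] / volume["pages"],
    }


def time_range_is_ordered(time):
    """True when notBefore does not come after notAfter (years as strings)."""
    return int(time["notBefore"]) <= int(time["notAfter"])


def is_absolute_url(url):
    """True when the URL carries both a scheme and a host (assumed requirement)."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


if __name__ == "__main__":
    for name, value in volume_ratios(VOLUME).items():
        print(f"{name}: {value:.2f}")
    print("time range ordered:", time_range_is_ordered(TIME))
    print("absolute URL:", is_absolute_url("https://zenodo.org/records/11191457"))
```

Implausible ratios (for example, far fewer characters per line than a handwritten line could hold) would suggest that one of the counts was entered under the wrong metric. A URL written without a scheme, such as `zenodo.org/records/11191457`, fails the last check.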

alix-tz commented 2 weeks ago

Hello Lith,

Thank you very much for this contribution and sorry for the late response! I have created a PR corresponding to the addition of the dataset card in the catalog.

Regarding the description of the transcription guidelines, I think the description could be improved. Could you provide more details or refer to a transcription rulebook published somewhere else?

I have a remark that is not linked to the addition to HTR-United: on the Zenodo repo, you mention twice "/datasets/antw-expert: the image files and preprocessed transcription files for the Antwerp data (annotated by the expert);". Is the repository missing something or is it a typo?

lithlefranc commented 2 weeks ago

Hello Alix,

Thanks for your remark on the Zenodo repository. That is indeed a typo, and I have corrected it. Regarding the transcription guidelines: do the adjustments suffice?

Thank you! Best wishes, Lith

alix-tz commented 2 weeks ago

Hello,

Perfect, I just updated the yml file and merged the entry to the catalog :) Thank you again for your contribution!