HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Adding dataset Belfort #118

Closed starride-teklia closed 1 year ago

starride-teklia commented 1 year ago

Hi!

We would like to share a dataset from the Belfort City Council.

Transcriptions are in .txt format. Is this acceptable to you? We have up to four transcriptions for each text line (two from annotators, two from automatic models), and I am not sure whether this is compatible with the PAGE XML format.

The aim of this dataset is to explore strategies for data selection and model training when multiple uncertain transcriptions are available (see our paper).

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
  - name: Solène
    surname: Tarride
    orcid: 0000-0001-6174-9865
  - name: Tristan
    surname: Faine
  - name: Mélodie
    surname: Boillet
    orcid: 0000-0002-0618-7852
  - name: Harold
    surname: Mouchère
    orcid: 0000-0001-6220-7216
  - name: Christopher
    surname: Kermorvant
    orcid: 0000-0002-7508-4080
institutions: []
description: >-
  This dataset includes minutes of Belfort municipal council drawn up between
  1790 and 1946. Documents include deliberations, lists of councillors,
  convocations, and agendas. The dataset includes 24,105 text-line images that
  were automatically detected from pages. Up to 4 transcriptions are available
  for each line image: two from humans, and two from automatic models.
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
  - fra
production-software: Callico
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1790'
  notAfter: '1946'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
  - reference: >-
      Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
      Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
      Recognition from Crowdsourced Annotations [Data set]. 7th International
      Workshop on Historical Document Imaging and Processing (HIP'23), San
      José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
    link: ''
volume:
  - metric: lines
    count: 24105

PonteIneptique commented 1 year ago

Hi @starride-teklia! I think we have already accepted line-level datasets. I need to check why the form does not offer this option.

Would you be so kind as to clarify in your description where the ground truth lies in the Transcriptions folder? That would allow people to use the dataset more easily, without being surprised by the structure of the zip.

PonteIneptique commented 1 year ago

Using this information, I will count the character volume and add your dataset to HTR-United.
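
Counting the character volume could be done along these lines. This is a minimal sketch, not the script HTR-United actually uses: it assumes the transcriptions are UTF-8 `.txt` files under a single folder, and both the function name `count_characters` and the choice to ignore trailing newlines are illustrative assumptions.

```python
from pathlib import Path


def count_characters(folder: str) -> int:
    """Sum the number of characters in every .txt file under `folder`.

    Assumptions: files are UTF-8 encoded, and trailing newlines
    are not counted as part of the transcription.
    """
    total = 0
    for path in sorted(Path(folder).rglob("*.txt")):
        text = path.read_text(encoding="utf-8").rstrip("\n")
        total += len(text)
    return total
```

Calling `count_characters` on one sub-folder of `Transcriptions` would give the volume for a single transcription source. Counting the whole folder at once would add up every available transcription of each line.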

starride-teklia commented 1 year ago

Hi @PonteIneptique, thanks for your very quick reply!

Here is the YAML file with the updated description, I hope it is clearer this way:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
  - name: Solène
    surname: Tarride
    orcid: 0000-0001-6174-9865
  - name: Tristan
    surname: Faine
  - name: Mélodie
    surname: Boillet
    orcid: 0000-0002-0618-7852
  - name: Harold
    surname: Mouchère
    orcid: 0000-0001-6220-7216
  - name: Christopher
    surname: Kermorvant
    orcid: 0000-0002-7508-4080
institutions: []
description: >
  This dataset includes minutes of Belfort municipal council drawn up between
  1790 and 1946. Documents include deliberations, lists of councillors,
  convocations, and agendas. The dataset includes 24,105 text-line images that
  were automatically detected from pages. 

  Up to four transcriptions are available for each line image: 

  * two from human annotators (in `Transcriptions/callico_1/` and
  `Transcriptions/callico_2/`)

  * two from automatic models (in `Transcriptions/dan/` and
  `Transcriptions/pylaia/`) 
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
  - fra
production-software: Callico
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1790'
  notAfter: '1946'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
  - reference: >-
      Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
      Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
      Recognition from Crowdsourced Annotations [Data set]. 7th International
      Workshop on Historical Document Imaging and Processing (HIP'23), San
      José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
    link: ''
volume:
  - metric: lines
    count: 24105
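
For readers who want to use the folder layout described above, here is a minimal sketch of how the transcriptions for each line could be collected. The four sub-folder names come from the description. The file naming is an assumption that the description does not confirm: one UTF-8 `.txt` file per line, named after a line ID shared across sub-folders. The function name `load_transcriptions` is illustrative.

```python
from pathlib import Path

# Sub-folder names taken from the dataset description above.
SOURCES = ["callico_1", "callico_2", "dan", "pylaia"]


def load_transcriptions(root: str) -> dict[str, dict[str, str]]:
    """Map each line ID to the transcriptions found for it, keyed by source.

    Assumption (not confirmed by the description): each sub-folder holds one
    UTF-8 .txt file per line, named after a line ID shared across sub-folders.
    """
    lines: dict[str, dict[str, str]] = {}
    for source in SOURCES:
        folder = Path(root) / source
        if not folder.is_dir():
            continue  # tolerate a missing sub-folder
        for path in sorted(folder.glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            lines.setdefault(path.stem, {})[source] = text
    return lines
```

With this layout, `load_transcriptions("Transcriptions")` returns a dictionary in which a line holds between one and four entries, which matches the "up to four transcriptions" wording of the description.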

alix-tz commented 1 year ago

Hello! Thank you for your contribution!

We will have to change the value in the format field, since the data is not PAGE XML but pairs of line images and text transcriptions.

@PonteIneptique : It will have an impact on the schema because in the current definition, we only allow these 2 values:

    "format": {
        "description": "Format of the ground truth",
        "type": "string",
        "enum": ["Alto-XML", "Page-XML"]
    },

I think it's time to open a new issue in the schema!
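
To illustrate why the current definition blocks this dataset, here is a small sketch of the constraint quoted above. It is a hand-rolled check of the `type` and `enum` keywords only, not a full JSON Schema validator. The helper name `is_allowed` and the value `"txt"` are hypothetical and only serve the example; they say nothing about which value the schema will eventually accept.

```python
# The "format" property as quoted above from the schema.
FORMAT_PROPERTY = {
    "description": "Format of the ground truth",
    "type": "string",
    "enum": ["Alto-XML", "Page-XML"],
}


def is_allowed(value: object, prop: dict) -> bool:
    """Check a value against the two keywords used here: type and enum."""
    if prop.get("type") == "string" and not isinstance(value, str):
        return False
    return "enum" not in prop or value in prop["enum"]


print(is_allowed("Page-XML", FORMAT_PROPERTY))  # True
# "txt" is a hypothetical placeholder for a line/text format value.
print(is_allowed("txt", FORMAT_PROPERTY))  # False
```

Any value outside the two listed ones fails the check, so a dataset made of line images and text files cannot be described accurately until the enum is extended.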

PonteIneptique commented 1 year ago

It's now possible ;) I'll make the PR