HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Adding dataset EpiSearch (Astori’s letters) #121

federico-boschetti closed this issue 1 year ago

federico-boschetti commented 1 year ago

Hello! We are glad to send you the metadata for the dataset described at https://doi.org/10.5281/zenodo.7719291

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: EpiSearch HTR
url: https://github.com/vedph/episearch-htr
authors:
  - name: Lorenzo
    surname: Calvelli
    orcid: 0000-0002-0920-9156
    roles:
      - project-manager
  - name: Tatiana
    surname: Tommasi
    orcid: 0009-0000-2815-0113
    roles:
      - transcriber
  - name: Federico
    surname: Boschetti
    orcid: 0000-0002-7810-7735
    roles:
      - support
institutions: []
description: Ground Truth for Astori’s letters (see the README.md file for details)
project-name: EpiSearch
project-website: https://github.com/vedph/episearch-htr
language:
  - ita
production-software: eScriptorium + Kraken
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1705'
  notAfter: '1709'
hands:
  count: '1'
  precision: exact
license:
  - name: CC-BY-SA 4.0
    url: https://creativecommons.org/licenses/by-sa/4.0/
format: Alto-XML
volume:
  - metric: files
    count: 34

alix-tz commented 1 year ago

Hello @federico-boschetti!

Thank you for your contribution! I made #122 to add the dataset description to the catalog.

I have two questions regarding the dataset:

  1. I saw that some lines are not segmented or transcribed. It's not a problem, but I just wanted to make sure it is intentional.

  2. Regarding the organization of the repository, I think it would be easier for users if you put all the JPEG and XML files in a data/ folder instead of having them all at the root level (as we suggested in the template). Do you think you could do this?
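The reorganization described in point 2 could be sketched roughly as follows. This is only an illustration, not a script from the repository: the function name `move_into_data` and the extension list are assumptions, so adjust them to the files actually present.

```python
from pathlib import Path
import shutil

# Assumption: images and transcriptions sit at the repository root
# and use these extensions; adjust to match the actual files.
EXTENSIONS = {".jpg", ".jpeg", ".xml"}


def move_into_data(repo_root: Path, folder: str = "data") -> list[Path]:
    """Move image and XML files from repo_root into repo_root/folder."""
    target = repo_root / folder
    target.mkdir(exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(repo_root.iterdir()):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            destination = target / path.name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

Calling `move_into_data(Path("."))` from the repository root would leave files such as README.md in place. In a Git repository, `git mv` may be preferable so that the moves are recorded as renames.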

Otherwise, as far as the description is concerned, it's all good for merging.
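For the first question, a minimal sketch of how segmented but untranscribed lines might be listed in an ALTO file. It assumes the usual ALTO layout, where each `TextLine` holds `String` elements with a `CONTENT` attribute; the function names are hypothetical. It cannot detect lines that were never segmented, since those have no `TextLine` element at all.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def untranscribed_lines(alto_xml: str) -> list[str]:
    """Return the IDs of TextLine elements whose String CONTENT is empty."""
    root = ET.fromstring(alto_xml)
    empty = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Concatenate the CONTENT of every String child of this line.
        text = "".join(
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        )
        if not text.strip():
            empty.append(element.get("ID", "<no id>"))
    return empty
```

Running this over each XML file in the dataset would give a per-page list of line IDs to review.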

federico-boschetti commented 1 year ago

Hello @alix-tz! Thank you for your feedback.

  1. Omissions are intentional (introductory formulae and signatures were over-represented and lowered the performance of the training);
  2. I created the "data" directory and filled it with images and XML files, as you suggested.

alix-tz commented 1 year ago

Awesome! I just confirmed the addition of the description of the dataset to the catalog.

Thank you again!