HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Adding dataset Ground Truth data for printed Devanagari #89

Closed nidame closed 1 year ago

nidame commented 1 year ago

Hello! I'd like to include the metadata for my GT dataset on HTR-United. The ALTO XML files and the images are archived at FID4SA@heiDATA, the research data repository of Heidelberg University. The DOI of the dataset is included in the metadata. Hope it works! Please get in touch if there are any questions. Best wishes, Nicole

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Ground truth data for printed Devanagari
url: https://doi.org/10.11588/data/EGOKEI
authors:
  - name: Nicole
    surname: Merkel-Hilf
    orcid: 0000-0002-0344-6169
    roles:
      - transcriber
      - project-manager
institutions: Heidelberg University Library
description: >-
  Ground truth (GT) data (jpg and alto xml files) for an OCR model that
  recognizes printed text in Devanagari script.

  A model was trained on the GT data in Transkribus with the HTR+ engine. The
  training was performed on approx. 220 pages with approx. 27,000 words. The
  validation set was 10% of the training set.

  The training material consists of letterpress printings from the Naval
  Kishore Press (Lakhnau, North India) from the late 19th and early 20th century
  in the Hindi, Sanskrit, Braj Bhasha and Awadhi languages.

  Transcription was performed by Nicole Merkel-Hilf (CATS Library / Heidelberg
  University Library) with support from Daria Peshcherova (CATS Library /
  Heidelberg University Library).
project-name: Naval Kishore Press - digital
project-website: https://digi.ub.uni-heidelberg.de/en/sammlungen/suedasien/navalkishore.html
language:
  - hin
  - san
  - bra
production-software: Transkribus
script:
  - iso: Deva
script-type: only-typed
time:
  notBefore: '1880'
  notAfter: '1953'
hands:
  count: less-than-11
  precision: exact
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
  - metric: lines
    count: 4333
transcription-guidelines: Diplomatic transcription, no correction of misspellings
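
A record like the one above can be sanity-checked before submission. The following is only a minimal sketch using the Python standard library: the `check_record` helper and its rules are hypothetical and are not the official HTR-United schema validation, and the dictionary is a hand-copied subset of the record rather than a parsed YAML file.

```python
import re

# Hand-copied subset of the record above; a real check would parse the YAML
# file itself (for example with PyYAML, which is not in the standard library).
record = {
    "title": "Ground truth data for printed Devanagari",
    "url": "https://doi.org/10.11588/data/EGOKEI",
    "language": ["hin", "san", "bra"],
    "script": [{"iso": "Deva"}],
    "time": {"notBefore": "1880", "notAfter": "1953"},
    "volume": [{"metric": "lines", "count": 4333}],
}


def check_record(rec):
    """Return a list of problems (hypothetical checks, not the official schema)."""
    problems = []
    # Assumption: language entries are three-letter lowercase codes.
    for code in rec.get("language", []):
        if not re.fullmatch(r"[a-z]{3}", code):
            problems.append(f"unexpected language code: {code!r}")
    # Assumption: script entries are four-letter ISO 15924 codes such as 'Deva'.
    for entry in rec.get("script", []):
        if not re.fullmatch(r"[A-Z][a-z]{3}", entry.get("iso", "")):
            problems.append(f"unexpected script code: {entry!r}")
    # The date range should be ordered.
    time = rec.get("time", {})
    if int(time.get("notBefore", 0)) > int(time.get("notAfter", 0)):
        problems.append("notBefore is later than notAfter")
    # Volume counts should be positive integers.
    for vol in rec.get("volume", []):
        if not isinstance(vol.get("count"), int) or vol["count"] <= 0:
            problems.append(f"bad volume count for metric {vol.get('metric')!r}")
    return problems


problems = check_record(record)
print(problems or "no problems found")
```

Validating against the JSON schema referenced in the `schema` field would be the more thorough route; this sketch only catches obvious slips such as a reversed date range.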

alix-tz commented 1 year ago

Hello! Thank you very much!

It looks pretty good to me :)

alix-tz commented 1 year ago

@ponteineptique will we be able to use HUMG on this dataset?

PonteIneptique commented 1 year ago

If it's PageXML, yes absolutely :)

nidame commented 1 year ago

It's ALTO XML, but I can also export PAGE XML from the Transkribus website if necessary.

PonteIneptique commented 1 year ago

ALTO XML is even better for HUMG :)

PonteIneptique commented 1 year ago

(Next time I'll read the proposed record before commenting)

nidame commented 1 year ago

:-))

nidame commented 1 year ago

@PonteIneptique @alix-tz Hi, I just wanted to ask if you could include the metadata of the Devanagari GT in the HTR-United catalogue. Couldn't find it when searching. And I've got new data - GT for the South Indian script Malayalam provided by Tuebingen University Library. Would you be interested in that as well? If yes, I'll write a new issue. Best wishes Nicole

alix-tz commented 1 year ago

Hello, I just checked the content of the dataset in Malayalam script and it looks good, so yes, it would be really interesting to add it. Can you open another issue for it?

Just a note: importing the PAGE XML into eScriptorium works, but importing the ALTO does not (one piece of information is missing from the file exported by Transkribus), so can you make sure to keep the PAGE version in the dataset?

nidame commented 1 year ago

Before I start a new issue, could you please give me an update on the Devanagari dataset I submitted in November?

alix-tz commented 1 year ago

I think this issue can be closed; the remaining discussion about the second dataset will continue in #104.