[catalog] New repo {POPP/POPP-datasets}

Shulk97 commented 2 years ago

Dataset description

Checklist

[X] the corpus name is stated explicitly
[X] the project name is stated explicitly
[X] the authors and their roles are stated explicitly
[X] a license is associated with the dataset
[X] the dataset is clearly and explicitly described, so that other users can understand its content and the context of its creation
[X] the dataset uses standard formats such as PAGE XML or ALTO XML, and the transcriptions are aligned with images

Important information

Dataset name¹: The POPP datasets
Project name²: POPP

Description generated with our form:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: The POPP datasets
url: https://zenodo.org/record/6581158
authors:
  - name: Thomas
    surname: Constum
    roles:
      - aligner
      - quality-control
      - support
  - name: Nicolas
    surname: Kempf
  - name: Pierrick
    surname: Tranouez
  - name: Thierry
    surname: Paquet
    roles:
      - project-manager
  - name: Sandra
    surname: Brée
    orcid: 0000-0002-2802-5563
    roles:
      - transcriber
      - project-manager
  - name: François
    surname: Merveille
    roles:
      - transcriber
institutions: []
description: >-
  The POPP datasets are a set of 3 datasets created within the POPP project
  (Project for the Oceration of the Paris Population Census) for the task of
  handwriting text recognition. These datasets have been published in
  "Recognition and information extraction in historical handwritten tables:
  toward understanding early 20th century Paris census" at DAS 2022.

  The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée
  d’Antin” and contain lines made from the extracted rows of census tables from
  1926. Each table in the Paris census contains 30 rows, thus each page in these
  datasets corresponds to 30 lines.
project-name: Project for the Oceration of the Paris Population Census
project-website: https://popp.hypotheses.org
language:
  - fra
production-software: Pytorch
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1926'
  notAfter: '1926'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
  - metric: lines
    count: 7050
transcription-guidelines: >
  The text is transcribed as in the image (no correction of misspellings, no
  resolution of abbreviations).

  Since the lines are extracted from table rows, we defined 4 special characters
  to describe the structure of the text:
    ¤ : indicates an empty cell
    / : indicates the separation into columns
    ? : indicates that the content of the cell following this symbol is written above the regular baseline
    ! : indicates that the content of the cell following this symbol is written below the regular baseline
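
To illustrate how the four special characters in the transcription guidelines could be consumed downstream, here is a minimal Python sketch that splits one transcribed line back into table cells. It is not taken from the POPP code base. It assumes that `?` and `!` appear as the first character of the cell they qualify, that `/` never occurs inside cell content, and that whitespace around the markers is not significant. The names `Cell` and `parse_row` and the sample line are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Special characters defined in the transcription guidelines.
EMPTY = "¤"      # empty cell
SEPARATOR = "/"  # column separator
ABOVE = "?"      # following cell content is written above the baseline
BELOW = "!"      # following cell content is written below the baseline


@dataclass
class Cell:
    text: str
    position: str  # "baseline", "above", or "below"


def parse_row(line: str) -> list[Cell]:
    """Split one transcribed line into cells.

    Assumption: a position marker, when present, is the first
    character of the cell it applies to.
    """
    cells = []
    for raw in line.split(SEPARATOR):
        raw = raw.strip()
        if raw == EMPTY:
            cells.append(Cell("", "baseline"))
        elif raw.startswith(ABOVE):
            cells.append(Cell(raw[1:].strip(), "above"))
        elif raw.startswith(BELOW):
            cells.append(Cell(raw[1:].strip(), "below"))
        else:
            cells.append(Cell(raw, "baseline"))
    return cells


# Invented sample line, not taken from the datasets.
for cell in parse_row("Dupont/?Jean/¤/!1898"):
    print(cell)
```

If the markers turn out to be placed differently in the real transcriptions (for example before the separator rather than after it), only the branch conditions in `parse_row` would need to change.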

Autonomy

Check the applicable situation:

[X] I know how to make a Pull Request and I will create the folder and file for my repository under "htr-united/catalog/"
[ ] I do not know how to make a Pull Request; I need help adding a description of my dataset under "htr-united/catalog/"

1: This name will be used to create the YAML file dedicated to the dataset. For example: if your dataset is called "Mon Super Dataset", its description will be saved as "mon-super-dataset.yml"

2: This name will be used to create a folder in "catalog/" that will contain all the descriptions of the datasets linked to this project. For example: if your project is called "Mon Super Projet", the YAML file(s) will be saved under "catalog/mon-super-projet/"
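
The naming convention in footnotes 1 and 2 can be sketched in a few lines of Python. The exact slug rules used by HTR-United are not stated in this issue, so lowercasing, stripping accents, and replacing every run of non-alphanumeric characters with a hyphen are assumptions chosen to reproduce the two examples given. The helper names `slugify` and `catalog_path` are hypothetical.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lowercase a name and join its words with hyphens.

    Assumption: accents are stripped and any run of characters
    other than a-z and 0-9 becomes a single hyphen.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def catalog_path(project: str, dataset: str) -> str:
    """Build the path of a dataset description inside the catalog."""
    return f"catalog/{slugify(project)}/{slugify(dataset)}.yml"


print(catalog_path("Mon Super Projet", "Mon Super Dataset"))
# prints catalog/mon-super-projet/mon-super-dataset.yml
```

Under these assumptions, the project and dataset named in this issue would map to `catalog/popp/the-popp-datasets.yml`; the path actually used in the catalog may differ.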

alix-tz commented 2 years ago

Thank you very much for this contribution!

alix-tz commented 2 years ago

done with #75
